Biometrics. 2017 Sep;73(3):811-821.
doi: 10.1111/biom.12647. Epub 2017 Jan 18.

Statistical significance for hierarchical clustering

Patrick K Kimes et al. Biometrics. 2017 Sep.

Abstract

Cluster analysis has proved to be an invaluable tool for the exploratory and unsupervised analysis of high-dimensional datasets. Among methods for clustering, hierarchical approaches have enjoyed substantial popularity in genomics and other fields for their ability to simultaneously uncover multiple layers of clustering structure. A critical and challenging question in cluster analysis is whether the identified clusters represent important underlying structure or are artifacts of natural sampling variation. Few approaches have been proposed for addressing this problem in the context of hierarchical clustering, for which the problem is further complicated by the natural tree structure of the partition, and the multiplicity of tests required to parse the layers of nested clusters. In this article, we propose a Monte Carlo based approach for testing statistical significance in hierarchical clustering which addresses these issues. The approach is implemented as a sequential testing procedure guaranteeing control of the family-wise error rate. Theoretical justification is provided for our approach, and its power to detect true clustering structure is illustrated through several simulation studies and applications to two cancer gene expression datasets.

Keywords: High-dimension; Hypothesis testing; Multiple correction; Unsupervised learning.

Figures

Figure 1
Hierarchical clustering applied to 5 observations. (A) Scatterplot of the observations. (B) The corresponding dendrogram. This figure appears in color in the electronic version of this article.
Figure 2
The SHC testing procedure illustrated using a toy example. Testing is applied to the 96 observations joined at the second node from the root. (A) Scatterplot of the observations in ℝ². (B) The corresponding dendrogram. (C) Hierarchical clustering applied to 1000 datasets simulated from a null Gaussian estimated from the 96 observations. (D) Distributions of null cluster indices used to calculate the empirical SHC p-values. This figure appears in color in the electronic version of this article.
Figure 3
Analysis of gene expression for 337 BRCA samples. (A) Heatmap of gene expression for the 337 samples (columns) clustered by Ward’s linkage. (B) Dendrogram with corresponding SHC p-values and α* cutoffs given only at nodes tested according to the FWER controlling procedure at α = 0.05. This figure appears in color in the electronic version of this article.
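
The Monte Carlo test at a single node, as described in the Figure 2 caption, can be sketched roughly as follows. This is an illustrative simplification, not the authors' implementation: it assumes Ward's linkage, takes the cluster index to be the ratio of within-cluster to total sum of squares for the two-cluster split, and estimates the null Gaussian from the plain sample mean and covariance. The function names `two_cluster_index` and `monte_carlo_p_value` are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def two_cluster_index(x):
    """Split x into two clusters with Ward's linkage and return the
    within-cluster sum of squares divided by the total sum of squares.
    Smaller values indicate a cleaner two-cluster split."""
    labels = fcluster(linkage(x, method="ward"), t=2, criterion="maxclust")
    total_ss = np.sum((x - x.mean(axis=0)) ** 2)
    within_ss = sum(
        np.sum((x[labels == k] - x[labels == k].mean(axis=0)) ** 2)
        for k in np.unique(labels)
    )
    return within_ss / total_ss


def monte_carlo_p_value(x, n_sim=1000, seed=0):
    """Empirical p-value for the two-cluster split of x against a
    single-Gaussian null (assumption: null fitted by sample mean/covariance)."""
    rng = np.random.default_rng(seed)
    observed = two_cluster_index(x)

    # Fit the null Gaussian to the observations under test.
    mean = x.mean(axis=0)
    cov = np.cov(x, rowvar=False)

    # Re-cluster each simulated null dataset and record its cluster index.
    null = np.array([
        two_cluster_index(rng.multivariate_normal(mean, cov, size=len(x)))
        for _ in range(n_sim)
    ])

    # Proportion of null indices at least as extreme (as small) as observed,
    # with the usual +1 correction so the p-value is never exactly zero.
    p_value = (1 + np.sum(null <= observed)) / (1 + n_sim)
    return observed, null, p_value
```

Calling `monte_carlo_p_value(x)` on an `(n, d)` array returns the observed index, the simulated null indices (the kind of distribution shown in panel D), and the empirical p-value. In the full procedure a test of this kind would be repeated down the dendrogram, with a multiplicity correction applied across nodes.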
