Hierarchical Clustering

Shaily jain
5 min read · Apr 23, 2021

Hasn’t specifying the number of clusters in k-means and k-medoids been a pain? No worries, because hierarchical clustering saves us from that mess. Another advantage is the ability to visualize the construction of the clusters diagrammatically.

Let us first introduce the tree-based representation, the dendrogram, which makes things beautiful for hierarchical clustering.

  1. Conclusions about the proximity of two observations should not be drawn from their positions along the horizontal axis, nor from the vertical connections alone. Rather, the height of the branch at which an observation joins the cluster of observations below it indicates the distance between that observation and that cluster. For example, consider observations 9 and 2 in the figure below. They appear close on the dendrogram (right), but in fact their closeness only implies that they are approximately the same distance from the cluster they are fused to (observations 5, 7, and 8). It by no means implies that observations 9 and 2 are close to one another.

2. To obtain the final number of clusters, we cut the dendrogram with a horizontal line at some height on the y-axis. The height of the cut controls the number of clusters obtained. A good choice for the number of clusters is the number of vertical lines crossed by a horizontal line that can traverse the maximum vertical distance without intersecting a cluster (merge point).

In the example below, the best choice is 4 clusters, because the red horizontal line in the dendrogram covers the maximum vertical distance AB. (The R sketch after this list shows how such a cut is made in code.)

3. And the cherry on top: you can compute a correlation (the “Baker” or “cophenetic” correlation) between different dendrograms built from the same data (different, for example, because of the linkage method used), as shown in the sketch below.
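Here is a minimal sketch in R of the ideas above, using the built-in USArrests data purely for illustration (the dataset, the choice of k = 4, and the two linkage methods are just assumptions for the example):

```r
## Build a dendrogram, cut it into clusters, and compare two dendrograms
## of the same data via cophenetic correlation.
df <- scale(USArrests)                 # standardize variables first
d  <- dist(df, method = "euclidean")   # dissimilarity matrix

hc_complete <- hclust(d, method = "complete")
hc_average  <- hclust(d, method = "average")

plot(hc_complete, cex = 0.6)                      # draw the dendrogram
rect.hclust(hc_complete, k = 4, border = "red")   # visualize a 4-cluster cut

clusters <- cutree(hc_complete, k = 4)   # or cut at a chosen height h: cutree(hc_complete, h = ...)
table(clusters)

## Cophenetic correlation: between a dendrogram and the original distances,
## and between two dendrograms built with different linkage methods
cor(d, cophenetic(hc_complete))
cor(cophenetic(hc_complete), cophenetic(hc_average))
```

(The dendextend package also offers cor_cophenetic() and cor_bakers_gamma() for comparing dendrograms directly.)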

Here is a representation I found, done in R.

Hierarchical clustering can be divided into two main types:

  1. Agglomerative clustering: AGNES (AGglomerative NESting) works in a bottom-up manner. That is, each observation is initially considered a single-element cluster (leaf). At each step of the algorithm, the two clusters that are the most similar (i.e., closest) are combined into a new, bigger cluster (node). This procedure is iterated until all points are members of one single big cluster (the root). The result is a tree that can be displayed as a dendrogram.
  2. Divisive hierarchical clustering: DIANA (DIvisive ANAlysis) works in a top-down manner and is essentially the reverse of AGNES. It begins with the root, in which all observations are included in a single cluster. At each step of the algorithm, the current cluster is split into the two most heterogeneous clusters. The process is iterated until every observation is in its own cluster. (Both approaches are sketched in R after this list.)
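Both approaches are available in R's cluster package. A minimal sketch, again using USArrests only as an illustrative dataset (the metric, linkage, and k = 4 choices are assumptions for the example):

```r
## AGNES (bottom-up) and DIANA (top-down) with the cluster package
library(cluster)

df <- scale(USArrests)

hc_agnes <- agnes(df, metric = "euclidean", method = "ward")  # agglomerative
hc_diana <- diana(df, metric = "euclidean")                   # divisive

hc_agnes$ac   # agglomerative coefficient: closer to 1 = stronger structure
hc_diana$dc   # divisive coefficient, interpreted the same way

## Both objects can be converted and cut like an ordinary hclust tree
cutree(as.hclust(hc_agnes), k = 4)
cutree(as.hclust(hc_diana), k = 4)
```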

Note that agglomerative clustering is good at identifying small clusters. Divisive hierarchical clustering, on the other hand, is better at identifying large clusters.

Similar to k-means, we measure the (dis)similarity of observations using distance measures (e.g., Euclidean distance, Manhattan distance); Euclidean distance is the most common default. However, a fundamental question in hierarchical clustering is: how do we measure the dissimilarity between two clusters of observations? A number of different cluster agglomeration methods (i.e., linkage methods) have been developed to answer this question. The most common methods, compared in the R sketch after the list, are:

  • Maximum or complete linkage clustering: Computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2, and considers the largest value of these dissimilarities as the distance between the two clusters. It tends to produce more compact clusters.
  • Minimum or single linkage clustering: Computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2, and considers the smallest of these dissimilarities as a linkage criterion. It tends to produce long, “loose” clusters.
  • Mean or average linkage clustering: Computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2, and considers the average of these dissimilarities as the distance between the two clusters. Can vary in the compactness of the clusters it creates.
  • Centroid linkage clustering: Computes the dissimilarity between the centroid for cluster 1 (a mean vector of length p, one element for each variable) and the centroid for cluster 2.
  • Ward’s minimum variance method: Minimizes the total within-cluster variance. At each step, the pair of clusters with the smallest between-cluster distance is merged. Tends to produce more compact clusters.
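As a rough way to compare linkage methods on the same data, one can build a tree per method and look at the agglomerative coefficient each produces. A minimal sketch (the dataset and the list of methods are just assumptions for the example):

```r
## Same distance matrix, different linkage methods
df <- scale(USArrests)
d  <- dist(df, method = "euclidean")   # try method = "manhattan" as well

linkages <- c("complete", "single", "average", "centroid", "ward.D2")
trees <- lapply(linkages, function(m) hclust(d, method = m))
names(trees) <- linkages
plot(trees[["complete"]], cex = 0.6, main = "Complete linkage")

## Agglomerative coefficient per linkage (via cluster::agnes); values closer
## to 1 suggest a stronger, more compact clustering structure
library(cluster)
sapply(c("complete", "single", "average", "ward"),
       function(m) agnes(df, method = m)$ac)
```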

Complete linkage and Ward’s method are often preferred for AGNES, while mean or average linkage is often preferred for DIANA.

Check out the different dendrograms produced by each linkage method:

Points to Note

  1. Don’t forget to standardize your variables if you want to avoid blunders (see the sketch after this list).
  2. Although the number of clusters K does not have to be supplied to the method, you still have to make several decisions, such as the choice of distance metric and linkage method.
  3. Even though the number of clusters isn’t required at the start, in the end everything we did was aimed at forming clusters. So, to settle on a suitable number of clusters, we need to decide where to cut the dendrogram.
  4. Hierarchical clustering does not scale well to large datasets, since the standard algorithms work on the full pairwise distance matrix, which grows quadratically with the number of observations.
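To see why point 1 matters, here is a minimal sketch comparing clusterings with and without standardization (again on the illustrative USArrests data, whose variables are on very different scales; the linkage and k = 4 are assumptions for the example):

```r
## Point 1: standardization changes the result
d_raw    <- dist(USArrests)          # variables with large units dominate
d_scaled <- dist(scale(USArrests))   # each variable contributes comparably

hc_raw    <- hclust(d_raw,    method = "ward.D2")
hc_scaled <- hclust(d_scaled, method = "ward.D2")

## Cross-tabulate the two 4-cluster solutions; they can differ substantially
table(raw = cutree(hc_raw, k = 4), scaled = cutree(hc_scaled, k = 4))
```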

Resources:

  1. https://www.analyticsvidhya.com/blog/2016/11/an-introduction-to-clustering-and-different-methods-of-clustering/
  2. https://medium.com/ml-research-lab/data-scientist-learning-path-2018-a82e67d49d8e
  3. https://bradleyboehmke.github.io/HOML/hierarchical.html
  4. https://www.datanovia.com/en/courses/hierarchical-clustering-in-r-the-essentials/
