Statistical cluster analysis is an exploratory data analysis technique that groups heterogeneous objects (M.D.) into homogeneous groups. We will learn the basics of cluster analysis from a mathematical point of view.
Cluster Analysis can be done by two methods:
- Hierarchical cluster analysis.
- Non-Hierarchical cluster analysis.
Hierarchical cluster analysis (HCA):
- In HCA, the observation vectors (cases) are grouped together on the basis of their mutual distances.
- An HCA is usually visualised through a hierarchical tree called a dendrogram. This hierarchical tree is a nested set of partitions represented by a tree diagram.
Characteristics of HCA:
- Sectioning a tree at a particular level produces a partition into ‘g’ disjoint groups.
- If two groups are chosen from different partitions, then either the groups are disjoint or one group is totally contained within the other.
- A numerical value is associated with each point up the tree where branches join together. This value is a measure of the distance or dissimilarity between the two merged clusters.
- Different distance measures give rise to different hierarchical cluster structures.
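To make the first two characteristics concrete, here is a minimal sketch of sectioning a tree at a given level. The function name `cut_tree`, the merge-record format and the merge heights are all assumptions made up for this illustration; they are not taken from any particular library.

```python
def cut_tree(n, merges, level):
    """Return the partition obtained by sectioning the tree at `level`.

    n      -- number of original objects, labelled 0..n-1
    merges -- list of (a, b, height) tuples; a and b are object labels,
              one taken from each of the two clusters being joined
    level  -- only merges with height <= level are applied
    """
    parent = list(range(n))  # simple union-find forest

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b, height in merges:
        if height <= level:
            parent[find(a)] = find(b)

    groups = {}
    for obj in range(n):
        groups.setdefault(find(obj), []).append(obj)
    return sorted(groups.values())


# Hypothetical merge record for 5 objects (heights invented for illustration).
merges = [(0, 1, 1.0), (3, 4, 1.5), (2, 3, 2.5), (0, 2, 4.0)]

coarse = cut_tree(5, merges, level=3.0)  # [[0, 1], [2, 3, 4]]
fine = cut_tree(5, merges, level=1.2)    # [[0, 1], [2], [3], [4]]
```

Cutting higher up gives fewer, larger groups. Every group in `fine` is either contained in a group of `coarse` or disjoint from it, which is the nesting property described above.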
There are two types of approaches for HCA:
- Agglomerative HCA
- Divisive HCA
Agglomerative HCA:
- Operates by successive merges of cases.
- Begin with n clusters, each containing a single case.
- At each stage, merge the two most similar groups to form a new cluster, thus reducing the number of clusters by one.
- Continue until all subgroups are fused into one single cluster (the similarity at which groups are merged decreases as the process goes on).
Divisive HCA:
- The divisive method operates by successive splitting of groups.
- Initially start with a single group (i.e. one single cluster containing all cases).
- Divide the group into two subgroups such that the objects in one subgroup are as far as possible from the objects in the other subgroup.
- Continue until there are n groups, each containing a single case.
Note: The results of both approaches are displayed through a dendrogram.
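The divisive idea can be sketched in a few lines for a tiny data set. This is only an illustration under stated assumptions: "as far as possible" is measured here by the average between-subgroup distance, the split is found by exhaustive search, and the names `best_split` and `divisive` are hypothetical. Practical divisive procedures rely on heuristics, because exhaustive search is only feasible for very small groups.

```python
from itertools import combinations


def best_split(group, dist):
    """Split `group` into two non-empty subgroups that are as far apart as possible.

    Assumption: "far apart" means the largest average between-subgroup distance.
    Returns (average distance, left subgroup, right subgroup), or None for a
    group with a single member.
    """
    members = sorted(group)
    first, rest = members[0], members[1:]
    best = None
    # Keep `first` in the left subgroup so each two-way split is tried once.
    for k in range(len(rest)):
        for extra in combinations(rest, k):
            left = [first, *extra]
            right = [m for m in rest if m not in extra]
            avg = sum(dist[i][j] for i in left for j in right) / (len(left) * len(right))
            if best is None or avg > best[0]:
                best = (avg, left, right)
    return best


def divisive(dist):
    """Record the successive splits, starting from one group holding every object."""
    pending = [list(range(len(dist)))]
    splits = []
    while pending:
        group = pending.pop(0)
        if len(group) == 1:
            continue  # a single case cannot be split further
        _, left, right = best_split(group, dist)
        splits.append((group, left, right))
        pending += [left, right]
    return splits


# Four hypothetical objects on a line; distances are absolute differences.
positions = [0.0, 1.0, 5.0, 6.0]
dist = [[abs(a - b) for b in positions] for a in positions]
splits = divisive(dist)
# First split: [0, 1, 2, 3] -> [0, 1] and [2, 3]
```

With n objects the procedure records n-1 splits, mirroring the n-1 merges of the agglomerative approach.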
Steps Involved in Agglomerative HCA:
- Start with n clusters, each containing a single object, and an n×n symmetric matrix of distances (or similarities), $D = ((d_{ij}))$.
- Search the distance matrix D for the nearest (most similar) pair of clusters. Let the distance between the most similar clusters, say U and V, be denoted by $d_{UV}$.
- Merge clusters U and V and label the newly formed cluster (U,V). This produces an (n-1)×(n-1) matrix. Update the distance matrix by doing the following:
  - Delete the rows and columns corresponding to the clusters U and V.
  - Add a row and a column giving the distances between the newly formed cluster (U,V) and the remaining clusters.
- Repeat the second and third steps a total of (n-1) times. Record the identity of the clusters that are merged and the level (distance or dissimilarity) at which they are merged.
- Construct the dendrogram from the information on mergers and merger levels.
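The steps above can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the names `agglomerate`, `lookup` and `update` are hypothetical, and the distance matrix is just a small example. The `update` argument decides how the distance from the merged cluster (U,V) to another cluster W is obtained from d(U,W) and d(V,W): `min` corresponds to single linkage and `max` to complete linkage.

```python
def lookup(d, a, b):
    """Distance between clusters a and b, whichever order the key was stored in."""
    return d[(a, b)] if (a, b) in d else d[(b, a)]


def agglomerate(dist, update=min):
    """Agglomerative HCA on a symmetric distance matrix (list of lists).

    Returns the merge record as a list of (U, V, level) tuples.
    """
    n = len(dist)
    # Step 1: n singleton clusters (tuples of object labels) and their distances.
    clusters = [(i,) for i in range(n)]
    d = {(clusters[i], clusters[j]): dist[i][j]
         for i in range(n) for j in range(i + 1, n)}
    record = []
    for _ in range(n - 1):
        # Step 2: find the nearest pair of clusters.
        (u, v), level = min(d.items(), key=lambda kv: kv[1])
        record.append((u, v, level))
        # Step 3: delete the entries for U and V, add entries for (U,V).
        merged = u + v
        others = [c for c in clusters if c not in (u, v)]
        new_d = {pair: val for pair, val in d.items()
                 if u not in pair and v not in pair}
        for w in others:
            new_d[(w, merged)] = update(lookup(d, u, w), lookup(d, v, w))
        clusters = others + [merged]
        d = new_d
    return record


# Illustrative distance matrix for 5 objects, labelled 0..4.
distances = [
    [0, 9, 3, 6, 11],
    [9, 0, 7, 5, 10],
    [3, 7, 0, 9, 2],
    [6, 5, 9, 0, 8],
    [11, 10, 2, 8, 0],
]

record = agglomerate(distances)          # single linkage
levels = [level for _, _, level in record]  # [2, 3, 5, 6]
```

The `record` list holds exactly the information the last step needs to draw the dendrogram: which clusters were merged and at which level. Calling `agglomerate(distances, update=max)` runs the same steps with complete linkage and, in general, produces a different tree.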
Possible distance measures between two clusters:
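- Single linkage: minimum distance, or nearest-neighbour approach.
$d(k_1, k_2) = \min_{i \in k_1,\, j \in k_2} d(i, j)$
Here i∈k1 and j∈k2, i.e. the distance between two clusters is the smallest distance between a member of one cluster and a member of the other.
Example: what is the distance between cluster 1 = {1, 2} and cluster 2 = {3, 4, 5}?
Under the single linkage approach, min[d(1,3), d(1,4), d(1,5), d(2,3), d(2,4), d(2,5)] = d(2,5).
A worked example of single linkage is attached as a PDF.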
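- Complete linkage: maximum distance between clusters, or farthest-neighbour approach.
$d(k_1, k_2) = \max_{i \in k_1,\, j \in k_2} d(i, j)$
For cluster 1 = {1, 2} and cluster 2 = {3, 4, 5}, the candidate distances are d(1,3), d(1,4), d(1,5), d(2,3), d(2,4), d(2,5).
Complete linkage distance between cluster 1 and cluster 2 = the largest of these = d(1,4).
A worked example of complete linkage is attached.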
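- Average linkage: average distance over all pairs of objects, one taken from each cluster.
Average linkage distance between cluster 1 = {1, 2} and cluster 2 = {3, 4, 5}:
$d(k_1, k_2) = \frac{1}{6} \sum_{i \in k_1} \sum_{j \in k_2} d(i, j)$, where 6 = 2 × 3 is the number of pairs (i, j) with i in cluster 1 and j in cluster 2. In general the factor is $1/(n_1 n_2)$, with $n_1$ and $n_2$ the cluster sizes.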
- Centroid linkage: distance between the centroids of the two clusters.
- Median linkage: distance between the medians of the two clusters.
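To compare these measures side by side, here is a small sketch that computes the single, complete, average and centroid linkage distances between cluster 1 = {1, 2} and cluster 2 = {3, 4, 5}. The coordinates are invented for illustration and are chosen so that, as in the example above, the nearest pair is (2, 5) and the farthest pair is (1, 4). Euclidean distance is an assumption too. Median linkage is left out of the sketch.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)

# Hypothetical coordinates for objects 1..5 (not from the original example).
points = {1: (0.0, 0.0), 2: (2.0, 0.0), 3: (5.0, 2.0), 4: (6.0, 0.0), 5: (4.0, 0.0)}
cluster1, cluster2 = [1, 2], [3, 4, 5]

# All 2 x 3 = 6 between-cluster distances:
# d(1,3), d(1,4), d(1,5), d(2,3), d(2,4), d(2,5)
pairwise = [dist(points[i], points[j]) for i in cluster1 for j in cluster2]

single = min(pairwise)     # nearest neighbour, here d(2,5)
complete = max(pairwise)   # farthest neighbour, here d(1,4)
average = sum(pairwise) / (len(cluster1) * len(cluster2))


def centroid(cluster):
    """Coordinate-wise mean of the objects in a cluster."""
    xs, ys = zip(*(points[i] for i in cluster))
    return (sum(xs) / len(cluster), sum(ys) / len(cluster))


centroid_link = dist(centroid(cluster1), centroid(cluster2))

print(f"single={single:.3f} complete={complete:.3f} "
      f"average={average:.3f} centroid={centroid_link:.3f}")
```

Whatever the coordinates, the average linkage distance always lies between the single and complete linkage distances, since it is the mean of the same six numbers of which they are the minimum and the maximum.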
Hierarchical cluster analysis ends here. In the next tutorial article I will explain non-hierarchical cluster analysis.
Till then, stay tuned and keep visiting for more tutorials.