Hierarchical Clustering Techniques using R

The idea behind hierarchical cluster analysis is to show which of a (potentially large) set
of samples are most similar to one another, and to group these similar samples in the same
limb of a tree.

Each of the samples can be thought of as sitting in an m-dimensional space, defined by the m variables (columns) in the data frame. We define similarity on the basis of the distance between two samples in this m-dimensional space.

Several different distance measures could be used, but the default is Euclidean distance and this is used to work out the distance from every sample to every other sample.

For the other options, check

?dist
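
As a small illustration of how the choice of measure changes the result, the sketch below computes the distance between two made-up points with three of the measures that dist supports. The points are purely illustrative and are not taken from the dataset used later.

```r
# Two illustrative points in 2-dimensional space (made-up values)
p <- rbind(c(0, 0), c(3, 4))

d_euclidean <- dist(p)                        # default, method = "euclidean"
d_manhattan <- dist(p, method = "manhattan")  # sum of absolute differences
d_maximum   <- dist(p, method = "maximum")    # largest single difference

print(d_euclidean)  # 5
print(d_manhattan)  # 7
print(d_maximum)    # 4
```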

This quantitative dissimilarity structure of the data is stored in a matrix produced by the dist function.
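
To make the structure of this matrix concrete, here is a minimal sketch with three made-up samples (the names a, b, c and their values are assumptions for illustration only). It checks one entry against the Euclidean distance computed by hand.

```r
# Three illustrative samples described by m = 3 variables (made-up values)
samples <- rbind(a = c(1, 2, 3),
                 b = c(4, 6, 3),
                 c = c(1, 2, 8))

# dist() returns the pairwise distances (lower triangle only)
d <- dist(samples)
print(d)

# as.matrix() expands it to the full symmetric matrix, zeros on the diagonal
m <- as.matrix(d)
print(m)

# Euclidean distance between a and b by hand:
# square root of the summed squared differences
by_hand <- sqrt(sum((samples["a", ] - samples["b", ])^2))
print(by_hand)  # 5
```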

Initially, each sample is assigned to its own cluster, and then the hclust algorithm proceeds iteratively, at each stage joining the two most similar clusters, continuing until there is just a single cluster.

For more details about the hclust function, check

?hclust
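
The iterative joining described above can be inspected in the object that hclust returns. The following is a minimal sketch on four made-up one-dimensional samples (the names s1 to s4 are hypothetical); it prints the order in which clusters were joined and the distance at each join.

```r
# Four illustrative one-dimensional samples (made-up values)
x <- matrix(c(1, 2, 6, 10), ncol = 1,
            dimnames = list(c("s1", "s2", "s3", "s4"), NULL))

hc <- hclust(dist(x))  # complete linkage is the default

# One row per joining step: negative numbers are single samples,
# positive numbers refer to a cluster formed at that earlier step
print(hc$merge)

# Distance at which each join happened: 1, 4, 9
print(hc$height)
```

With n samples there are always n - 1 joins, after which a single cluster remains.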

In this example we will cluster countries on the basis of their similarity, so that decision making becomes easier. To find similarities between observations and group the data, we perform a cluster analysis.

country         area    gdp     inflation  life expect  military  pop growth  unemployment
Austria         83871   41600   3.5        79           0.8       0.5         4
Belgium         95326   37589   3.5        78           1.3       0.4         2
Bulgaria        56356   13456   2.6        78           2.3       0.3         3
Croatia         73569   18000   4.5        79           1.5       0.2         5
Czech Republic  43568   27156   4          78           1.6       -2          1
Denmark         338155  37256   2          56           4         2           1.5
Estonia         152632  20156   3          78           2         1.9         4
Germany         132562  36252   4.9        74           2         1.8         3
Hungary         93265   38265   5.9        69           3.1       1.5         3.5
Iceland         100000  25655   1.5        65           4         1.2         3.6
Italy           70125   19654   2.8        86           2         -0.8        2.5
Latvia          302325  38569   3.6        72           1.2       1.9         4
Lithuania       64523   40256   5.6        88           1.3       -1.5        4.01
Luxembourg      65235   32565   4.5        98           1.5       1.6         1.8
Netherlands     41256   12568   2.6        67           1.4       0.6         2.5
Norway          326598  19568   7.2        73           1.69      0.3         1.23
Portugal        312654  18652   1.53       74           2.6       -1.2        1.6
Slovakia        92356   45895   0.26       72           3.1       0.6         5
Slovenia        49265   123654  2.25       75           1.5       0.5         6
Spain           20125   26651   23.5       76.5         2         0.5         4.2
Sweden          502354  21561   26.2       86.3         1.9       -0.2        2.356
Switzerland     495632  125465  56         56.9         1.8       0.003       1.8

In this example we will use hierarchical cluster analysis to group the countries. Cluster analysis also lets us summarise the data by grouping similar observations into clusters. Observations are considered similar when they have similar values across the variables, i.e. if the Euclidean distance between two observations is small, they are grouped together. We can perform the cluster analysis with the dist and hclust functions.

dist function: calculates a distance matrix for the supplied values, using Euclidean distance by default. Hierarchical clustering can then be derived from this distance matrix with the hclust function.

The hclust function has a method argument that specifies how the clustering is done. The available methods include average, Ward (ward.D and ward.D2), single, median, complete and centroid, with complete linkage being the default.
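
As a rough sketch of how the method is selected, the code below clusters the same made-up data with three different linkage methods. The data values and variable names are assumptions for illustration only.

```r
# Six illustrative samples with two variables each (made-up values)
toy <- rbind(c(1.0, 1.0), c(1.5, 1.2),
             c(5.0, 5.0), c(5.5, 4.8),
             c(9.0, 1.0), c(9.2, 1.4))
d <- dist(toy)

hc_complete <- hclust(d)                      # default: method = "complete"
hc_average  <- hclust(d, method = "average")  # average linkage
hc_single   <- hclust(d, method = "single")   # single linkage

# The method used is stored in the result
print(hc_complete$method)
print(hc_average$method)
print(hc_single$method)

# The joining heights generally differ between methods
print(hc_complete$height)
print(hc_single$height)
```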

Steps to make hierarchical clustering in R

step1). First we load the dataset into the R workspace and save it in a variable named survey

survey<-read.csv("survey.csv", header=TRUE)

step2). The syntax to perform hierarchical clustering is hclust of dist of the dataset name

surveyclust<-hclust(dist(survey[-1]))

The hclust result is saved in the variable surveyclust.

The -1 index removes the first column, i.e. the country name, since it is a label rather than a numeric variable and cannot be used in the distance calculation.
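
For readers who do not have survey.csv to hand, the two steps can be tried on a data frame typed in directly. The sketch below uses only the first five rows of the table, and the column names (for example life_expect and pop_growth) are assumptions, since the header of the actual CSV file is not shown.

```r
# First five rows of the table, entered by hand.
# Column names are assumed; the real CSV header may differ.
survey <- data.frame(
  country      = c("Austria", "Belgium", "Bulgaria", "Croatia", "Czech Republic"),
  area         = c(83871, 95326, 56356, 73569, 43568),
  gdp          = c(41600, 37589, 13456, 18000, 27156),
  inflation    = c(3.5, 3.5, 2.6, 4.5, 4),
  life_expect  = c(79, 78, 78, 79, 78),
  military     = c(0.8, 1.3, 2.3, 1.5, 1.6),
  pop_growth   = c(0.5, 0.4, 0.3, 0.2, -2),
  unemployment = c(4, 2, 3, 5, 1)
)

# survey[-1] drops the first column (country), leaving the numeric variables
surveyclust <- hclust(dist(survey[-1]))

print(surveyclust)
```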

step3). Plot the dendrogram

plot(surveyclust)

The clustering result variable is passed to plot to draw the dendrogram.

[Figure: hclust dendrogram]

The numbers you see on the dendrogram are the row numbers of the countries in the table.

Countries are placed in the tree according to their similarity.
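
If country names are preferred over row numbers, one option is the labels argument of plot. This is a minimal sketch on a small hand-typed subset of the table; the names toy and toyclust are hypothetical.

```r
# Small subset of the table, entered by hand (two variables only)
toy <- data.frame(
  country = c("Austria", "Belgium", "Bulgaria", "Croatia"),
  area    = c(83871, 95326, 56356, 73569),
  gdp     = c(41600, 37589, 13456, 18000)
)

toyclust <- hclust(dist(toy[-1]))

# labels replaces the default row numbers at the leaves of the tree
plot(toyclust, labels = toy$country)
```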

step4). We can also mark clusters on the dendrogram using

rect.hclust(surveyclust, k = 5)

The arguments are the clustering result (not the original data frame) and the number of clusters.

The dendrogram will now show 5 clusters, each outlined by a colored rectangle.
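
rect.hclust only draws on the plot. To get the cluster membership as data, the cutree function can be used with the same number of clusters. The sketch below uses six made-up one-dimensional samples rather than the survey data.

```r
# Six illustrative one-dimensional samples forming three obvious pairs
x <- matrix(c(1, 2, 6, 7, 20, 21), ncol = 1)
hc <- hclust(dist(x))

# Cut the tree into k = 3 clusters; one cluster number per sample
groups <- cutree(hc, k = 3)
print(groups)         # 1 1 2 2 3 3

# Number of samples in each cluster
print(table(groups))
```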
