Types of Cluster Analysis and Techniques Using R

Index Table

  • Definition
  • Types
  • Techniques used to form clusters

Definition:

  1. Cluster analysis groups similar data points into the same group.
  2. The goal is that objects within a group are similar to one another and different from objects in other groups.
  3. The greater the similarity within a group and the greater the difference between groups, the more distinct the clustering.
  4. Cluster analysis reveals potential relationships and constructs a systematic structure in a large number of variables and observations.

Main objectives of clustering are:

  1. Intra-cluster distance is minimized.
  2. Inter-cluster distance is maximized.

[Figure: intra-cluster vs inter-cluster distances]
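To make these two objectives concrete, here is a minimal sketch (my own illustrative example, not part of the original article) that measures the intra-cluster and inter-cluster distances for two small groups of 2-D points:

# Illustrative sketch: intra- and inter-cluster distance for two toy groups
set.seed(1)
x <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),   # group 1 scattered around (0, 0)
           matrix(rnorm(20, mean = 5), ncol = 2))   # group 2 scattered around (5, 5)
grp <- rep(1:2, each = 10)

centroids <- rbind(colMeans(x[grp == 1, ]), colMeans(x[grp == 2, ]))

# Intra-cluster distance: average distance of each point to its own centroid
intra <- sapply(1:2, function(k)
  mean(sqrt(rowSums(sweep(x[grp == k, ], 2, centroids[k, ])^2))))

# Inter-cluster distance: distance between the two centroids
inter <- dist(centroids)

intra  # small values -> compact clusters (what clustering tries to minimize)
inter  # large value  -> well-separated clusters (what clustering tries to maximize)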

Types:

  1. Hierarchical clustering: Also known as nested clustering, as it allows clusters to exist within bigger clusters, forming a tree (a minimal sketch follows this list).
  2. Partition clustering: Simply a division of the set of data objects into non-overlapping clusters such that each object is in exactly one subset.
  3. Exclusive clustering: Each value is assigned to a single cluster.
  4. Overlapping clustering: Used to reflect the fact that an object can simultaneously belong to more than one group.
  5. Fuzzy clustering: Every object belongs to every cluster with a membership weight between 0 (it absolutely does not belong to the cluster) and 1 (it absolutely belongs to the cluster).
  6. Complete clustering: Performs hierarchical clustering using a set of dissimilarities on the ‘n’ objects being clustered. It tends to find compact clusters of approximately equal diameter.
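As a rough illustration of the hierarchical and fuzzy types above, here is a minimal sketch (my own example, using the built-in iris data that the rest of the article works with; the choice of the petal columns and k = 3 are assumptions made for the example):

# Hierarchical (complete-linkage) clustering on the iris petal measurements
d <- dist(iris[, 3:4])                  # pairwise Euclidean dissimilarities
hc <- hclust(d, method = "complete")    # complete linkage -> compact clusters
plot(hc)                                # dendrogram: clusters nested inside bigger clusters
cutree(hc, k = 3)                       # cut the tree into 3 clusters

# Fuzzy clustering: every object gets a membership weight between 0 and 1
library(cluster)                        # provides fanny()
fc <- fanny(iris[, 3:4], k = 3)
head(fc$membership)                     # each row sums to 1 across the 3 clusters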

Techniques used to form clusters: 

  • K-means
  • Agglomerative hierarchical clustering
  • DBSCAN (a short sketch of this density-based method follows this list).
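K-means and agglomerative hierarchical clustering are sketched elsewhere in this article; for completeness, DBSCAN might look roughly like this (my own hedged sketch: it needs the dbscan package, and the eps and minPts values are illustrative guesses, not tuned):

# Density-based clustering with DBSCAN (requires the 'dbscan' package)
library(dbscan)
db <- dbscan(iris[, 3:4], eps = 0.4, minPts = 5)   # eps/minPts chosen for illustration only
table(db$cluster, iris$Species)                    # cluster 0 holds points labeled as noise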

In this article we will learn K-means clustering using R.

K-means: 

K-means clustering is an unsupervised learning algorithm that tries to cluster data based on their similarity. Unsupervised learning means that there is no outcome to be predicted; the algorithm simply tries to find patterns in the data. In k-means clustering, we have to specify the number of clusters we want the data to be grouped into. The algorithm randomly assigns each observation to a cluster and finds the centroid of each cluster. Then the algorithm iterates through two steps:

  • Reassign data points to the cluster whose centroid is closest.
  • Calculate the new centroid of each cluster.

These two steps are repeated until the within-cluster variation cannot be reduced any further. The within-cluster variation is calculated as the sum of the squared Euclidean distances between the data points and their respective cluster centroids.
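To make the two steps concrete, here is a toy, hand-rolled version of the assign/update loop (for intuition only; this is not the article's code, and the built-in kmeans() used later is what you would normally call):

# Toy version of the two k-means steps on the iris petal columns
x <- as.matrix(iris[, 3:4])
k <- 3
set.seed(1)
centers <- x[sample(nrow(x), k), ]   # k random points as the starting centroids

for (i in 1:10) {
  # Step 1: reassign each point to the cluster with the closest centroid
  d  <- sapply(1:k, function(j) rowSums(sweep(x, 2, centers[j, ])^2))
  cl <- max.col(-d)                  # index of the smallest squared distance
  # Step 2: recalculate each centroid as the mean of its assigned points
  centers <- t(sapply(1:k, function(j) colMeans(x[cl == j, , drop = FALSE])))
}

table(cl, iris$Species)              # the groups roughly track the three species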

Exploring the data

The iris dataset contains data about sepal length, sepal width, petal length, and petal width of flowers of different species. Let us see what it looks like:

library(datasets)
head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

After a little bit of exploration, I found that Petal.Length and Petal.Width were similar among the same species but varied considerably between different species, as demonstrated below:

library(ggplot2)  # load the ggplot2 graphics package
ggplot(iris, aes(Petal.Length, Petal.Width, color = Species)) + geom_point()

Here:

‘iris’ is the name of the dataset.

‘Petal.Length’ and ‘Petal.Width’ are the measured properties of the flowers.

‘color = Species’ means each species will be drawn in a different color.

geom_point() means the output will be shown as points.

[Plot: Petal.Length vs Petal.Width, colored by Species]

In this plot you can see that petal length and width are very similar within each species but differ considerably between the three species.

Clustering

Okay, now that we have seen the data, let us try to cluster it. Since the initial cluster assignments are random, let us set the seed to ensure reproducibility.

set.seed(20)
irisCluster <- kmeans(iris[, 3:4], 3, nstart = 20)
irisCluster
K-means clustering with 3 clusters of sizes 50, 52, 48

Cluster means:
  Petal.Length Petal.Width
1     1.462000    0.246000
2     4.269231    1.342308
3     5.595833    2.037500

Clustering vector:
 [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 [75] 2 2 2 3 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 2 3 3 3 3
[112] 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3
[149] 3 3

Within cluster sum of squares by cluster:
[1] 2.02200 13.05769 16.29167
 (between_SS / total_SS = 94.3 %)

Available components:

[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault" 

Since we know that there are 3 species involved, we ask the algorithm to group the data into 3 clusters, and since the starting assignments are random, we specify nstart = 20. This means that R will try 20 different random starting assignments and then select the one with the lowest within-cluster variation.
We can see the cluster centroids, the clusters each data point was assigned to, and the within-cluster variation.
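For example, the components listed under "Available components:" in the output above can be pulled out of the fitted object directly (a few illustrative lines):

irisCluster$centers        # the 3 cluster centroids (Petal.Length, Petal.Width)
irisCluster$size           # number of points in each cluster: 50, 52, 48
irisCluster$withinss       # within-cluster sum of squares, one value per cluster
irisCluster$tot.withinss   # the total within-cluster variation the algorithm minimizes
irisCluster$betweenss / irisCluster$totss   # the 94.3 % figure printed above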

Let us compare the clusters with the species.

table(irisCluster$cluster, iris$Species)
     setosa versicolor virginica
 1     50        0         0
 2      0       48         4
 3      0        2        46

As we can see, the data belonging to the setosa species got grouped into cluster 1, versicolor into cluster 2, and virginica into cluster 3. The algorithm wrongly grouped two data points belonging to versicolor and four data points belonging to virginica.
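Because cluster 1 happens to line up with setosa, cluster 2 with versicolor, and cluster 3 with virginica, the diagonal of the table counts the correctly grouped points (a small illustrative check, not part of the original article):

tab <- table(irisCluster$cluster, iris$Species)
sum(diag(tab))              # 144 points fall in the "right" cluster
sum(tab) - sum(diag(tab))   # 6 points are mis-grouped (2 versicolor + 4 virginica)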

We can also plot the data to see the clusters:

irisCluster$cluster <- as.factor(irisCluster$cluster)
ggplot(iris, aes(Petal.Length, Petal.Width, color = irisCluster$cluster)) + geom_point()

Here is the plot: [Plot: Petal.Length vs Petal.Width, colored by assigned cluster]

That brings us to the end of the article. I hope you enjoyed it! If you have any questions or feedback, feel free to leave a comment.
