Unsupervised Learning (Part-3)

K-means : Continuing from the previous post, the time complexity of k-means is O(tkn), where t is the number of iterations, k is the number of clusters, and n is the number of objects. It can become inefficient when k is large, but it is still more efficient than many other clustering algorithms. Because the result depends on the random initial centroids, k-means can converge to a local optimum; techniques such as deterministic annealing and genetic algorithms can help it reach a good local optimum.
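To make the O(tkn) cost concrete, here is a minimal NumPy sketch of the standard k-means loop (the function and parameter names are illustrative, not from any particular library): each of the t iterations computes distances from all n points to all k centroids.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal k-means (Lloyd's algorithm). Each iteration costs O(kn)
    distance computations, so the total cost is O(t*k*n)."""
    rng = np.random.default_rng(seed)
    # Initialize centroids as k distinct random points from the data.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iters):
        # Assignment step: label each point with its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its points.
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if the cluster is empty
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break  # converged early
        centroids = new_centroids
    return centroids, labels
```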

Weaknesses of K-means :

  • Need to specify k, the number of clusters, in advance.
  • Applicable only when a mean is defined (so it does not apply directly to categorical data).
  • Not suitable for discovering clusters with non-convex shapes.

Choosing Initial Centroids :

  • Choose one centroid (for example, at random).
  • Repeatedly pick the point furthest from the existing centroids as the next centroid.
With this farthest-point heuristic, we don't know when to stop.
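A short sketch of this farthest-point initialization, assuming k is given (the function name is illustrative):

```python
import numpy as np

def farthest_point_init(X, k, seed=0):
    """Pick k initial centroids: start with one random point, then
    repeatedly take the point farthest from all chosen centroids."""
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Distance from each point to its nearest already-chosen centroid.
        dists = np.min(
            [np.linalg.norm(X - c, axis=1) for c in centroids], axis=0)
        centroids.append(X[dists.argmax()])
    return np.array(centroids)
```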

Hierarchical Clustering : Here, information is lost as the height of the dendrogram grows; the higher the level at which we cut the tree, the more information we lose. A practical way to combine it with k-means: take a sample of the data, perform hierarchical clustering on it, extract k clusters, and use their centroids as the initial centroids for k-means, as in the sketch below.
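A minimal sketch of that recipe using SciPy and scikit-learn (the sample size and function name are illustrative assumptions):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

def hierarchical_seeded_kmeans(X, k, sample_size=200, seed=0):
    """Seed k-means with centroids found by hierarchical clustering of a sample."""
    rng = np.random.default_rng(seed)
    sample = X[rng.choice(len(X), size=min(sample_size, len(X)),
                          replace=False)]
    # Agglomerative (Ward) clustering on the sample, cut into k clusters.
    labels = fcluster(linkage(sample, method="ward"), t=k,
                      criterion="maxclust")
    # Centroid of each hierarchical cluster becomes an initial k-means centroid.
    init = np.array([sample[labels == j].mean(axis=0)
                     for j in range(1, k + 1)])
    return KMeans(n_clusters=k, init=init, n_init=1).fit(X)
```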

Empty Clusters : An empty cluster is one to which no instance is assigned during an iteration. To repair it, either choose the point furthest away from its centroid, or choose a point from the cluster with the maximum SSE. In both cases a large cluster is effectively split and the total SSE is reduced. The overall objective remains to minimize intra-cluster distance and maximize inter-cluster distance. A sketch of the second strategy follows.
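This is one possible implementation of that repair, with the helper name and details as assumptions rather than anything from the post: when a cluster comes up empty, move the worst-fitting point out of the cluster with the maximum SSE.

```python
import numpy as np

def repair_empty_clusters(X, labels, centroids):
    """Reassign a point from the max-SSE cluster to each empty cluster."""
    k = len(centroids)
    for j in range(k):
        if np.any(labels == j):
            continue  # cluster j is not empty, nothing to repair
        # Per-point squared distance to its own centroid (SSE contributions).
        sq_err = np.sum((X - centroids[labels]) ** 2, axis=1)
        # Find the cluster with the maximum total SSE.
        worst = max(range(k), key=lambda c: sq_err[labels == c].sum())
        # Move that cluster's worst-fitting point into the empty cluster.
        idx = np.where(labels == worst)[0]
        moved = idx[sq_err[idx].argmax()]
        labels[moved] = j
        centroids[j] = X[moved]
    return labels, centroids
```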

Techniques for optimal k :

  • Elbow Curve : The elbow method runs k-means on the dataset for a range of values of k (say 1-10) and, for each k, computes an average score over all clusters. By default, the distortion score is computed: the sum of squared distances from each point to its assigned center. The elbow method looks at the total WSS as a function of the number of clusters: one should choose the number of clusters at which adding another cluster no longer improves the total WSS much (see the sketch after this list).
  • Silhouette Score : This computes the average silhouette of the observations for different values of k. The optimal number of clusters k is the one that maximizes the average silhouette over the range of candidate values.
  • Gap Statistic : The gap statistic compares the total intra-cluster variation for different values of k with its expected value under a null reference distribution of the data. The estimate of the optimal number of clusters is the value of k that maximizes the gap statistic, i.e., the value for which the clustering structure is farthest from a random uniform distribution of points.
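A minimal sketch of the first two techniques with scikit-learn, on an illustrative synthetic dataset: it prints the distortion (inertia, i.e., total WSS) and the average silhouette for each k, so the elbow and the silhouette maximum can be read off.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Toy dataset with 4 true clusters (illustrative only).
X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the total within-cluster sum of squares (WSS / distortion).
    wss = km.inertia_
    # Average silhouette over all points; higher is better.
    sil = silhouette_score(X, km.labels_)
    print(f"k={k:2d}  WSS={wss:10.1f}  silhouette={sil:.3f}")
```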