Unsupervised Learning (Part-4)
BIRCH :
BIRCH (balanced iterative reducing and clustering using hierarchies) is an unsupervised data mining algorithm used to perform hierarchical clustering over particularly large data-sets. An advantage of BIRCH is its ability to incrementally and dynamically cluster incoming, multi-dimensional metric data points in an attempt to produce the best quality clustering for a given set of resources (memory and time constraints). In most cases, BIRCH only requires a single scan of the database.
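To make the "incremental and dynamic" aspect concrete, here is a minimal sketch using scikit-learn's Birch estimator (assuming scikit-learn is installed; the dataset and parameter values are illustrative, not taken from the original BIRCH paper):

```python
import numpy as np
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, centers=3, random_state=42)
model = Birch(threshold=0.5, branching_factor=50, n_clusters=3)

# Feed the data in chunks to mimic incremental clustering of an incoming stream;
# each partial_fit call updates the CF tree without revisiting earlier chunks.
for chunk in np.array_split(X, 5):
    model.partial_fit(chunk)

print(model.predict(X[:10]))
```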
Algorithm: The algorithm takes as input a set of N data points, represented as real-valued vectors, and a desired number of clusters K. It operates in four phases, the second of which is optional.
The first phase builds a clustering feature (CF) tree out of the data points, a height-balanced tree data structure, defined as follows:
- Given a set of N d-dimensional data points, the clustering feature CF of the set is defined as the triple CF = (N, LS, SS), where LS = Σ xᵢ is the linear sum and SS = Σ xᵢ² is the square sum of the data points (a small numeric sketch of CF triples follows this list).
- Clustering features are organized in a CF tree, a height-balanced tree with two parameters: branching factor B and threshold T. Each non-leaf node contains at most B entries of the form [CFᵢ, childᵢ], where childᵢ is a pointer to its i-th child node and CFᵢ is the clustering feature of the associated subcluster. A leaf node contains at most L entries, each of the form [CFᵢ]. It also has two pointers, prev and next, which are used to chain all leaf nodes together. The tree size depends on the threshold T. A node is required to fit in a memory page of size P, so B and L are determined by P, and P can be varied for performance tuning. The CF tree is a very compact representation of the dataset because each entry in a leaf node is not a single data point but a subcluster.
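The sketch below shows what a CF triple looks like and why it is convenient: triples are additive, so inserting a point or merging two subclusters only requires adding components, and the centroid and radius can be recovered from the triple alone. The helper names are ours, and SS is kept as the scalar sum of squared norms, which is one common convention:

```python
import numpy as np

def make_cf(points):
    # CF = (N, LS, SS): count, linear sum, and (scalar) square sum
    points = np.asarray(points, dtype=float)
    return len(points), points.sum(axis=0), (points ** 2).sum()

def merge_cf(cf1, cf2):
    # CF triples are additive, which is what makes incremental insertion cheap
    return cf1[0] + cf2[0], cf1[1] + cf2[1], cf1[2] + cf2[2]

def centroid_and_radius(cf):
    n, ls, ss = cf
    centroid = ls / n
    # radius = root-mean-square distance of member points from the centroid
    radius = np.sqrt(max(ss / n - centroid @ centroid, 0.0))
    return centroid, radius

cf_a = make_cf([[1.0, 2.0], [2.0, 2.0]])
cf_b = make_cf([[1.5, 1.0]])
print(centroid_and_radius(merge_cf(cf_a, cf_b)))
```

During insertion, a new point is merged into a leaf entry only if the resulting radius (or diameter) stays below the threshold T; otherwise a new entry is created, which is what keeps each leaf entry a tight subcluster.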
In the second phase, the algorithm scans all the leaf entries in the initial CF tree to rebuild a smaller CF tree, removing outliers and grouping crowded subclusters into larger ones. This phase is marked optional in the original presentation of BIRCH.
In the third phase, an existing clustering algorithm is used to cluster all leaf entries: an agglomerative hierarchical clustering algorithm is applied directly to the subclusters, each represented by its CF vector. This also gives the user the flexibility to specify either the desired number of clusters or the desired diameter threshold for clusters. After this phase a set of clusters is obtained that captures the major distribution patterns in the data. There might, however, remain minor and localized inaccuracies, which can be handled by an optional phase 4. In phase 4 the centroids of the clusters produced in phase 3 are used as seeds, and the data points are redistributed to their closest seeds to obtain a new set of clusters. Phase 4 also provides the option of discarding outliers: a point that is too far from its closest seed can be treated as an outlier.
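The following sketch mirrors phases 3 and 4 on top of scikit-learn's Birch (assuming scikit-learn is installed): the CF tree is built without global clustering, the subcluster centroids are clustered agglomeratively, and the original points are then reassigned to the nearest seed. The dataset and parameter values are made up for illustration:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, Birch
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances_argmin

X, _ = make_blobs(n_samples=2000, centers=4, random_state=0)

# Phase 1: build the CF tree only (no global clustering yet)
tree = Birch(threshold=0.7, n_clusters=None).fit(X)
subcluster_centers = tree.subcluster_centers_

# Phase 3: agglomerative clustering over the subcluster centroids
global_labels = AgglomerativeClustering(n_clusters=4).fit_predict(subcluster_centers)

# Phase 4 (optional refinement): use the cluster centroids as seeds and
# reassign every original point to its nearest seed
seeds = np.vstack([subcluster_centers[global_labels == k].mean(axis=0)
                   for k in range(4)])
final_labels = pairwise_distances_argmin(X, seeds)
print(np.bincount(final_labels))
```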
DBSCAN :
DBSCAN (density-based spatial clustering of applications with noise) is a non-parametric, density-based clustering algorithm: given a set of points in some space, it groups together points that are closely packed (points with many nearby neighbors) and marks as outliers points that lie alone in low-density regions (whose nearest neighbors are too far away). DBSCAN requires two parameters: ε (eps) and the minimum number of points required to form a dense region (minPts). It starts with an arbitrary point that has not been visited. This point's ε-neighborhood is retrieved, and if it contains sufficiently many points, a cluster is started; otherwise, the point is labeled as noise. Note that this point might later be found in a sufficiently sized ε-neighborhood of a different point and hence be made part of a cluster.
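In practice the two parameters are the only knobs to set, as in this minimal sketch with scikit-learn's DBSCAN (assuming scikit-learn is installed; eps and min_samples, scikit-learn's name for minPts, are illustrative and data-dependent):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

# labels_ assigns -1 to points classified as noise
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
print("clusters:", n_clusters)
print("noise points:", int(np.sum(db.labels_ == -1)))
```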
If a point is found to be a dense part of a cluster, its ε-neighborhood is also part of that cluster. Hence, all points that are found within the ε-neighborhood are added, as is their own ε-neighborhood when they are also dense. This process continues until the density-connected cluster is completely found. Then, a new unvisited point is retrieved and processed, leading to the discovery of a further cluster or noise.
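The expansion loop described above can be written compactly. The sketch below is a from-scratch illustration, not a library implementation; the function names and the brute-force neighborhood search are ours, and a real implementation would use a spatial index for the ε-queries:

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Label points 0..k-1 by cluster, -1 for noise; -2 marks 'unvisited'."""
    X = np.asarray(X, dtype=float)
    labels = np.full(len(X), -2)
    cluster_id = -1

    def region_query(i):
        # indices of all points within eps of point i (brute force)
        return np.where(np.linalg.norm(X - X[i], axis=1) <= eps)[0]

    for i in range(len(X)):
        if labels[i] != -2:
            continue
        neighbors = region_query(i)
        if len(neighbors) < min_pts:
            labels[i] = -1                 # provisionally noise; may be claimed later
            continue
        cluster_id += 1
        labels[i] = cluster_id
        queue = list(neighbors)
        while queue:                       # grow the density-connected cluster
            j = queue.pop()
            if labels[j] == -1:            # former noise point reachable from a core point
                labels[j] = cluster_id
            if labels[j] != -2:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(j)
            if len(j_neighbors) >= min_pts:  # j is itself a core point: expand further
                queue.extend(j_neighbors)
    return labels

print(dbscan([[0, 0], [0, 1], [1, 0], [10, 10]], eps=1.5, min_pts=3))
```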