
Wednesday, March 3, 2021

What is Clustering:

Clustering is an unsupervised classification technique in machine learning. It is the process of partitioning a set of data points (or objects) into meaningful sub-classes, called clusters, when there is no prior knowledge about the sub-classes. It does not require any labelled training data, but it does require a method or rule for calculating the similarity or closeness used to partition the data into sub-classes. Clustering helps to reveal the natural grouping or structure in a dataset.



Typical applications of Clustering Algorithms:

• As a stand-alone tool to get insight into data distribution
• As a preprocessing step for other algorithms

Characteristics of a good cluster:

• Intra-class similarity is high (points within a cluster are similar).
• Inter-class similarity is low (points in different clusters are dissimilar).
• The quality of a clustering result depends on both the similarity measure used and its implementation (one way to quantify this is sketched below).
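
These two properties can be checked numerically. One common choice is scikit-learn's silhouette score, which combines intra-cluster cohesion and inter-cluster separation into a single value between -1 and 1. A minimal sketch; the dataset and labels here are illustrative assumptions, not from the original post:

# silhouette score: quantifies intra-class vs inter-class similarity
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# toy dataset and cluster labels, just to illustrate the measure
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# close to 1: tight, well-separated clusters; close to -1: poor clustering
print(silhouette_score(X, labels))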


Requirements for good clustering algorithms:

• Deal with different types of attributes, such as numeric and categorical.
• Discover clusters of arbitrary shape.
• Determine input variables or parameters based on domain knowledge.
• Deal with noise and outliers.
• Be insensitive to the order of input records in the dataset.
• Handle high dimensionality.
• Be incremental to cope with dynamic change.

Note: Not all clustering algorithms satisfy all of these requirements perfectly.



Classification of Clustering Algorithms:

Clustering methods can be classified into different approaches:

• Partitioning algorithms
• Hierarchical algorithms
• Density-based methods


1. Partitioning algorithms:

Partitioning methods divide the data points of a dataset into sub-classes, or clusters, based on their similarity under a measure function [6]. The algorithm takes the dataset D and the number of clusters to form, k, as input, and places more similar data points in the same cluster.

If the measure function is distance-based, the algorithm minimizes the sum of squared distances of points to their cluster centers:

E = Σ_{i=1}^{k} Σ_{p ∈ Ci} dist(p, ci)²

Fig.1. Distance function for partitioning algorithms

where ci is the centroid or medoid of cluster Ci, p is a point in cluster Ci, and dist(x, y) is the Euclidean distance function; both p and ci can be multi-dimensional (a sketch of computing E directly appears after the list below). The commonly used partitioning clustering algorithms are:

• K-means clustering
• K-medoids clustering or PAM (Partitioning Around Medoids)
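
Before turning to library implementations, the objective E above is easy to compute directly. A minimal NumPy sketch, where points, labels, and centroids are hypothetical names for the data matrix, integer cluster assignments, and cluster centers:

# within-cluster sum of squared distances (the objective E above)
import numpy as np

def within_cluster_sse(points, labels, centroids):
    # for each cluster Ci, sum the squared Euclidean distances
    # from its points p to its center ci
    return sum(
        np.sum((points[labels == i] - c) ** 2)
        for i, c in enumerate(centroids)
    )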




K-means Clustering Algorithm code:

# k-means cluster library from sklearn
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

# dataset creation using make_blobs function of sklearn
x_data,y_true = make_blobs(n_samples=1500, centers=4, n_features=2, shuffle=True, random_state=0)


# Visualization of dataset
fig, ax = plt.subplots(figsize=(10, 5))
ax.set_title("dataset")
ax.scatter(x_data[:, 0], x_data[:, 1], s=50)  # unlabelled points, single color
plt.xlabel("feature 1")
plt.ylabel("feature 2")
plt.show()



kmeans = KMeans(n_clusters=4, init='k-means++', max_iter=200, n_init=20)  # build a model passing n_clusters=4, the number of clusters
kmeans.fit(x_data)  # fit the unlabelled dataset
y_kmeans = kmeans.predict(x_data)  # predict each point's cluster with the built model
center_kmeans = kmeans.cluster_centers_  # the final cluster centers
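
The figure below can be reproduced with a plausible sketch like the following, coloring points by their predicted labels and overlaying the cluster centers, reusing the variables defined above. (For k-means, kmeans.inertia_ holds the final value of the objective E defined earlier.)

# Visualization of the clustered dataset with cluster centers
fig, ax = plt.subplots(figsize=(10, 5))
ax.set_title("k-means clusters")
ax.scatter(x_data[:, 0], x_data[:, 1], c=y_kmeans, s=50, cmap='viridis')
ax.scatter(center_kmeans[:, 0], center_kmeans[:, 1], c='red', s=200, marker='X')
plt.xlabel("feature 1")
plt.ylabel("feature 2")
plt.show()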



Fig. dataset after applying a k-means clustering algorithm

Note: For a comparison between a self-implemented k-medoids clustering algorithm and sklearn's k-medoids function, check this link.



2. Hierarchical algorithms:

Hierarchical clustering, also known as hierarchical cluster analysis (HCA), partitions the given dataset into groups by building a tree of data points. Unlike partitioning clustering algorithms, it does not require the number of clusters, k, as a pre-defined input. The tree representation is usually called a dendrogram. In general, there are two types of hierarchical clustering: bottom-up (agglomerative) and top-down (divisive). Both proceed in a greedy manner, agglomerative clustering by repeatedly merging the closest clusters and divisive clustering by repeatedly splitting a cluster. A sketch using sklearn appears after the figure below.

Fig.2: Example of a dendrogram, where A, B, C, D, E, F are six data points.
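
As a minimal sketch (assuming the same x_data generated in the k-means example above), agglomerative clustering is available in sklearn, and scipy can draw the dendrogram:

# bottom-up (agglomerative) clustering on the same dataset
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage

agg = AgglomerativeClustering(n_clusters=4, linkage='ward')
y_agg = agg.fit_predict(x_data)

# build the merge tree with ward linkage and plot the dendrogram
Z = linkage(x_data, method='ward')
plt.figure(figsize=(10, 5))
dendrogram(Z, truncate_mode='lastp', p=20)  # show only the last 20 merges
plt.title("dendrogram")
plt.show()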



3. Density-based Clustering Algorithm:

Density-based clustering identifies distinctive clusters as dense regions of data points separated by sparser regions. It can discover clusters of arbitrary shape while treating low-density points as noise, which makes it applicable to noisy data with outliers.
DBSCAN is one of the most popular density-based clustering algorithms; a minimal sketch follows.
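
A minimal DBSCAN sketch on the same x_data; eps and min_samples are illustrative values that would need tuning for a real dataset:

# density-based clustering with DBSCAN
from sklearn.cluster import DBSCAN

db = DBSCAN(eps=0.5, min_samples=5)
y_db = db.fit_predict(x_data)

# DBSCAN labels noise/outlier points as -1
n_clusters = len(set(y_db)) - (1 if -1 in y_db else 0)
n_noise = list(y_db).count(-1)
print("clusters found:", n_clusters, "noise points:", n_noise)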




Summary:

1. Clustering algorithm definition

2. Clustering algorithm purpose

3. Requirements of a good clustering algorithm

4. Classification of clustering algorithms

5. Implementation of the K-means clustering algorithm using sklearn
