K-means clustering is an unsupervised machine learning algorithm used to group similar data points together based on their features. It is a partition-based clustering algorithm that aims to divide a dataset into K distinct clusters, where K is a predefined number.
The algorithm works as follows:
- Initialization: Randomly select K data points from the dataset as initial cluster centroids.
- Assignment: For each data point, calculate the distance to each centroid and assign the data point to the cluster whose centroid is nearest, typically measured by squared Euclidean distance.
- Update: Recalculate the centroids of each cluster by taking the mean of the data points assigned to that cluster.
- Repeat until convergence: Iterate the assignment and update steps until cluster assignments no longer change significantly or a predefined maximum number of iterations is reached.
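The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: it uses purely random initialization and, if a cluster ever becomes empty, simply keeps its previous centroid.

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct data points as starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Assignment: label each point with the index of its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update: move each centroid to the mean of its assigned points
        # (keep the old centroid if a cluster ends up empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Convergence: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Demo on two well-separated point clouds
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
labels, centroids = kmeans(X, 2)
```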
The final result of the K-means algorithm is a set of K cluster centroids and the assignment of each data point to one of the clusters.
K-means clustering has a few important properties and considerations:
- Number of clusters (K): The number of clusters is predefined and needs to be specified before running the algorithm. Selecting an appropriate value for K is important and can be determined using domain knowledge or through techniques like the elbow method or silhouette analysis.
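As a quick illustration of the elbow method, one can plot (or print) the within-cluster sum of squares, exposed by scikit-learn as the `inertia_` attribute, for a range of K values and look for the point where further increases in K stop paying off:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=4, random_state=0)

# Within-cluster sum of squares (inertia) for K = 1..9
inertias = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)

# The "elbow" is where the curve flattens; since the data has
# 4 generated blobs, the drop should slow sharply around K=4
for k, inertia in zip(range(1, 10), inertias):
    print(f"K={k}: inertia={inertia:.1f}")
```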
- Convergence and initialization: The algorithm may converge to a local minimum rather than the global minimum. The results can vary depending on the initial centroid positions. To mitigate this, the algorithm can be run multiple times with different initializations, and the best result can be selected based on a predefined criterion such as the lowest total within-cluster sum of squares.
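The multiple-restart strategy can be made explicit by running K-means with several independent random initializations and keeping the run with the lowest inertia. (In practice, scikit-learn's `n_init` parameter does exactly this internally; the loop below just spells it out.)

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=4, random_state=0)

# Ten independent runs, each with a single random initialization
results = [
    KMeans(n_clusters=4, init="random", n_init=1, random_state=seed).fit(X)
    for seed in range(10)
]

# Keep the run with the lowest within-cluster sum of squares
best = min(results, key=lambda km: km.inertia_)
print("Best inertia across 10 runs:", best.inertia_)
```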
- Scaling: It is often recommended to scale the features before applying K-means clustering to ensure that each feature contributes proportionally to the distance calculations.
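For example, if one feature is measured in tens of thousands (income) and another in tens (age), the larger-scale feature dominates Euclidean distances unless the data is standardized first. The feature names here are purely illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales: "income" vs. "age"
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(50000, 15000, 100), rng.normal(40, 12, 100)])

# Standardize so each feature has zero mean and unit variance,
# letting both features contribute comparably to the distances
X_scaled = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)
```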
- Cluster evaluation: Assessing the quality of the clusters generated by K-means can be subjective. Evaluation methods such as silhouette coefficient, within-cluster sum of squares, or domain-specific metrics can be used.
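The silhouette coefficient is straightforward to compute with scikit-learn's `silhouette_score`; it ranges from -1 to 1, with higher values indicating better-separated clusters, and can be compared across candidate values of K:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=200, centers=4, random_state=0)

# Average silhouette coefficient for several candidate values of K
scores = {}
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # in [-1, 1]; higher is better
    print(f"K={k}: silhouette={scores[k]:.3f}")
```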
K-means clustering is widely used in various domains, including customer segmentation, image segmentation, document clustering, and anomaly detection. However, it has limitations, such as sensitivity to initial centroids, the need to specify the number of clusters, and difficulty handling non-linear or irregularly shaped clusters. Other clustering algorithms like DBSCAN and hierarchical clustering can be considered as alternatives in such cases.
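To see where K-means struggles and DBSCAN helps, consider the classic two-moons dataset: the clusters are non-convex, so K-means (which produces convex, centroid-based partitions) cuts across them, while DBSCAN groups points by density and follows the curved shapes. The `eps` and `min_samples` values below are reasonable choices for this particular synthetic dataset, not universal defaults.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons

# Two interleaving half-circles: non-convex clusters
X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

# DBSCAN connects dense regions, so it can follow the curved shapes
db_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

# K-means partitions space around two centroids and slices both moons
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN labels noise points as -1; count the actual clusters found
print("DBSCAN clusters found:", len(set(db_labels) - {-1}))
```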
```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

# Generate sample data
X, _ = make_blobs(n_samples=200, centers=4, random_state=0)

# Create a K-means clustering model
kmeans = KMeans(n_clusters=4)

# Fit the model to the data
kmeans.fit(X)

# Get the cluster labels and centroids
labels = kmeans.labels_
centroids = kmeans.cluster_centers_

# Plot the data points and cluster centroids
plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.scatter(centroids[:, 0], centroids[:, 1], marker='x', color='red')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('K-means Clustering')
plt.show()
```
In this example, we first generate sample data using the `make_blobs` function from scikit-learn, which creates synthetic data with multiple clusters.

We then create an instance of the `KMeans` class and specify the number of clusters (`n_clusters`) as 4, meaning we want the algorithm to group the data into 4 clusters.

Next, we fit the K-means model to the data by calling the `fit` method. This assigns each data point to one of the clusters based on its proximity to the cluster centroids.

After fitting the model, we obtain the cluster labels for each data point from the `labels_` attribute, and the coordinates of the cluster centroids from the `cluster_centers_` attribute.
Finally, we plot the data points with different colors representing the assigned clusters, and we plot the cluster centroids as red crosses.
Remember to have scikit-learn and matplotlib installed (`pip install scikit-learn matplotlib`) before running this code.
K-means clustering is an iterative algorithm that seeks to minimize the sum of squared distances between data points and their cluster centroids. It may converge to a local minimum, so the results can vary depending on the initial configuration. It is important to choose an appropriate value for K and be cautious when interpreting the results.