The previous blog explored how unsupervised machine-learning algorithms can extract valuable structures and information from large, complex datasets. These algorithms are powerful tools for understanding the structure and content of new or unfamiliar data. Clustering, a technique within unsupervised learning, is particularly well suited to large datasets: the common algorithms are fast in practice, with each pass over the data scaling roughly linearly with the number of observations, so a clustering can be rerun many times smoothly and efficiently.
The most popular clustering algorithm is k-means. To create a set number of clusters, the algorithm starts from several randomly chosen points in the data space; each of these points represents the center, or mean, of a cluster. An iterative process with two steps then follows.
First, each point is assigned to the cluster whose mean is nearest, which is the assignment that yields the least within-cluster sum of squares.
Then, the mean of the points assigned to each cluster becomes that cluster's new center, which causes each of the means to shift. A minimal sketch of these two steps is given below.
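To make those two steps concrete, here is a minimal NumPy sketch of a single assignment-and-update iteration. This is only an illustrative sketch (the function name lloyd_iteration is my own, and it assumes every cluster keeps at least one assigned point); it is not the scikit-learn implementation we use later.
import numpy as np
def lloyd_iteration(X, centroids):
    # Step 1: assign each point to the nearest centroid (least squared distance)
    distances = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    assignments = distances.argmin(axis=1)
    # Step 2: move each centroid to the mean of the points now assigned to it
    # (assumes no cluster ends up empty)
    new_centroids = np.array([X[assignments == k].mean(axis=0)
                              for k in range(len(centroids))])
    return assignments, new_centroids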
Over several iterations, the centroids shift to positions that minimize a performance metric, typically the "within-cluster sum of squares." For reference, a centroid is the center point of a cluster in a dataset. When this metric stops improving, observations stop being reassigned, indicating that the algorithm has converged on a (locally) optimal solution. Next, let's go over how we can implement this with scikit-learn in the code below.
# Import necessary libraries
from time import time
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.preprocessing import scale
from sklearn import metrics
from sklearn.cluster import KMeans
np.random.seed(42) # Fix the random seed for reproducibility
# Load and scale the digits dataset
digits = load_digits()
data = scale(digits.data)
# Get dataset dimensions
n_samples, n_features = data.shape
n_digits = len(np.unique(digits.target))
labels = digits.target
sample_size = 300 # Sample size for silhouette score
# Print basic dataset info
print("n_digits: %d, \t n_samples %d, \t n_features %d"
% (n_digits, n_samples, n_features))
print(79 * '_')
print('% 9s' % 'init' ' time inertia homo compl v-meas ARI AMI silhouette')
# Function to benchmark KMeans
def bench_k_means(estimator, name, data):
    t0 = time()  # Start timer
    estimator.fit(data)  # Fit model
    # Print benchmark metrics
    print('% 9s %.2fs %i %.3f %.3f %.3f %.3f %.3f %.3f'
          % (name, (time() - t0), estimator.inertia_,
             metrics.homogeneity_score(labels, estimator.labels_),
             metrics.completeness_score(labels, estimator.labels_),
             metrics.v_measure_score(labels, estimator.labels_),
             metrics.adjusted_rand_score(labels, estimator.labels_),
             metrics.adjusted_mutual_info_score(labels, estimator.labels_),
             metrics.silhouette_score(data, estimator.labels_,
                                      metric='euclidean',
                                      sample_size=sample_size)))
# Run KMeans and benchmark it
kmeans = KMeans(init='k-means++', n_clusters=n_digits, n_init=10) # KMeans model
bench_k_means(kmeans, "k-means++", data) # Call the benchmarking function
The results from running the code above are shown below. Note that I added some customization, which is why my results appear in tabular form.
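For reference, one simple way to get such a tabular view (just one possible approach, not necessarily the customization I used; the helper name k_means_metrics is made up) is to collect the metric values into a pandas DataFrame instead of printing a formatted string:
import pandas as pd
def k_means_metrics(estimator, name, data):
    # Fit the estimator and collect the same metrics as bench_k_means, as a dict
    t0 = time()
    estimator.fit(data)
    return {'init': name,
            'time (s)': time() - t0,
            'inertia': estimator.inertia_,
            'homo': metrics.homogeneity_score(labels, estimator.labels_),
            'compl': metrics.completeness_score(labels, estimator.labels_),
            'v-meas': metrics.v_measure_score(labels, estimator.labels_),
            'ARI': metrics.adjusted_rand_score(labels, estimator.labels_),
            'AMI': metrics.adjusted_mutual_info_score(labels, estimator.labels_),
            'silhouette': metrics.silhouette_score(data, estimator.labels_,
                                                   metric='euclidean',
                                                   sample_size=sample_size)}
results = pd.DataFrame([k_means_metrics(
    KMeans(init='k-means++', n_clusters=n_digits, n_init=10), "k-means++", data)])
print(results)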
With the Scikit-Learn package, we can use the k-means++ initialization, which improves on the original k-means algorithm in both running time and the likelihood of avoiding a poor clustering. As we have discussed, it achieves this by running an initialization procedure that spreads the starting centroids out so that they already approximate a low within-cluster variance. In the code, we use several key metrics to measure how the k-means application is performing, and we will go through what each of them means in a moment. Success for a clustering algorithm means producing a grouping of the input data that balances several factors, including class separation, in-group similarity, and cross-group difference.
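To see what k-means++ buys us, one possible extension (my own addition, not part of the benchmark above) is to call the same bench_k_means function with other initialization schemes, for example purely random starting points or centroids seeded from the top principal components, as scikit-learn's digits example does:
from sklearn.decomposition import PCA
# Plain random initialization, best of 10 restarts
bench_k_means(KMeans(init='random', n_clusters=n_digits, n_init=10),
              "random", data)
# Deterministic initialization from the first principal components
pca = PCA(n_components=n_digits).fit(data)
bench_k_means(KMeans(init=pca.components_, n_clusters=n_digits, n_init=1),
              "PCA-based", data)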
Homogeneity score: This is a simple, zero-to-one bounded measure of the degree to which each cluster contains only members of a single class. A score of one indicates that every cluster contains measurements from a single class. This metric is complemented by the completeness score, which measures (again between zero and one) the extent to which all members of a given class are assigned to the same cluster. In our case, we have a homogeneity score of 0.673, which shows that our produced clustering is moderately good.
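A tiny made-up example helps separate the two scores: in the labeling below every cluster is pure, so homogeneity is perfect, but class 0 is split across two clusters, so completeness is not.
from sklearn import metrics
true_labels = [0, 0, 0, 0, 1, 1]   # hypothetical ground-truth classes
cluster_ids = [0, 0, 1, 1, 2, 2]   # hypothetical cluster assignments
print(metrics.homogeneity_score(true_labels, cluster_ids))   # 1.0: each cluster holds one class
print(metrics.completeness_score(true_labels, cluster_ids))  # below 1.0: class 0 is split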
The validity measure (V-measure): This is the harmonic mean of the homogeneity and completeness scores, exactly analogous to the F-measure for binary classification. In other words, it provides a single value between 0 and 1 that tracks both homogeneity and completeness.
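Because the V-measure is just the harmonic mean of the two scores, we can verify the relationship directly, reusing the hypothetical labels from the previous snippet:
h = metrics.homogeneity_score(true_labels, cluster_ids)
c = metrics.completeness_score(true_labels, cluster_ids)
v = metrics.v_measure_score(true_labels, cluster_ids)
print(v, 2 * h * c / (h + c))  # the two printed values should match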
The Adjusted Rand Index (ARI): This metric tracks the consensus between two sets of assignments: the true, pre-existing labels of the observations and the labels predicted by the clustering algorithm. When no labeled data is available, one option for measuring the performance of a k-means solution is the silhouette coefficient, which measures how well defined the clusters within a model are.
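The short example below, again on made-up data, shows both ideas: the ARI cares only about the grouping (relabeling the clusters does not change it), while the silhouette coefficient needs nothing beyond the data and the predicted labels.
import numpy as np
from sklearn import metrics
labels_true = [0, 0, 1, 1]
labels_pred = [1, 1, 0, 0]  # same grouping, cluster ids swapped
print(metrics.adjusted_rand_score(labels_true, labels_pred))  # 1.0: perfect agreement
X = np.array([[0.0, 0.0], [0.1, 0.1], [5.0, 5.0], [5.1, 5.1]])
print(metrics.silhouette_score(X, labels_pred, metric='euclidean'))  # near 1 for well-separated clusters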
The silhouette score is 0.139, which is fairly low, but this does not surprise us: the handwritten digits data we are using is inherently noisy, and the classes tend to overlap. The other metrics are not especially impressive either. The V-measure of 0.692 is reasonable, but it is held back by the modest homogeneity score, which suggests that the cluster centroids did not resolve perfectly. The ARI of 0.561 is also not great. In the next blog, we will talk about how to tune our clustering configuration and how to use PCA again to reduce the dimensionality of our dataset for a better analysis. Thank you for reading thus far!
Sincerely, Frosthash
Education, Execution, and Consistency