Unsupervised Machine Learning- Part One

Unsupervised learning techniques are valuable tools for exploratory analysis. They bring out patterns and structure within datasets, yielding insights that may be valuable in themselves or serve as a guide to further study.

We will begin by reviewing Principal Component Analysis (PCA), a fundamental data manipulation technique with a range of dimensionality reduction applications. For those who need a refresher, dimensionality reduction techniques are essential for managing and analyzing high-dimensional datasets.

Principal Component Analysis (PCA)

Purpose- PCA is a powerful decomposition technique: it allows one to break down a highly multivariate dataset into a set of statistically uncorrelated components. PCA works by successively identifying the axes of greatest variance in a dataset. The steps are as follows:

  1. Identifying the center point of the dataset.

  2. Calculating the covariance matrix of the data.

  3. Calculating the eigenvectors of the covariance matrix.

  4. Orthonormalizing the eigenvectors.

  5. Calculating the proportion of variance represented by each eigenvector.

There are a lot of big words here, so it is important to unpack the following concepts:

Covariance: This is a measure of how two or more variables vary together. While a single value can capture the variance of one variable, a 2 by 2 matrix is needed to capture the covariances among two variables, a 3 by 3 matrix for three variables, and so on.
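As a quick illustration with made-up numbers, NumPy's np.cov returns exactly this kind of matrix for a pair of variables:

import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 5.0])

# 2 by 2 covariance matrix: the diagonal holds the variance of x and of y,
# and the off-diagonal entries hold the covariance between them
print(np.cov(x, y))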

Eigenvector: This is a vector that is specific to a dataset and a linear transformation: a vector whose direction is unchanged when the transformation is applied, though its length may be scaled. Think of stretching a rubber band held straight between your hands: the eigenvector is the direction along the band, which stays the same before and during the stretch.
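As a quick numerical check (the matrix below is an arbitrary example, not taken from any dataset), applying a transformation to one of its eigenvectors only rescales it:

import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 3.0]])              # a simple stretching transformation

eigvals, eigvecs = np.linalg.eig(A)
v = eigvecs[:, 0]                        # first eigenvector

# A @ v points in the same direction as v; it is only scaled by the eigenvalue
print(A @ v, eigvals[0] * v)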

Orthogonalization: This is the process of making a set of vectors orthogonal (at right angles) to one another.

Orthonormalization: This is an orthogonalization process that also normalizes each vector to unit length.
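As a minimal sketch of what this produces, NumPy's QR decomposition can be used to orthonormalize a small set of arbitrary vectors:

import numpy as np

vectors = np.array([[3.0, 1.0],
                    [2.0, 2.0]])         # two arbitrary column vectors

# The columns of Q are orthogonal to one another and have unit length
Q, _ = np.linalg.qr(vectors)
print(Q)
print(Q.T @ Q)                           # approximately the identity matrix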

Eigenvalue: Roughly corresponding to the length of its eigenvector, the eigenvalue is used to calculate the proportion of variance represented by each eigenvector. This is done by dividing the eigenvalue of each eigenvector by the sum of the eigenvalues of all eigenvectors.

In summary, the covariance matrix is used to calculate the eigenvectors, and an orthonormalization process produces orthogonal, normalized vectors from them. The eigenvector with the greatest eigenvalue is the first principal component, with successive components having progressively smaller eigenvalues. In this way, the PCA algorithm takes a dataset and transforms it into a new, lower-dimensional coordinate system.
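To make these steps concrete, here is a minimal NumPy sketch of the procedure described above, run on a small toy dataset. The variable names are illustrative only, and this is not the exact routine scikit-learn uses internally (which relies on a singular value decomposition), but it follows the five steps listed earlier.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # toy dataset: 100 samples, 3 features

# 1. Identify the center of the dataset and center the data on it
X_centered = X - X.mean(axis=0)

# 2. Calculate the covariance matrix of the centered data
cov = np.cov(X_centered, rowvar=False)

# 3. and 4. Eigenvectors of the covariance matrix; for a symmetric matrix,
# np.linalg.eigh already returns an orthonormal set of eigenvectors
eigvals, eigvecs = np.linalg.eigh(cov)

# Order components from largest to smallest eigenvalue
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 5. Proportion of variance represented by each eigenvector
explained = eigvals / eigvals.sum()
print(explained)

# Project the data into the new coordinate system
X_transformed = X_centered @ eigvecs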

Using PCA

Now that we have covered the high-level algorithm, we can apply PCA to a classic dataset. We will be using the UCI handwritten digits dataset bundled with scikit-learn. This dataset comprises 1797 instances of handwritten digits gathered from 43 different writers; each instance is an 8 by 8 grid of pixel intensities produced by downsampling the original 32 by 32 bitmaps.

import numpy as np
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
import matplotlib.cm as cm

# Load the digits dataset
digits = load_digits()
data = digits.data

# Get the number of samples, features, and unique labels
n_samples, n_features = data.shape
n_digits = len(np.unique(digits.target))
labels = digits.target

# Perform PCA to reduce to 10 components
pca = PCA(n_components=10)
data_r = pca.fit_transform(data)

# Print explained variance ratio
print('Explained Variance Ratio (First Two Components): %s' % str(pca.explained_variance_ratio_[:2]))
print('Sum of Explained Variance (First Two Components): %s' % str(sum(pca.explained_variance_ratio_[:2])))

# Create a scatter plot
plt.figure()
colors = cm.rainbow(np.linspace(0, 1, n_digits))  # one distinct color per digit class
for i in range(n_digits):
    # Plot each digit class in the space of the first two principal components
    plt.scatter(data_r[labels == i, 0], data_r[labels == i, 1],
                color=colors[i], alpha=0.4, label=str(i))
plt.legend()
plt.title('Scatterplot of Points in the First Two Principal Components')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.show()

Explained Variance Ratio (First Two Components): [0.14890594 0.13618771]

Sum of Explained Variance (First Two Components): 0.2850936482369929

The explained variance ratio for the first two principal components indicates that the first component captures approximately 14.89% of the total variance in the dataset, while the second component captures about 13.62%. Together, these two components explain around 28.51% of the total variance.

This suggests that, although the first two components provide some insight into the dataset, approximately 71.49% of the total variance remains unexplained, indicating that these two components may not fully capture the underlying patterns in the data. To achieve a more comprehensive understanding, it would be beneficial to examine additional principal components, as a higher cumulative explained variance (typically between 70% and 90%) is often desired for effective dimensionality reduction.
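As a rough way to gauge how many components might be needed, one can inspect the cumulative explained variance of the pca object fitted above; the 90% threshold mentioned below is just an illustrative target.

import numpy as np

# Cumulative variance explained by the 10 components fitted above
cumulative = np.cumsum(pca.explained_variance_ratio_)
print(cumulative)

# scikit-learn can also choose the number of components for a target variance,
# e.g. PCA(n_components=0.90) keeps just enough components to explain ~90%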

This plot indicates that there is some separation between classes in the first two principal components, but achieving high accuracy in classification might be challenging with this dataset. However, the classes seem to form clusters, suggesting that we could achieve decent results using clustering analysis. PCA has provided us with a better understanding of the dataset's structure, guiding our next steps. Now, let's use this insight to explore clustering with the k-means algorithm.
