Priscila Gutierres

PCA & K-Means clustering made simple (with the math behind it) using iris dataset

PCA

The excellent tutorial by Jonathon Shlens explains the basis of PCA and derives its linear algebra motivation. PCA comes from the Spectral Theorem, of which the SVD theorem is a consequence. One consequence of this choice is that, when using this algorithm, we assume that our problem is linear. The final decomposition yields an orthogonal basis.
The theorem is stated below:

[Image: statement of the theorem]
Because the SVD is a linear decomposition, PCA is best suited to data that is well modeled by linear relationships.
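For reference, here is one standard way to write the SVD and its link to the Spectral Theorem. The notation is an assumption on my part and may differ from the original image: X is the mean-centered data matrix with n samples as rows and p features as columns.

```latex
% Assumed notation: X is the centered n x p data matrix.
% U (n x n) and V (p x p) are orthogonal; \Sigma (n x p) is zero except for
% the singular values \sigma_1 \ge \sigma_2 \ge \dots \ge 0 on its diagonal.
X = U \Sigma V^{\top}, \qquad U^{\top} U = I_n, \qquad V^{\top} V = I_p

% The sample covariance matrix is symmetric, so the Spectral Theorem applies,
% and substituting the SVD gives its eigendecomposition directly:
C = \frac{1}{n-1} X^{\top} X
  = V \left( \frac{\Sigma^{\top} \Sigma}{n-1} \right) V^{\top}
```

Read this way, the columns of V are the orthonormal principal directions, and the variance along the i-th direction is sigma_i^2 / (n - 1).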

Another assumption is that the directions with larger variance carry the important structure of our problem, while the directions with lower variance merely represent some kind of noise. This is not always true.
There are scenarios where the lower variance components might encapsulate subtle but crucial patterns or nuances within the data. These could be related to rare events, outliers, unbalanced data or complex interactions that are not readily apparent in the high-variance components. Neglecting these lower variance components can lead to a loss of valuable information and potentially inaccurate or incomplete insights.

It's crucial to approach the interpretation of PCA results with a degree of skepticism and not automatically dismiss the lower variance components as irrelevant. A thorough examination of their potential significance is often advisable, particularly in domains where subtle details matter or where the data exhibits complex, non-linear relationships.
PCA offers a powerful tool for dimensionality reduction and data exploration, but it's important to be mindful of its inherent assumptions and limitations.
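To make the variance discussion concrete, here is a minimal sketch of PCA on the iris features computed directly from the SVD with NumPy. It is an illustration under my own assumptions (variable names, centering without scaling, keeping two components), not necessarily what the notebook does.

```python
import numpy as np
from sklearn.datasets import load_iris

# Load the 150 x 4 iris feature matrix and center each column.
X = load_iris().data
Xc = X - X.mean(axis=0)

# Thin SVD: the rows of Vt are the orthonormal principal directions.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Variance captured by each component, and its share of the total.
explained_variance = S**2 / (Xc.shape[0] - 1)
explained_ratio = explained_variance / explained_variance.sum()

# Project the data onto the first two components (150 x 2 scores).
scores = Xc @ Vt[:2].T

print("explained variance ratio:", np.round(explained_ratio, 3))
print("basis is orthonormal:", np.allclose(Vt @ Vt.T, np.eye(4)))
```

The printed ratios show how much of the total variance each component carries. Whether the small ones can be discarded safely is exactly the judgment call discussed above.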

K-Means Clustering

Next up is K-Means, a clustering algorithm that acts as a kind of "flower sorting machine": it automatically groups the given flowers based on their similarities and helps us discover patterns we might have missed.

K-Means is one of many clustering techniques. It partitions a dataset into k distinct groups by minimizing the sum of the squared distances between the data points and the center of the group each point belongs to. These centers are called centroids. We may say that a successful clustering results in groups of points that are related to each other in our dataset.
To summarize the algorithm: we normalize the data points, create k empty clusters, and associate a randomly chosen centroid with each cluster.
Then we assign each point to the cluster whose centroid is closest and recompute each centroid as the mean of the points assigned to its cluster.
These two steps are repeated until the centroids no longer move or a predefined maximum number of iterations is reached.
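The steps above can be sketched in a few lines of NumPy. This is an illustrative implementation, not the code from the notebook: the function name `kmeans`, the seed, and the choice to pick the initial centroids from randomly selected data points are my assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris


def kmeans(X, k, max_iter=100, seed=0):
    """Plain K-Means (hypothetical helper); returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points picked at random as centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving
        centroids = new_centroids
    return labels, centroids


X = load_iris().data
# Normalize each feature to zero mean and unit variance before clustering.
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)
labels, centroids = kmeans(X_norm, k=3)
print("cluster sizes:", np.bincount(labels, minlength=3))
```

Because the starting centroids are random, a different seed can lead to a different partition. Library implementations usually run several initializations and keep the best one.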

Cluster analysis aims to identify groups of similar objects in order to discover how patterns are distributed in a dataset.
In this process there are no predefined classes and no examples indicating which relationships should hold among the data being mined.

Note that here we are not using a method like Elbow to estimate the number of clusters, because we know in advance that there are 3 different classes of flowers.
In many real scenarios this information is not available. In those cases, techniques such as the Elbow method, or cluster evaluation metrics such as the silhouette coefficient, can be used to estimate a suitable number of clusters.
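When the number of clusters is unknown, a sketch like the following could be used to compare candidate values of k. It uses scikit-learn's `KMeans` and `silhouette_score`; the range of k, `n_init=10`, and `random_state=0` are arbitrary choices of mine for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Standardize the features so each one contributes equally to distances.
X = StandardScaler().fit_transform(load_iris().data)

inertias, silhouettes = {}, {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # Inertia is the within-cluster sum of squared distances (Elbow method).
    inertias[k] = model.inertia_
    # The silhouette coefficient lies in [-1, 1]; higher is better.
    silhouettes[k] = silhouette_score(X, model.labels_)
    print(f"k={k}  inertia={inertias[k]:.1f}  silhouette={silhouettes[k]:.3f}")
```

For the Elbow method, plot the inertia against k and look for the point where the curve stops dropping sharply. For the silhouette coefficient, prefer the k with the highest score. The two heuristics do not always agree with each other or with the known number of classes.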

Evaluating the results

How do we evaluate our result? To visualize the groups and their properties, we take the original features and create bar charts of their mean values.

After running K-Means and assigning each data point to a cluster, we calculate the average (mean) value of each feature within each cluster. This gives us a summary of the typical characteristics of the data points in each group.
Then we plot a bar chart in which each cluster is represented by a group of bars and each bar within a group corresponds to a specific feature. The height of a bar represents the average value of that feature for that cluster. With this chart we can start analysing our results.

If the clusters are well-separated, you should see clear differences in the bar heights between clusters for at least some of the features. This indicates that the clusters capture distinct patterns in the data.
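A possible way to build this chart with pandas is sketched below. It assumes clustering on standardized features while summarizing the original measurements; the variable names and KMeans settings are mine, not taken from the notebook.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

iris = load_iris(as_frame=True)
features = iris.data  # original, unscaled measurements in cm

# Cluster on standardized features, but summarize the original ones.
X = StandardScaler().fit_transform(features)
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# One row per cluster, one column per feature.
cluster_means = features.groupby(clusters).mean()
print(cluster_means.round(2))

# Grouped bars: one group per cluster, one bar per feature.
ax = cluster_means.plot.bar(rot=0)
ax.set_xlabel("cluster")
ax.set_ylabel("mean value (cm)")
```

In a notebook the chart renders inline; in a script, call `matplotlib.pyplot.show()` afterwards.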

[Image: bar chart of the mean feature values per cluster]

This method is most effective when the features are numerical and have meaningful interpretations.
If the features are categorical or have vastly different scales, other visualization techniques or evaluation metrics might be more appropriate.

To illustrate all the above concepts, I created this notebook using the iris dataset as a complete example.
