Piyush Bagani

# K-Means Clustering

## Introduction

With tons of data being generated every millisecond, it’s no surprise that most of this data is unlabeled. But that’s okay, because there are different techniques available to make do with unlabeled datasets. In fact, there’s an entire domain of Machine Learning called “Unsupervised Learning” that deals with unlabeled data.

## The Concept

Imagine you’re opening a small toy store. You have a stack of different toys and 3 shelves. Your goal is to place similar toys on the same shelf. To do that, you pick up 3 toys, one for each shelf, to set a theme for every shelf. These toys now dictate which of the remaining toys go on which shelf.

Every time you pick a new toy up from the stack, you would compare it with those first 3 toys, and place this new toy on the shelf that has similar toys. You would repeat this process until all the toys have been placed.
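The comparison step can be sketched in a few lines of Python. This is only an illustration: the two numeric features per toy and the helper name `nearest_centroid` are assumptions made up for the example, and "similar" is taken to mean the smallest Euclidean distance.

```python
import math

def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to `point` (Euclidean distance)."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))

# Hypothetical 2-D features for each toy (values invented for illustration).
shelf_themes = [(1.0, 1.0), (5.0, 5.0), (9.0, 1.0)]  # the 3 "theme" toys
new_toys = [(1.5, 0.5), (4.0, 6.0), (8.0, 2.0), (5.5, 4.5)]

# Place every new toy on the shelf whose theme toy it resembles most.
shelves = [nearest_centroid(toy, shelf_themes) for toy in new_toys]
print(shelves)  # [0, 1, 2, 1]
```

Each number in the result is the shelf index chosen for the corresponding toy.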

Once you’re done, you might notice that changing the number of shelves, or picking different initial toys for those shelves (changing the theme for each shelf), could improve how well the toys are grouped. So you repeat the process in the hope of a better outcome.
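To compare two attempts, you need a number that says how well the toys are grouped. One common choice is the within-cluster sum of squares (the sum of squared distances from each point to the mean of its group), where lower means tighter groups. The sketch below assumes that measure; the function names and the data are invented for illustration. Keep in mind that this score tends to shrink as you add more shelves, so it is most useful for comparing attempts with the same number of shelves.

```python
def centroid(points):
    """Mean of a list of equal-length points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def wcss(points, labels):
    """Sum of squared distances from each point to the mean of its cluster."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    total = 0.0
    for members in clusters.values():
        c = centroid(members)
        total += sum(sum((a - b) ** 2 for a, b in zip(p, c)) for p in members)
    return total

# Hypothetical toy features and two candidate groupings of the same toys.
points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
good = [0, 0, 1, 1]  # nearby toys share a shelf
bad = [0, 1, 0, 1]   # distant toys share a shelf

print(wcss(points, good), wcss(points, bad))  # 1.0 99.0
```

The grouping with the lower score is the one where the toys on each shelf are more alike.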

## K-means Algorithm

K-means clustering is a good place to start exploring an unlabeled dataset. The K in K-means denotes the number of clusters. The K-means algorithm is an iterative algorithm that tries to partition the dataset into K non-overlapping subgroups (clusters), so that each data point belongs to exactly one cluster.

It has 4 basic steps:

1. Initialize the centroids by shuffling the dataset and then randomly selecting K data points as centroids, without replacement.
2. Assign each data point to the cluster whose centroid is nearest.
3. Update each cluster centroid to the mean of the data points assigned to it.
4. Repeat steps 2–3 until the stopping condition is met, for example when the assignments no longer change.
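The steps above can be sketched in plain Python as follows. This is a minimal illustration, not a production implementation: it assumes Euclidean distance, stops when the assignments no longer change (or after `max_iters` rounds), and keeps the old centroid if a cluster ends up empty. The function name `kmeans`, its parameters, and the sample data are all made up for the example.

```python
import math
import random

def kmeans(points, k, max_iters=100, seed=0):
    """Plain K-means on a list of equal-length tuples. Returns (centroids, labels)."""
    rng = random.Random(seed)
    # Step 1: pick K data points at random, without replacement, as centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iters):
        # Step 2: assign each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        # Step 4 (stopping condition): the assignments did not change.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels

# Hypothetical data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(centroids)
print(labels)
```

On this data the first three points should share one label and the last three the other. Which group is labeled 0 depends on the random initialization, so treat the labels as arbitrary names rather than meaningful categories.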

That was a brief explanation of the K-means clustering algorithm. Now let's have a look at some of its applications:

## Applications

### Classifying network traffic

Imagine you want to understand the different types of traffic coming to your website. You are particularly interested in understanding which traffic is spam or coming from bots.

What the problem is: As your application offers more services, or as your website grows, it is important to know where the traffic is coming from. For example, you want to be able to block harmful traffic and double down on the areas driving growth. However, it is hard to tell which is which when classifying the traffic.

How clustering works: K-means clustering is used to group traffic sources with similar characteristics. Once the clusters are created, you can classify the traffic types. With precise information on traffic sources, you are able to grow your site and plan capacity effectively.

### Fantasy League Analysis

What the problem is: Who should you have on your team? Which players are going to perform best and allow you to beat the competition? The challenge at the start of the season is that there is very little data, if any, available to help you identify the winning players.

How clustering works: When there is little performance data available to train a model on, unsupervised learning has an advantage because it does not need labeled outcomes. In this type of machine learning problem, you can find similar players using some of their characteristics. This has been done using K-means clustering. Ultimately, this means you can build a better team more quickly at the start of the year, giving you an advantage.

### Spam filter

You know the junk folder in your email inbox? It is the place where emails that the algorithm has identified as spam end up.

What the problem is: Spam emails are at best an annoying part of modern-day marketing techniques and, at worst, an example of people phishing for your personal data. To keep these emails out of your main inbox, email companies use algorithms. The purpose of these algorithms is to correctly decide whether or not an email is spam.

How clustering works: K-means clustering techniques have proven to be an effective way of identifying spam. They work by looking at the different sections of the email (header, sender, and content) and grouping emails with similar characteristics together. These groups can then be classified to identify which ones are spam. Including clustering in the classification process has been reported to improve the accuracy of the filter to 97%. This is excellent news for people who want to be sure they’re not missing out on their favorite newsletters and offers.