When creating machine learning models, there are typically two paths to choose from: supervised and unsupervised learning. Put simply, the difference between the two is whether we know the output labels ahead of time.
For example, let's say that we want to build a model that can identify pneumonia from chest x-rays. In this case, for each x-ray we feed into the model, we know beforehand whether it came from a pneumonia-positive person. Because we know the output label for each input, we would use supervised learning, which aims to learn the relationship between inputs and known outputs.
Now for a different example: hypothetically, let's say that we have data (average speed, total accidents, total tickets, etc.) on many drivers and we want to put these drivers into groups where they are most similar to each other. Here we don't have initial output labels (e.g., good driver, bad driver) and have to interpret what each group represents after the groups are formed. In this case, we would use unsupervised learning.
Let's go into more detail for each approach.
Supervised learning uses labeled data to train a model to classify inputs or predict outcomes. Because we are feeding labeled data into the model, we can evaluate and improve the model by checking its predictions against the known labels.
Supervised learning is typically applied to two types of problems: classification and regression.
A classification problem entails sorting the data into pre-determined groups. For example, one could classify whether an animal is a cat or a dog based on size, weight, etc. Some common classification algorithms include support vector machines, random forests, and gradient boosting.
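As a concrete illustration, here is a minimal sketch of classification using scikit-learn's random forest. The size and weight measurements below are invented for illustration, not real data:

```python
# Minimal classification sketch with scikit-learn (assumes scikit-learn is installed).
from sklearn.ensemble import RandomForestClassifier

# Each row: [height_cm, weight_kg]; labels: 0 = cat, 1 = dog (toy data)
X = [[23, 4.0], [25, 4.5], [22, 3.8], [55, 25.0], [60, 30.0], [50, 22.0]]
y = [0, 0, 0, 1, 1, 1]

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# Predict labels for two unseen animals
print(model.predict([[24, 4.2], [58, 27.0]]))
```

Because the labels are known, we could also hold out part of the data and measure accuracy on it, which is exactly the validity check supervised learning makes possible.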
A regression problem aims to identify the relationship between independent and dependent variables. For example, a project aiming to forecast a store's ice cream sales from number of flavors, hours open, etc. would use regression, which could be linear or nonlinear, to name a few variants. (Logistic regression, despite its name, is usually used for classification.)
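The ice cream example can be sketched with scikit-learn's linear regression. The sales figures and features below are made up purely for illustration:

```python
# Minimal linear regression sketch with scikit-learn (assumes scikit-learn is installed).
from sklearn.linear_model import LinearRegression

# Features: [number_of_flavors, hours_open]; target: daily sales (toy data)
X = [[5, 8], [10, 8], [10, 12], [15, 12], [20, 14]]
y = [120, 180, 220, 280, 360]

reg = LinearRegression()
reg.fit(X, y)

# Predict sales for a store with 12 flavors open 10 hours
print(reg.predict([[12, 10]]))
```

The fitted coefficients quantify how much each feature contributes to the prediction, which is the "relationship" regression is after.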
Unsupervised learning finds groups or patterns in unlabeled data. Because there aren't any labels, there is no direct way to verify model validity as there is with supervised learning. Common problems include clustering and dimensionality reduction.
A clustering problem aims to separate the data into distinct groups by identifying patterns or similarities between data points. For example, an online retail store may want to separate its customers into different segments. A common clustering algorithm is k-means, which assigns each point to the nearest of k cluster centroids and iteratively updates those centroids to minimize the within-cluster distances.
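A minimal k-means sketch with scikit-learn, using invented customer features (annual spend and monthly visits) in place of a real retail dataset:

```python
# Minimal k-means clustering sketch with scikit-learn (assumes scikit-learn is installed).
from sklearn.cluster import KMeans

# Each row: [annual_spend, visits_per_month] (toy data with two obvious groups)
X = [[100, 2], [120, 3], [110, 2], [900, 20], [950, 22], [880, 19]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
print(labels)  # each point's cluster assignment (0 or 1)
```

Note that k-means only outputs group numbers; deciding what each group represents (e.g., "occasional shoppers" vs. "frequent big spenders") is up to us, which is the label-inference step described above.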
Dimensionality reduction is used when there are too many features in a given dataset. It reduces the number of features while preserving as much of the original information as possible, and is typically done before the modeling stage.
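One common dimensionality reduction technique is principal component analysis (PCA); here is a minimal sketch using scikit-learn's built-in iris dataset, compressing four features down to two:

```python
# Minimal PCA sketch with scikit-learn (assumes scikit-learn is installed).
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data              # 150 samples, 4 features

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)  # project onto the top 2 principal components

print(X_reduced.shape)                       # (150, 2)
print(pca.explained_variance_ratio_.sum())   # fraction of variance retained
```

The explained variance ratio is how we check that the reduced dataset "keeps its integrity": the closer the sum is to 1, the less information was lost.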
Deciding whether to use supervised or unsupervised learning comes down to a few factors, chiefly whether your data is labeled and what kind of modeling you are trying to accomplish.