In this series, I will document my understanding of various machine learning concepts.
Machine learning has been defined in different ways. Here are some definitions.
Machine Learning is the science (and art) of programming computers so they can learn from data.—Aurélien Géron
Machine Learning is the field of study that gives computers the ability to learn without being explicitly programmed.—Arthur Samuel, 1959
A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.—Tom Mitchell, 1997
In supervised learning, we use labeled data, meaning the training set includes the desired solutions. The dataset is used to produce a model that takes feature vectors x (also known as predictors) as input and outputs the desired solution y.
The most common supervised learning tasks are regression (predicting values) and classification (predicting classes).
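As a toy illustration of the supervised setting, here is a regression example in plain Python: a line is fitted to labeled (x, y) pairs by least squares. The dataset and numbers are made up for illustration, not taken from any real problem.

```python
# A minimal sketch of supervised regression, assuming toy labeled data.

def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b from labeled training pairs."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return a, b

# Labeled training data: each feature x comes with its solution y.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.1, 5.9, 8.0]

a, b = fit_line(xs, ys)
print(a * 5.0 + b)  # predict y for an unseen x = 5.0
```

The learned slope and intercept are the model; prediction on a new x is just applying them.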
In unsupervised learning, we use unlabeled datasets and the algorithm learns to identify patterns by itself.
Common unsupervised learning tasks include clustering, anomaly detection and novelty detection, association rule learning, and visualization and dimensionality reduction.
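A minimal clustering sketch, assuming one-dimensional toy data and two clusters: k-means alternates between assigning points to their nearest center and moving each center to the mean of its group, with no labels involved.

```python
# A toy k-means sketch (unsupervised: no labels). Data and starting
# centers are illustrative assumptions.

def kmeans_1d(points, centers, iters=10):
    for _ in range(iters):
        # Assignment step: attach each point to its nearest center.
        groups = [[] for _ in centers]
        for p in points:
            j = min(range(len(centers)), key=lambda j: abs(p - centers[j]))
            groups[j].append(p)
        # Update step: move each center to the mean of its group.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers

points = [1.0, 1.2, 0.8, 8.0, 8.3, 7.9]
centers_out = kmeans_1d(points, centers=[0.0, 5.0])
print(centers_out)  # two cluster centers emerge from the unlabeled data
```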
In semi-supervised learning, we use both labeled and unlabeled data, usually with many more unlabeled samples. It combines supervised and unsupervised techniques.
In reinforcement learning, a learning system (the agent) observes its environment, selects and performs actions,
and receives rewards or penalties depending on the action chosen. Over time the machine learns the best strategy, called a policy, for choosing an action in a given situation.
Reinforcement learning is mainly used for sequential decision-making problems with a long-term goal, for instance playing games.
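The reinforcement-learning loop can be sketched with a toy two-action "environment" (an illustrative assumption, not from this post): the agent tries actions, receives rewards, and keeps a running estimate of each action's value, gradually settling on the better choice.

```python
import random

# A minimal sketch of the agent-environment reward loop.
# The hidden reward table and the numbers are illustrative.

random.seed(0)
rewards = {"left": 1.0, "right": 5.0}   # hidden from the agent
values = {"left": 0.0, "right": 0.0}    # the agent's learned estimates
counts = {"left": 0, "right": 0}

for _ in range(200):
    # Explore occasionally; otherwise exploit the best-known action.
    if random.random() < 0.1:
        action = random.choice(["left", "right"])
    else:
        action = max(values, key=values.get)
    r = rewards[action]
    counts[action] += 1
    values[action] += (r - values[action]) / counts[action]  # running mean

print(max(values, key=values.get))  # the strategy the agent converged on
```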
In batch learning, the system does not learn incrementally: it must be trained using all the available data.
This is typically done offline: the system is trained, then deployed, and runs without learning anymore, applying only what it learned during training. This is also called offline learning.
For the system to learn from new data, we must train a new model from scratch on the full dataset and then replace the old model with the new one.
This can be time-consuming and expensive, and it is ineffective for problems that require frequent updates, such as predicting stock prices.
In online learning, we train the system incrementally, feeding it data instances individually or in small groups called mini-batches, so each learning step is fast and cheap.
This form of learning is well suited to systems that receive data as a continuous flow and need to adapt rapidly.
The learning rate, that is, how fast the learning system should adapt to changing data, is a critical parameter in online learning.
A high learning rate causes the system to adapt rapidly to new data, but it will also tend to quickly forget the old data.
A low learning rate causes the system to learn more slowly, but it will also be less sensitive to noise or outliers in the new data.
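Online learning can be sketched with stochastic gradient descent on a streaming linear model; `eta` below is the learning rate. The data stream, which follows y = 2x + 1 exactly, is an illustrative assumption.

```python
# A sketch of online learning: the model updates incrementally from each
# incoming example via one gradient step. All numbers are illustrative.

def sgd_step(a, b, x, y, eta):
    """One stochastic-gradient update of y_hat = a*x + b on a single example."""
    err = (a * x + b) - y
    return a - eta * err * x, b - eta * err

a, b = 0.0, 0.0
stream = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)] * 200  # y = 2x + 1
for x, y in stream:
    a, b = sgd_step(a, b, x, y, eta=0.02)  # eta is the learning rate

print(a, b)  # approaches a ≈ 2, b ≈ 1
```

Each step touches only one example, so learning stays fast and cheap; raising `eta` speeds adaptation at the cost of stability.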
After training, the system should be able to generalize to examples it has never seen before. There are two main approaches to generalization.
In instance-based learning, the system learns the examples, then generalizes to new cases by comparing them to the learned examples using a similarity measure.
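A minimal instance-based learner, assuming toy one-dimensional data: a 1-nearest-neighbor classifier whose similarity measure is simple distance.

```python
# Instance-based learning sketch: classify a new case by finding the most
# similar stored example. The toy data and labels are illustrative.

def predict_1nn(train, x):
    """Return the label of the stored example most similar to x."""
    nearest = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest[1]

train = [(1.0, "small"), (1.5, "small"), (8.0, "large"), (9.0, "large")]
p1 = predict_1nn(train, 2.0)
p2 = predict_1nn(train, 7.5)
print(p1, p2)  # "small" "large"
```

Nothing is "fitted" here: the training examples themselves are the model, which is the defining trait of instance-based learning.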
Model-based learning involves building a model from the dataset and then using that model to make predictions. It is the typical way to do machine learning projects.
The steps involved in model-based learning are:
1. Obtain and analyze the data.
2. Select a model.
3. Train the model using the training data.
You can define a utility function (or fitness function) that measures how good your model is, or a cost function that measures how bad it is.
In linear regression problems, a cost function is typically used that measures the distance between the linear model's predictions and the training examples; the objective of training is to minimize this distance.
4. Apply the model to make predictions on new cases (this is called inference) to test whether the model generalizes well.
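The cost function from step 3 can be sketched as mean squared error: the average squared distance between a candidate line's predictions and the training examples. The data and candidate parameters below are illustrative assumptions.

```python
# A sketch of a cost function for linear regression: mean squared error.
# Training aims to find the (a, b) that minimize this quantity.

def mse(a, b, data):
    """Mean squared distance between predictions a*x + b and targets y."""
    return sum(((a * x + b) - y) ** 2 for x, y in data) / len(data)

data = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]  # generated by y = 2x + 1
print(mse(2.0, 1.0, data))  # the true line has zero cost
print(mse(1.0, 0.0, data))  # a worse candidate has a higher cost
```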
The main challenges of machine learning include:
1. Insufficient training data.
2. Non-representative training data, leading to sampling noise if the training set is too small, or sampling bias if the sampling method is flawed.
3. Poor-quality data: data with outliers, missing values, and other errors.
4. Irrelevant features. Use feature engineering to come up with a good set of features; it involves feature selection, feature extraction, and creating new features.
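As a small, hypothetical example of creating a new feature, two raw columns are combined into a ratio that is often more informative than either one alone. The column names and numbers are assumptions for illustration.

```python
# Feature engineering sketch: derive a new feature from raw columns.
# "rooms" and "households" are illustrative, made-up columns.

rooms = [6, 8, 10]
households = [3, 2, 5]

# New feature: rooms per household may be more predictive than either raw count.
rooms_per_household = [r / h for r, h in zip(rooms, households)]
print(rooms_per_household)  # [2.0, 4.0, 2.0]
```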
5. Overfitting the training data (the model overgeneralizes from it).
The model performs well on the training data, but it does not generalize well on new data.
It occurs when the model is too complex relative to the amount and noisiness of the training data.
It can be resolved by gathering more training data, reducing the noise in the training data, or simplifying the model (for example, selecting one with fewer parameters).
Regularization, constraining a model to make it simpler, can also be used to avoid overfitting.
The amount of regularization is controlled by a hyperparameter, which is a parameter of the learning algorithm, not of the model.
A hyperparameter is set before training and remains constant during training.
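A sketch of how a regularization hyperparameter enters the cost, assuming a single-parameter model a with an L2 (ridge-style) penalty; alpha is fixed before training and does not change during it.

```python
# Regularization sketch: the cost is the data-fit term plus a penalty on
# the parameter's size, weighted by the hyperparameter alpha. Illustrative.

def ridge_cost(a, data, alpha):
    mse = sum((a * x - y) ** 2 for x, y in data) / len(data)
    return mse + alpha * a ** 2  # larger alpha constrains a more strongly

data = [(1.0, 2.0), (2.0, 4.0)]  # generated by y = 2x
print(ridge_cost(2.0, data, alpha=0.0))  # unregularized: perfect fit, cost 0
print(ridge_cost(2.0, data, alpha=1.0))  # same fit, but penalized for |a|
```

Minimizing the penalized cost pulls a toward smaller values, trading a little training accuracy for a simpler model.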
6. Underfitting the training data.
It occurs when the model is too simple to learn the underlying structure of the data.
Underfitting can be resolved by selecting a more powerful model with more parameters, feeding better features to the learning algorithm, or reducing the constraints on the model.
We split the data into training and test sets to determine the model's performance.
The generalization error is the error rate on new cases that the model has never seen before.
If the generalization error is high and the training error is low, then the model is overfitting the training data.
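The train/test split can be sketched in plain Python; the 80/20 ratio is a common convention assumed here, and the integers stand in for labeled examples.

```python
import random

# A sketch of splitting data into training and test sets; the held-out
# test set estimates the generalization error on unseen cases.

random.seed(42)
data = list(range(100))          # stand-in for 100 labeled examples
random.shuffle(data)             # shuffle before splitting
split = int(len(data) * 0.8)     # 80/20 split, a common convention
train_set, test_set = data[:split], data[split:]
print(len(train_set), len(test_set))  # 80 20
```

The model is trained only on `train_set`; its error on `test_set` approximates the generalization error.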
We can also split off part of the training data to obtain a validation set.
We will discuss hyperparameters and cross-validation at greater length in later posts.
References:
1. Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow by Aurélien Géron.
2. The Hundred-Page Machine Learning Book by Andriy Burkov.