This article is for current and aspiring machine learning practitioners looking to implement solutions to real-world machine learning problems. It is an introductory article suitable for beginners with no previous knowledge of machine learning or artificial intelligence (AI).
This is the first article in my series "Machine Learning with Python". I will introduce the fundamental concepts of machine learning and its applications, show how to set up our working environment, and walk through hands-on practice on a simple project.
Introduction to Machine Learning
Machine learning is a type of artificial intelligence (AI) that provides computers with the ability to learn without being explicitly programmed.
Classification of Machine Learning
At a broad level, machine learning can be classified into three types:
- Supervised learning, where the model learns from labeled examples (inputs paired with known outputs)
- Unsupervised learning, where the model finds structure, such as clusters, in unlabeled data
- Reinforcement learning, where an agent learns by trial and error, guided by rewards
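To make the difference between the first two types concrete, here is a minimal scikit-learn sketch (the four toy points and their labels are invented for this example):
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

X = [[1, 2], [2, 1], [8, 9], [9, 8]]  # four 2-D points
y = [0, 0, 1, 1]                      # labels, known only in the supervised case

# supervised: learn from (X, y) pairs, then predict labels for new points
clf = KNeighborsClassifier(n_neighbors=1).fit(X, y)
print(clf.predict([[8, 8]]))          # -> [1]

# unsupervised: no labels, the algorithm groups the points on its own
km = KMeans(n_clusters=2, n_init=10).fit(X)
print(km.labels_)                     # two clusters, e.g. [0 0 1 1] or [1 1 0 0]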
Why Python?
Python has become the lingua franca of data science, providing data scientists with a large array of general- and special-purpose functionality.
- It combines the power of general-purpose programming languages with the ease of use of domain-specific scripting languages like MATLAB or R.
- It has libraries for data loading, visualization, statistics, natural language processing, image processing, and more.
- It lets you interact with your code directly, using a terminal or tools like the Jupyter Notebook.
Importance of Machine Learning
- Rapid increase in the production of data
- Solving complex problems that are difficult for humans
- Decision making in various sectors, including finance
- Finding hidden patterns and extracting useful information from data
Applications of Machine Learning
- Self-driving cars
- Robotics
- Natural language processing
- Computer vision
- Forecasting stock market trends
- Recommendation systems
- Image and speech recognition
- Predictive analytics
- Fraud and anomaly detection
Machine Learning Life Cycle
- Data Gathering/extraction
This is the first phase of the machine learning life cycle; it is concerned with identifying and obtaining the data related to the problem.
There are many sources we can gather data from, including files, databases, the internet, and mobile devices.
This step includes the following tasks:
- Identify various data sources
- Collect data
- Integrate the data obtained from different sources
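For example, when the source is a file, the collection step can be as simple as reading it into pandas (the file name below is a placeholder):
import pandas as pd

df = pd.read_csv("measurements.csv")  # hypothetical CSV file
print(df.head())                      # peek at the first five rows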
- Data Preparation
Data preparation is the step where we put our data into a suitable place and get it ready for use in training our machine learning model.
After gathering the data, we need to prepare it so that we can use it in our project. This phase can be divided into two parts:
i). Data exploration
Data exploration is used to figure out what kind of data we are dealing with: we need to understand its features, format, and quality. Here we look for correlations, general trends, and outliers.
ii). Data Preprocessing / wrangling
Data preprocessing is the process of transforming raw data into an understandable format.
In real-world applications, collected data may have various issues, including:
- Missing Values
- Duplicate data
- Invalid data
- Noise
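As a small sketch of how some of these issues can be handled with pandas (the toy column below is made up for illustration):
import pandas as pd
import numpy as np

# toy data with a missing value and a duplicate row
df = pd.DataFrame({"value": [1.0, np.nan, 3.0, 3.0]})
df = df.drop_duplicates()                             # remove the duplicate row
df["value"] = df["value"].fillna(df["value"].mean())  # fill the missing value with the mean
print(df)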
- Data Analysis
The goal of this step is to build a machine learning model that studies the data using a variety of analytical approaches, and then to evaluate the results. It begins with identifying the problem type, followed by choosing machine learning techniques such as classification, regression, cluster analysis, or association, then building the model using the prepared data, and finally evaluating the model.
- Model Training
In this step we train our model to improve its performance and obtain a better outcome for the problem. Training is required so that the model can learn the various patterns, rules, and features in the data.
- Model Testing
Once our machine learning model has been trained, we test it: we check the correctness of the model by feeding it a test dataset.
The model's accuracy, the percentage of predictions it gets right, is then measured against the requirements of the project or problem.
- Model Evaluation and Improvement
Model evaluation is an important step in the creation of a model. It assists in determining the optimal model to represent our data and how well that model will perform in the future.
- Deployment
The last step of machine learning life cycle is deployment, where we deploy the model in the real-world system.
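As a minimal sketch of one common approach, assuming we already have a trained scikit-learn model stored in a variable named model, we can persist it with joblib and reload it inside the serving system:
from joblib import dump, load

dump(model, "model.joblib")   # save the trained model to disk
model = load("model.joblib")  # later: reload the model where it will serve predictions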
Popular Python Libraries and Tools for Machine Learning
- Jupyter Notebook
It is an interactive environment for running code in the browser.
- Numpy
NumPy is a Python library mainly used for working with arrays and for performing a wide variety of mathematical operations on them.
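For example, a couple of lines of NumPy:
import numpy as np

a = np.array([1, 2, 3, 4])
print(a * 2)     # element-wise arithmetic: [2 4 6 8]
print(a.mean())  # aggregate statistics: 2.5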
- Pandas
Pandas is a Python library for data wrangling and analysis.
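For example, with a tiny made-up DataFrame:
import pandas as pd

df = pd.DataFrame({"species": ["setosa", "virginica"], "petal_length": [1.4, 5.5]})
print(df[df["petal_length"] > 2])  # boolean filtering keeps only the virginica row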
- Matplotlib
It is the primary scientific plotting library in Python. It provides functions for making publication-quality visualizations such as line charts, histograms, scatter plots, and so on.
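For example, a minimal line chart:
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
plt.plot(x, [v ** 2 for v in x], marker='o')  # simple line chart of x squared
plt.xlabel("x")
plt.ylabel("x squared")
plt.show()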
- Scikit-learn
Scikit-learn (sklearn) is the most useful and robust library for machine learning in Python. It provides a selection of efficient tools for machine learning and statistical modeling, including classification, regression, clustering, and dimensionality reduction, via a consistent interface in Python.
Environment setup
Installing Anaconda and Python
Download and install Anaconda (Python 3.6 version), choosing the installer that matches your OS.
- Open a terminal
- Confirm conda is installed correctly by typing:
conda -V
- Confirm Python is installed correctly by typing:
python -V
- Confirm your conda environment is up to date by typing:
conda update conda
conda update anaconda
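Optionally, you can create a dedicated conda environment for this series (the name ml below is just a suggestion):
conda create -n ml numpy pandas matplotlib scikit-learn jupyter
conda activate ml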
Hands-on Practice: Understanding & Classifying Iris Species
In this section, we will go through a simple machine learning application and create our first model.
The data we will use for this example is the Iris dataset, a classical dataset in machine learning and statistics. It is included in scikit-learn in the datasets module. We can load it by calling the load_iris function:
from sklearn.datasets import load_iris
iris_dataset = load_iris()
This dataset consists of 50 samples from each of three species of iris (Setosa, Versicolor, and Virginica), with the petal and sepal measurements of each flower stored in a 150x4 numpy.ndarray.
The rows are the samples and the columns are: sepal length, sepal width, petal length, and petal width. Our goal is to build a machine learning model that can learn from the measurements of irises whose species is known, in order to predict the species of a new iris.
This is a supervised learning problem because we have measurements for which we know the correct iris species. We want to predict one of several options (the species of iris), which makes this a classification problem. The possible outputs (the different species of iris) are called classes. Since each iris in the dataset belongs to one of three classes, this is a three-class classification problem. The desired output for a single data point (an iris) is the species of that flower; the species a data point belongs to is called its label.
print("Target:\n{}".format(iris_dataset['target']))
Output
Target:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
The meanings of the numbers are given by the iris_dataset['target_names'] array: 0 means setosa, 1 means versicolor, and 2 means virginica.
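We can print this array to confirm the mapping:
print("Target names: {}".format(iris_dataset['target_names']))
Output:
Target names: ['setosa' 'versicolor' 'virginica']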
Measuring Success: Training and Testing Data
We can't evaluate the model using the same data we used to build it, because the model could simply memorize the entire training set and would then always predict the correct label for any point in the training set.
To evaluate the model's performance, we present it with new data (data it hasn't seen before) and labels. This is often accomplished by dividing the labeled data (in this case, our 150 flower measurements) into two halves. The training data or training set is a subset of the data that is utilized to develop our machine learning model. The remaining data will be used to evaluate the model's performance; this is known as the test data, test set, or hold-out set.
The train_test_split function in scikit-learn shuffles the dataset and splits it for you. By default, it extracts 75% of the rows in the data as the training set, together with the corresponding labels for this data. The remaining 25% of the data, together with the remaining labels, becomes the test set.
NB: In scikit-learn, data is usually denoted with a capital X, while labels are denoted by a lowercase y. Let's call train_test_split on our data and assign the outputs using this nomenclature:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=0)
The output of the train_test_split function is X_train, X_test, y_train, and y_test, which are all NumPy arrays. X_train contains 75% of the rows of the dataset, and X_test contains the remaining 25%.
print("X_train shape: {}".format(X_train.shape))
print("y_train shape: {}".format(y_train.shape))
Output:
X_train shape: (112, 4)
y_train shape: (112,)
Inspecting our data
One of the best ways to inspect data is to visualize it. One way to do this is by using a scatter plot. A scatter plot of the data puts one feature along the x-axis and another along the y-axis, and draws a dot for each data point.
import pandas as pd
import mglearn  # helper package accompanying the book "Introduction to Machine Learning with Python"
# create dataframe from data in X_train
# label the columns using the strings in iris_dataset.feature_names
iris_dataframe = pd.DataFrame(X_train, columns=iris_dataset.feature_names)
# create a scatter matrix from the dataframe, color by y_train
# (scatter_matrix lives under pd.plotting in current pandas versions)
grr = pd.plotting.scatter_matrix(iris_dataframe, c=y_train, figsize=(15, 15), marker='o',
                                 hist_kwds={'bins': 20}, s=60, alpha=.8, cmap=mglearn.cm3)
To create the plot, we first convert the NumPy array into a pandas DataFrame. pandas has a function called scatter_matrix that creates pair plots; the diagonal of this matrix is filled with histograms of each feature. The data points are colored according to the species the iris belongs to.
The three classes appear to be relatively well distinguished using the sepal and petal measurements, as seen in the graphs. This means that a machine learning model will almost certainly be able to distinguish them.
Model Building: k-Nearest Neighbors
We will use a k-nearest neighbors classifier, which is easy to understand. Building this model consists only of storing the training set. To make a prediction for a new data point, the algorithm finds the point in the training set that is closest to the new point, and assigns the label of that training point to the new data point.
The k in k-nearest neighbors signifies that instead of using only the single closest neighbor to the new data point, we can consider any fixed number k of neighbors (for example, the closest three or five neighbors). The prediction is then made using the majority class among these neighbors.
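To make the idea concrete, here is a minimal from-scratch sketch of the k = 1 case using plain NumPy (the function name is ours; in practice we use scikit-learn's implementation, shown next):
import numpy as np

def one_nearest_neighbor(X_train, y_train, x_new):
    # Euclidean distance from x_new to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # predict the label of the closest training point
    return y_train[np.argmin(distances)]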
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
Output:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=1, n_neighbors=1, p=2,
weights='uniform')
Making Predictions
After building our model, we are now ready to make predictions, say for a new iris whose measurements we have. Note that scikit-learn expects a two-dimensional array, which is why the single sample below is wrapped in an extra pair of brackets (the measurements themselves are example values). To make a prediction, we call the predict method of the knn object:
import numpy as np

# a hypothetical new iris: sepal length, sepal width, petal length, petal width (cm)
X_new = np.array([[5, 2.9, 1, 0.2]])
prediction = knn.predict(X_new)
print("Prediction: {}".format(prediction))
print("Predicted target name: {}".format(
    iris_dataset['target_names'][prediction]))
Our model predicts that this new iris belongs to class 0, meaning its species is setosa.
Model Evaluation
This is where the test set that we created earlier comes in. This data was not used to build the model, but we do know the correct species for each iris in the test set. Therefore, we can make a prediction for each iris in the test data and compare it against its label (the known species). We can measure how well the model works by computing the accuracy, which is the fraction of flowers for which the right species was predicted:
y_pred = knn.predict(X_test)
print("Test set predictions:\n {}".format(y_pred))
print("Test set score: {:.2f}".format(np.mean(y_pred == y_test)))
Output:
Test set predictions:
[2 1 0 2 0 2 0 1 1 1 2 1 1 1 1 0 1 1 0 0 2 1 0 0 2 0 0 1 1 0 2 1 0 2 2 1 0 2]
Test set score: 0.97
For this model, the test set accuracy is about 0.97, which means we made the right prediction for 97% of the irises in the test set.
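We can also use the score method of the knn object, which computes the test set accuracy for us:
print("Test set score: {:.2f}".format(knn.score(X_test, y_test)))
Output:
Test set score: 0.97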
Did you like this article? If yes, please leave a comment below.
Let's connect on Twitter and LinkedIn.
Happy Pythoning!