Rodolfo Mendes

Posted on Dec 22, 2020 • Originally published at reinforcement-learning4.fun

Basic Setup For a Machine Learning Project With Python

#machinelearning #datascience #python

In this tutorial, we present the process for configuring a basic Machine Learning project using the Python programming language.

1. Check basic tools

The first step is to check if you have the necessary tools installed on your computer. For our configuration, we need the following tools:

Python
pip
virtualenv
git

Python

Most Machine Learning projects and libraries are written in the Python programming language. To run a program written in Python, we need to install the Python interpreter. I highly recommend installing a Python 3 interpreter, since Python 2 reached its end of life in January 2020 and the community no longer supports it.

To check if you have a Python interpreter on your computer and its version, run the following command:

$ python --version

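If the interpreter is installed, the command prints the word Python followed by the installed version. On systems where the python command is missing or still points to Python 2, try python3 --version instead.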
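The same check can also be made from inside Python, which is handy at the top of a script. The following is a minimal sketch using only the standard library; the helper name is_python3 is hypothetical and not part of any tool mentioned here:

```python
import sys


def is_python3(version_info=sys.version_info):
    """Return True when the given version tuple belongs to Python 3 or later."""
    return version_info[0] >= 3


# Print the running interpreter's version, similar to `python --version`.
print("Python %d.%d.%d" % tuple(sys.version_info[:3]))
print("Python 3 or later:", is_python3())
```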

pip

Python comes with only a set of essential libraries in its default installation. However, we need additional tools and libraries to build our Machine Learning projects. Manually downloading and installing these libraries is an error-prone task, because we would also need to download each library's dependencies and could accidentally install conflicting versions.

The tool pip addresses this issue by automating the installation of libraries and their dependencies on our system. With a single command, we locate and download the latest version of the library we need and the required dependencies.

To check if pip is installed in our system, we run the following command:

$ pip --version

If pip is correctly installed, the command will print the installed version of pip and its full path. Recent versions of Python install pip by default.

virtualenv

When we install new libraries, pip copies the library and its dependencies into our local Python installation. But when we have multiple projects on our computer, we may need different versions of the same library, or libraries that require different versions of the same dependency, which causes conflicts between them. We avoid this problem by creating a virtual environment under the root folder of each project. When we create and activate a virtual environment, pip installs the libraries and their dependencies into the virtual environment folder, keeping the system-wide installation free of project-specific dependencies.

To create and manage virtual environments, we use the tool virtualenv. To check if virtualenv is installed on your computer, run the following command:

$ virtualenv --version

If the virtualenv tool is correctly installed, the command will print the installed version of virtualenv and its full path.

git

When developing a Machine Learning project, we want to save different versions of our code to roll back to a previous version if something goes wrong. However, managing different versions of our code using separate files is impractical and may lead us to errors and confusion. A better approach is using a version management tool to keep track of different versions of our code. Git is a widely used version management tool available for free that allows us to track different versions of our code and save them in remote repositories. To check if git is installed on your computer, run the following command:

$ git --version

If git is correctly installed, the command will print the installed version of git. If you get an error, then you need to install or fix your git installation.
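The four checks above can also be scripted. The sketch below uses shutil.which from the standard library to look up each tool on the PATH; the helper name find_tools and the TOOLS list are illustrative assumptions, and the script only reports where a tool is found, not its version:

```python
import shutil

# The command names checked in this section.
TOOLS = ["python", "pip", "virtualenv", "git"]


def find_tools(names):
    """Map each tool name to its full path, or None if it is not on the PATH."""
    return {name: shutil.which(name) for name in names}


for name, path in find_tools(TOOLS).items():
    print(f"{name}: {path or 'not found'}")
```

A tool reported as not found either needs to be installed or is installed under a different command name.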

Now that you have the essential tools in place, it is time to create our project structure.

2. Create the project structure

The first step is to create the root folder for our project. You can use your operating system's file manager or a shell command. For Unix-like shell environments, use the following commands to create the directory and then navigate into it:

$ mkdir <root-folder-name>
$ cd <root-folder-name>

Under the project's root folder, create and activate a virtual environment to keep our libraries and dependencies apart from our system's path:

$ virtualenv <virtual-environment-name>
$ source <virtual-environment-name>/bin/activate

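In a Windows environment with a Bash-like shell such as Git Bash, the activate script lives in the Scripts folder instead, so we use the following command (in the Windows Command Prompt, run <virtual-environment-name>\Scripts\activate without source):

$ source <virtual-environment-name>/Scripts/activate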
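To see where the activate script ends up on each platform, here is a minimal sketch. Note that it swaps in the standard-library venv module rather than the virtualenv tool used in this guide, on the assumption that both lay out the bin (or Scripts) folder the same way; the function name create_env and the folder name demo-env are hypothetical:

```python
import os
import tempfile
import venv
from pathlib import Path


def create_env(env_dir):
    """Create a virtual environment and return the path of its activate script."""
    # with_pip=False keeps the sketch fast; pass with_pip=True to get pip as well.
    venv.create(env_dir, with_pip=False)
    scripts = "Scripts" if os.name == "nt" else "bin"
    return Path(env_dir) / scripts / "activate"


# Demonstrate in a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    activate = create_env(Path(tmp) / "demo-env")
    print(activate.name, "exists:", activate.exists())
```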

When we create a virtual environment, the virtualenv command creates a new folder under our project's root directory containing a copy of the Python interpreter, a pip installation, and the folder structure in which libraries are installed. The activate script puts the virtual environment's folder at the front of the shell's PATH variable, so that the python and pip commands run from the virtual environment.
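A quick way to confirm that activation worked is to ask the interpreter itself. This is a small sketch that assumes Python 3.3 or later and a recent virtualenv (version 20 or newer), where sys.prefix and sys.base_prefix differ inside an environment; the function name in_virtual_environment is hypothetical:

```python
import sys


def in_virtual_environment():
    """Return True when the running interpreter belongs to a virtual environment."""
    # In a virtual environment, sys.prefix points at the environment folder,
    # while sys.base_prefix still points at the base Python installation.
    base = getattr(sys, "base_prefix", sys.prefix)
    return sys.prefix != base


print("prefix:", sys.prefix)
print("inside a virtual environment:", in_virtual_environment())
```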

3. Essential libraries for Machine Learning

After creating the virtual environment, we need to install some essential libraries for Machine Learning and Data Science.

NumPy

NumPy is an optimized library for basic numerical manipulation. It contains objects and functions to store arrays and matrices and to perform linear algebra operations, basic statistics, random number generation, and numerical transformations. To install the latest NumPy version in your project, run the following command:

$ pip install numpy

Pandas

Pandas is a data manipulation library. It contains functions and objects to load data from many types of data sources into in-memory tables called data frames. With data frames, we can easily index and transform our data. To install the latest version of pandas, run the following command:

$ pip install pandas

Matplotlib

Matplotlib is a library for charting and data visualization. Using Matplotlib, we create highly customizable visualizations for exploratory data analysis. To install the latest version of Matplotlib, run the following command:

$ pip install matplotlib

Seaborn

Seaborn is a data visualization library built on top of Matplotlib. Seaborn provides options for styling Matplotlib charts and a gallery of higher-level chart types. To install the latest version of Seaborn, run the following command:

$ pip install seaborn

Scikit-learn

Scikit-learn is a Machine Learning library that provides algorithms for both supervised and unsupervised learning. Scikit-learn provides state-of-the-art implementations for classical Machine Learning algorithms like Linear and Logistic regression, K-Nearest-Neighbors, Decision Trees, and Support Vector Machines. To install the latest version of Scikit-learn, run the following command:

$ pip install scikit-learn
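Once the installs finish, a short script can confirm that every library is visible to the active interpreter. The sketch below uses importlib.util.find_spec from the standard library, which looks a module up without importing it; the helper name missing_modules is hypothetical. Note that import names can differ from pip names: scikit-learn is imported as sklearn.

```python
from importlib.util import find_spec

# Import names, which can differ from pip names: scikit-learn is imported as sklearn.
ESSENTIALS = ["numpy", "pandas", "matplotlib", "seaborn", "sklearn"]


def missing_modules(names):
    """Return the module names from `names` that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(ESSENTIALS)
print("missing:", missing if missing else "none")
```

If the script lists a library as missing even though pip reported success, the virtual environment is probably not active in the current shell.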

After installing the necessary libraries for our project, it is a good idea to save the project's current configuration for later use. Using pip, we can dump the list of the project's installed dependencies to a text file and later restore the same configuration in another environment with pip install -r requirements.txt. To dump the installed dependencies to a text file, use the following command:

$ pip freeze > requirements.txt
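The resulting file holds one name==version line per installed package. As a rough illustration of that format, the sketch below parses such lines into a dictionary; the function name parse_pinned is hypothetical, the version numbers in the sample are illustrative only, and anything that is not a plain name==version pin is simply skipped:

```python
def parse_pinned(text):
    """Parse `name==version` lines, as written by pip freeze, into a dict."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and anything that is not a plain pin.
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


# Illustrative content only; real version numbers depend on when you install.
sample = """numpy==1.19.4
pandas==1.1.5
# a comment
"""
print(parse_pinned(sample))
```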

4. Create a git repository

Now that we have our essential libraries in place, it is time to create a git repository under our project's root folder to track our files' versions. Under the project's root folder, run the following command to initialize a new git repository:

$ git init

After initializing a repository, it is a good idea to create a .gitignore file and list the file extensions and directories that git should not track. Usually, these ignored files are output files of compilers and other tools, or local configuration files that are not part of our work. The website gitignore.io provides a tool to generate .gitignore files with the most commonly ignored extensions and directories for different programming languages and environments. After creating your .gitignore file, place it under the project's root folder.
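For this project, the virtual environment folder and Python's bytecode caches are natural candidates for the ignore list. The sketch below builds a very small .gitignore as text; it is only a starting point under those assumptions, far shorter than what gitignore.io generates, and the function name build_gitignore and the folder name venv are hypothetical:

```python
def build_gitignore(venv_name):
    """Return the text of a small .gitignore for a Python project."""
    entries = [
        "# Virtual environment folder",
        f"{venv_name}/",
        "# Python bytecode caches",
        "__pycache__/",
        "*.pyc",
    ]
    return "\n".join(entries) + "\n"


print(build_gitignore("venv"), end="")
```

To save the result, write the returned text to a file named .gitignore in the project's root folder, for example with pathlib.Path(".gitignore").write_text(...).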

It is also worthwhile to create the files README.md and LICENSE under your project's root folder. The README.md file should contain the essential documentation for your project, like instructions to download and execute it. Most Git repository browsers automatically render the content of README.md when you access the repository. The LICENSE file should describe the terms and conditions under which others may use your software.

Then we need to stage our files to the repository index and create our first commit:

$ git add .

The command above will stage every file under the current folder to the index. However, it will not stage files and directories that match an entry in the .gitignore file. It is also a good idea to check that we added all files correctly with the command:

$ git status

If everything is correct, then we create our first commit using the following command:

$ git commit -m 'Initial commit'

Keeping our projects only on our local computer is very risky. The laptop could suffer an accident that damages its disk, or a burglar could steal it. In either case, we can lose weeks or even months of hard work if our local repository is the only one we have. To keep our project in a safer place, we can synchronize our local repository with a remote one.

Services like GitHub, Bitbucket, or GitLab allow us to keep public and private Git repositories for free. After creating an empty remote repository in one of these services, we can add it as a remote in our local repository:

$ git remote add origin <remote-repository-url>

Finally, we can push our code to our remote repository. If your default branch has a different name, such as main, use that name instead of master:

$ git push origin master
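For readers who prefer to script these steps, the sequence from this section can be collected in one place. This is a sketch only: the helper names git_setup_commands and run_all are hypothetical, the remote URL is the same placeholder used above, and the example prints the commands rather than running them:

```python
import shlex
import subprocess


def git_setup_commands(remote_url, branch="master"):
    """Return the git commands from this section as argument lists, in order."""
    return [
        ["git", "init"],
        ["git", "add", "."],
        ["git", "commit", "-m", "Initial commit"],
        ["git", "remote", "add", "origin", remote_url],
        ["git", "push", "origin", branch],
    ]


def run_all(commands, cwd):
    """Run each command in `cwd`, stopping at the first failure."""
    for args in commands:
        subprocess.run(args, cwd=cwd, check=True)


# Print the commands instead of running them; the URL below is a placeholder.
for args in git_setup_commands("<remote-repository-url>"):
    print(shlex.join(args))
```

Calling run_all(git_setup_commands(url), cwd=project_folder) would execute the steps for real, so it is worth reviewing the printed commands first.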

5. An alternative path

An alternative path to create our project structure is to start by creating the remote repository. Using services like GitHub, Bitbucket, or GitLab, we can create our remote repository first and initialize it with the files .gitignore, README.md, and LICENSE. Then we use the git clone command to download the repository to our local computer:

$ git clone <remote-repository-url>

The command above will create a local copy of the remote repository on our computer. Then we navigate to our project's root folder and follow the rest of the process, starting from the virtual environment and proceeding to the installation of libraries and dependencies.

Conclusion

This guide presented the essential tools and steps to configure a basic Machine Learning project. You may need to install additional tools and libraries depending on your needs. For example, you may want to install Keras or PyTorch for Deep Learning projects, or you may need to install libraries for image or text manipulation. However, with this basic setup, you can start loading and exploring tabular data and creating powerful Machine Learning models. It is worth mentioning that, because you don't create new projects every day, you don't need to worry about memorizing these steps. You can come back to this guide whenever you need it.