How to Use NumPy, Pandas, and Scikit-Learn for AI and Machine Learning in Python

Python has become the go-to language for data science and machine learning due to its simplicity and the availability of powerful libraries. Three important Python libraries for AI and ML tasks are NumPy, Pandas, and Scikit-Learn. In this article, we will see how these libraries provide useful capabilities for working with data and building ML models.

NumPy for Numerical Data Processing

NumPy provides an efficient multidimensional array object for working with large datasets in Python. Some ways NumPy can be used for AI/ML tasks:

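Storing and processing dataset features and labels as NumPy arrays. Arrays hold values of a single dtype in contiguous memory, so vectorized operations on them are faster and more memory-efficient than looping over Python lists.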
Mathematical and logical operations on arrays for data preprocessing - scaling, normalization, clipping outliers etc.
Random number generation for parameter initialization, splitting data etc.
Linear algebra operations like dot product, matrix multiplication etc. useful for neural networks.
Interoperates with Scikit-Learn, TensorFlow, PyTorch etc., which accept NumPy arrays or convert them to their own tensor types.

For example, we can normalize an input feature matrix (z-score standardization, column by column) as:

import numpy as np

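# features is assumed to be an existing 2-D sequence of shape (n_samples, n_features)
features = np.array(features, dtype=float)  # convert to a float NumPy array
# Standardize each column to zero mean and unit variance
# (a constant column has std 0, so the division would yield NaN)
features = (features - np.mean(features, axis=0)) / np.std(features, axis=0)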
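
The other operations listed above (clipping outliers, random splitting, parameter initialization and matrix products) can be sketched in the same way. This is a minimal illustration, not a recommended pipeline: the toy data, the clipping range of [-2, 2], the 4/2 split and the weight scale are all assumptions made up for the example.

```python
import numpy as np

# Hypothetical toy data: 6 samples with 3 features each (values are illustrative)
rng = np.random.default_rng(seed=0)
features = rng.normal(loc=0.0, scale=1.0, size=(6, 3))
labels = np.array([0, 1, 0, 1, 1, 0])

# Clip outliers to an assumed range of [-2, 2]
clipped = np.clip(features, -2.0, 2.0)

# Shuffle the row indices, then split into 4 training rows and 2 test rows
indices = rng.permutation(len(clipped))
train_idx, test_idx = indices[:4], indices[4:]
X_train, X_test = clipped[train_idx], clipped[test_idx]
y_train, y_test = labels[train_idx], labels[test_idx]

# Randomly initialize a weight vector and compute a linear layer's output
weights = rng.normal(scale=0.1, size=3)
outputs = X_train @ weights  # matrix-vector product, one value per training row
```

Using a seeded generator from np.random.default_rng keeps the shuffle and the initial weights reproducible between runs.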

Pandas for Data Cleaning and Preparation

Pandas provides easy-to-use data structures and tools for loading, cleaning, transforming and preparing structured datasets for modeling. Key features:

pd.DataFrame for tabular data manipulation.
Tools for handling missing data, duplications, formatting issues etc.
Split-Apply-Combine operations for fast data transformation.
Merge, join, concatenate datasets.
Vectorized column arithmetic for scaling features.
pd.get_dummies() for one-hot encoding categorical variables.
Sampling, splitting and slicing datasets.

For example, we can load, explore and clean a dataset as:

import pandas as pd

# Load dataset
df = pd.read_csv('data.csv') 

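# Explore, summarize and check for null values
df.info()                  # prints column dtypes and non-null counts
print(df.describe())       # summary statistics for numeric columns
print(df.isnull().sum())   # number of missing values per column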

# Handle missing values and reformat columns ('column' and 'date' are placeholder names)
df['column'] = df['column'].fillna(0)
df['date'] = pd.to_datetime(df['date'])
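
Several of the features listed above (removing duplicates, split-apply-combine, merging and one-hot encoding) can be sketched on a small in-memory example. The two tables, their column names and their values are hypothetical and exist only to make the snippet self-contained.

```python
import pandas as pd

# Hypothetical toy tables; column names and values are illustrative only
orders = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 3],
    "category": ["books", "toys", "toys", "books", "games"],
    "amount": [10.0, 25.0, 25.0, 5.0, 40.0],
})
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "region": ["north", "south", "north"],
})

# Remove exact duplicate rows (the order for customer 2 appears twice)
orders = orders.drop_duplicates().reset_index(drop=True)

# Split-apply-combine: total amount per customer
totals = orders.groupby("customer_id", as_index=False)["amount"].sum()

# Merge the aggregated totals with the customer attributes
merged = totals.merge(customers, on="customer_id", how="left")

# One-hot encode the categorical column
encoded = pd.get_dummies(orders, columns=["category"])
```

Here how="left" keeps every row of totals even if a customer were missing from the customers table, which is usually the safer choice when the left table is the one being modeled.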

Scikit-Learn for Building ML Models

Scikit-Learn provides a consistent interface for building and evaluating machine learning models in Python. Key capabilities:

Classification algorithms like SVM, random forest, logistic regression etc.
Regression algorithms like linear regression, decision trees etc.
Model evaluation metrics, cross-validation strategies.
Model selection, hyperparameter tuning, pipeline tools.
Model persistence with joblib or pickle, so trained models can be saved and reloaded for deployment.

For example, we can train and evaluate a random forest classifier as:

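from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Split data into train and test sets
# (X is the feature matrix and y the label vector, assumed to be defined already;
# random_state fixes the shuffle so the split is reproducible)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)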

# Train model
clf = RandomForestClassifier()
clf.fit(X_train, y_train)

# Evaluate on test data 
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
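
The pipeline, cross-validation and hyperparameter tuning tools mentioned above can be combined as in the sketch below. It uses the Iris dataset bundled with Scikit-Learn as a stand-in for real data, and the step names ("scale", "clf"), the number of trees and the small max_depth grid are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Iris is used here only as a convenient stand-in dataset
X, y = load_iris(return_X_y=True)

# Chain preprocessing and the model so both are fitted on training folds only
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", RandomForestClassifier(n_estimators=20, random_state=0)),
])

# 5-fold cross-validation: one accuracy score per fold
scores = cross_val_score(pipeline, X, y, cv=5)

# Grid search over an assumed, deliberately small hyperparameter grid;
# the "clf__" prefix routes the parameter to the pipeline step named "clf"
param_grid = {"clf__max_depth": [2, None]}
search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(X, y)
best_params = search.best_params_
```

Wrapping the scaler and the classifier in one Pipeline means the scaler's statistics are recomputed inside each fold, which avoids leaking information from the validation data into training.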

Together, NumPy, Pandas and Scikit-Learn provide a powerful stack for AI and ML applications in Python: NumPy for numerical arrays, Pandas for data preparation, and Scikit-Learn for modeling. Learning how to leverage these libraries can help build and deploy models more efficiently.
