Train, Dev and Test Sets


Introduction

Machine learning is an iterative process that aims to find the best set of parameters for a model or network. It involves conceptualizing a model, implementing it in code, and experimenting with different sets of hyperparameters.

Purpose of Test, Train, and Dev Sets

To determine the best model configurations, the dataset is initially divided into train, dev, and test sets.

The train set is used to fit the model, once for each candidate set of hyperparameters. After several models have been trained on the train set, the dev set (also called the validation set) is used for model selection. The best model is then evaluated on the test set, using appropriate evaluation metrics, to estimate how it will perform on unseen data.

Common Practices

It is often acceptable to have a dev set without a test set, provided an unbiased estimate of final performance is not required. The dev set then serves for both model selection and testing, so the reported score tends to be optimistic, because the model was chosen to do well on that same data.
Although the train set and dev set can come from different distributions, the dev set and the test set must always come from the same distribution. Otherwise, the model that scores best on the dev set may not be the one that performs best on the test set.
A common ratio for partitioning data is 80-10-10 (train-dev-test). For large datasets, such as those with one million data points, a much smaller share can go to the dev and test sets (for example, 98-1-1), since 1% of one million is still 10,000 examples.
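
As a rough sketch of these guidelines, the snippet below performs a three-way split with a configurable ratio. The helper name three_way_split, its parameters, and the synthetic data are assumptions made for illustration; they are not part of scikit-learn. It first carves off one held-out pool and then divides that pool into dev and test, so both come from the same distribution, and stratify keeps the class proportions the same in every part.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def three_way_split(X, y, dev_frac=0.1, test_frac=0.1, seed=0):
    """Split (X, y) into train, dev, and test parts (hypothetical helper)."""
    holdout_frac = dev_frac + test_frac
    # First carve off the combined dev+test pool from the training data.
    X_train, X_hold, y_train, y_hold = train_test_split(
        X, y, test_size=holdout_frac, stratify=y, random_state=seed
    )
    # Then divide that single pool into dev and test, so both are drawn
    # from the same distribution.
    X_dev, X_test, y_dev, y_test = train_test_split(
        X_hold, y_hold, test_size=test_frac / holdout_frac,
        stratify=y_hold, random_state=seed,
    )
    return (X_train, y_train), (X_dev, y_dev), (X_test, y_test)


# Synthetic stand-in data (an assumption): 1,000 rows, 2 balanced classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = np.repeat([0, 1], 500)

train, dev, test = three_way_split(X, y, dev_frac=0.1, test_frac=0.1)
split_sizes = (len(train[0]), len(dev[0]), len(test[0]))
print(split_sizes)
```

With 1,000 rows and 10% fractions this gives 800, 100, and 100 rows. For a very large dataset, passing dev_frac=0.01 and test_frac=0.01 would give the 98-1-1 split mentioned above.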

Example

Let's consider a simple example of using test, train, and dev sets for a classification task using Python and the scikit-learn library. In this example, we'll use the famous Iris dataset.

# Importing necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Further split the train set into train and dev sets
# (25% of the remaining 80% is 20% of all data, so the overall split is 60-20-20)
X_train, X_dev, y_train, y_dev = train_test_split(X_train, y_train, test_size=0.25, random_state=42)

# Create and train a logistic regression model on the train set
# (max_iter is raised above the default of 100 to give the solver room to converge)
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# Predict labels on the dev set
dev_predictions = model.predict(X_dev)

# Evaluate accuracy on the dev set
dev_accuracy = accuracy_score(y_dev, dev_predictions)
print("Dev set accuracy:", dev_accuracy)

# Predict labels on the test set
test_predictions = model.predict(X_test)

# Evaluate accuracy on the test set
test_accuracy = accuracy_score(y_test, test_predictions)
print("Test set accuracy:", test_accuracy)

In this code:

We load the Iris dataset.
We split the data into train and test sets using an 80-20 split.
We further split the train set into train and dev sets using a 75-25 split, which gives 60-20-20 overall.
We create a logistic regression model and train it on the train set.
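We evaluate the model's accuracy on the dev set.
Finally, we evaluate the model's accuracy on the test set.

This example demonstrates the typical workflow: split the dataset into train, dev, and test sets, train a model on the train set, check it on the dev set, and evaluate the final model on the test set. Because only one model is trained here, no hyperparameter tuning actually takes place. In practice, several candidate configurations would be compared on the dev set before the test set is touched.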
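
To make the model selection step concrete, here is a minimal sketch that reuses the same Iris split, trains one logistic regression per candidate value of C (the inverse regularization strength), keeps the one with the best dev accuracy, and only then scores it on the test set. The candidate values and max_iter=1000 are arbitrary choices made for illustration, not recommended settings.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Same two-step split as in the example above.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train, X_dev, y_train, y_dev = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42
)

# Candidate hyperparameter values (assumed for illustration).
candidate_C = [0.01, 0.1, 1.0, 10.0]

# Train one model per candidate and record its accuracy on the dev set.
dev_scores = {}
for C in candidate_C:
    model = LogisticRegression(C=C, max_iter=1000)
    model.fit(X_train, y_train)
    dev_scores[C] = accuracy_score(y_dev, model.predict(X_dev))

# Model selection: keep the candidate with the highest dev accuracy.
best_C = max(dev_scores, key=dev_scores.get)

# The test set is used exactly once, on the selected configuration.
best_model = LogisticRegression(C=best_C, max_iter=1000)
best_model.fit(X_train, y_train)
test_accuracy = accuracy_score(y_test, best_model.predict(X_test))

print("Dev accuracy per C:", dev_scores)
print("Selected C:", best_C)
print("Test accuracy:", test_accuracy)
```

Keeping the test set out of the loop is the point of the design: the dev score of the winning model is slightly optimistic because it was used to choose that model, while the test score is not.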

Glad to write my first article. Leave your comments down below.
:)
