Angelica Lo Duca

Why You Should Not Trust the train_test_split() Function

Almost every data scientist has tried the train_test_split() function at least once in their life. The train_test_split() function is provided by the scikit-learn Python package. Usually, we do not care much about the effects of using this function because, with a single line of code, we split the dataset into two parts: the training set and the test set.

Indeed, using this function could be dangerous. And in this article, I will try to explain why.

The article is organized as follows:

  • Overview of the train_test_split() function
  • Potential risks
  • Possible countermeasures.

1 Overview of the train_test_split() function

The train_test_split() function is provided by the model_selection subpackage available under the sklearn package. The function receives as input the following parameters:

  • arrays — the dataset to be split;
  • test_size — the size of the test set. It can be either a float or an integer. If it is a float, it must be between 0.0 and 1.0 and represents the proportion of the dataset to include in the test set. If it is an integer, it is the absolute number of samples to include in the test set. If test_size is not set, it defaults to the complement of train_size; if train_size is also unset, it defaults to 0.25;
  • train_size — the size of the train set. Its behavior is complementary to the test_size variable;
  • random_state — before splitting, the dataset is shuffled. The random_state parameter is an integer that initializes the seed used for shuffling, making the experiment reproducible;
  • shuffle — it specifies whether to shuffle data before splitting or not. The default value is True;
  • stratify — if not None, the data is split in a stratified fashion, using this array as the class labels. The splitting phase then preserves the class label frequencies in both the training and test sets.
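As a quick illustration of the stratify parameter, the following sketch (using a toy imbalanced label array of my own, not from the article) shows that the class ratio is preserved in both splits:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy dataset: 8 samples of class 0 and 4 of class 1 (a 2:1 ratio)
X = [[i] for i in range(12)]
y = [0] * 8 + [1] * 4

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

print(Counter(y_train))  # 6 zeros, 3 ones -> same 2:1 ratio
print(Counter(y_test))   # 2 zeros, 1 one  -> same 2:1 ratio
```

Without stratify=y, a small or unlucky split could easily over- or under-represent the minority class in the test set.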

Usually, we copy the example of how to use the train_test_split() from the scikit-learn documentation and we use it as follows:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

We don’t care much about the effects of this function. Let’s just go ahead with the code.
But there are potential risks, which I will show you in the next section.

2 Potential Risks

Internally, the train_test_split() function uses a seed to pseudorandomly separate the data into two groups: the training set and the test set.

The split is pseudorandom: the same seed value always produces the same subdivision of the data. This is very useful for ensuring the reproducibility of experiments.
Unfortunately, using one seed rather than another can produce totally different training and test sets, and can even change the performance of the Machine Learning model trained on the resulting training set.
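To see this seed sensitivity in practice, here is a small sketch (my own example, not from the article) that trains the same model on splits produced by different random_state values and compares the resulting test accuracies:

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)

# Same model, same data: only the split seed changes.
scores = {}
for seed in [0, 1, 42, 123]:
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.33, random_state=seed
    )
    model = LogisticRegression(max_iter=10000)
    model.fit(X_train, y_train)
    scores[seed] = model.score(X_test, y_test)

print(scores)  # accuracies typically differ from seed to seed
```

If the reported accuracies vary noticeably across seeds, the single number you get from one arbitrary split is partly an artifact of the seed rather than a reliable estimate of model quality.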

To understand the problem, let's take an example.

Continue reading on Towards Data Science

Top comments (1)

Matt Curcio

Someone I know who worked for a Biotech company told me an interesting story.

This person claimed their company did many double-blind human research clinical trials. This company used a computer to randomly (key word here) choose numbers for their trials.

Apparently, they realized that certain numbers were popping up again and again.
Then they realized that the computer choosing their random numbers had used the same seed for a little while.

They implemented a policy that a new random seed be chosen every time for each new clinical trial. They wrote an SOP so that the new random number seeds would now be based on the time in seconds since 1970.
Actually, it was the last x numbers, I think. lol