After creating multiple models, it’s hard to keep track of them all, especially in a collaborative work environment. Learn how to save and load your models using pickle!
- Before we begin
- Model differences
- Saving the model
- Loading the model
In our last part, we successfully created a model for a remarketing campaign for the holidays. To review the model, we’ll need to share its results with our cross-functional team and staff. We’ll want our data scientists and data analysts to be able to access the model without rebuilding it every time they close their computer.
In this guide, we’ll cover how to export our machine learning model and import it back in using Python. No prior knowledge is required. In part 6, we finished training a classification model, so now we’ll be exporting it. The dataset can also be found here.
When creating a model, it’s worth noting that the results may change each time you run it, because parts of the process rely on pseudo-random numbers. The value that starts that pseudo-random sequence is known as a seed. Therefore, even with the same data, the model may give different results from one run to the next. For instance, our model is trained with logistic regression, which is a discriminative machine learning algorithm.
This doesn’t mean that the algorithm is discriminatory. Rather, it tries to draw a line through our data to represent a boundary, also referred to as the decision boundary. It then classifies each data point by the side of the boundary it falls on. In our case, the two sides are whether or not a user will click on the remarketing email.
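To make the idea of a decision boundary concrete, here is a minimal sketch using scikit-learn. The toy data, the single feature, and the variable names are all made up for illustration and are not the campaign dataset from this series.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: one feature per user, label 1 means "clicked the email".
X = np.array([[0.5], [1.0], [1.5], [2.0], [3.0], [3.5], [4.0], [4.5]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression().fit(X, y)

# decision_function gives a signed score per row: positive values fall on the
# "click" side of the boundary, negative values on the other side.
scores = clf.decision_function(X)
predictions = clf.predict(X)

# With a single feature, the boundary is the point where the score is zero.
boundary = -clf.intercept_[0] / clf.coef_[0][0]

print("Boundary at feature value:", boundary)
print("Predictions:", predictions)
```

Every row with a positive score is predicted as a click and every row with a negative score as a non-click, which is all that "which side of the boundary" means here.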
When data is split into a train and test set without a fixed seed, the rows that land in each set are not guaranteed to be the same each time. Every time the split is made, a new pseudo-random value drives the shuffle, so two runs are highly unlikely to produce the same split.
To keep it consistent, we set random_state equal to a constant value. For this model I’ve chosen 3493 as my seed, so the resulting splits are always the same and the results are easier to replicate.
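As a small sketch of what a fixed seed buys us, the snippet below splits the same data twice with random_state=3493 and checks that both splits match. The stand-in data and the test_size of 0.25 are assumptions for illustration, not values from the original model.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 20 rows, 2 features, alternating labels.
X = np.arange(40).reshape(20, 2)
y = np.array([0, 1] * 10)

SEED = 3493  # the seed chosen in this guide

# Same seed and same data: the same rows land in train and test every time.
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(
    X, y, test_size=0.25, random_state=SEED
)
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(
    X, y, test_size=0.25, random_state=SEED
)

same_split = np.array_equal(X_test_a, X_test_b) and np.array_equal(y_test_a, y_test_b)
print("Seeded splits identical:", same_split)
```

Leaving random_state out of both calls would let each call shuffle differently, which is the inconsistency described above.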
To save our model, we’ll use Python’s pickle module. Pickling converts an object, through serialization, into a byte stream: a sequence of bytes that records the structure of the original model so it can be rebuilt later. Most built-in Python types can be pickled, including booleans, integers, strings, lists, and dictionaries, as well as functions and classes defined at the top level of a module. NumPy arrays can be pickled too, though joblib, which has a similar syntax, is often preferred for objects that carry large NumPy arrays because it handles them more efficiently.
Why it’s called pickle remains something of a mystery, but a fun way to remember the name is to think about why people pickle food. Traditionally, many cultures practice pickling as a form of preservation and storage. A longer shelf life means they can come back to the food later without it spoiling. Likewise, data scientists aren’t going to finish optimizing a model in one sitting, nor will developers share their computers.
The simplest way to save a model is as a bytes object assigned directly to a variable. This can be useful if you don’t need it as a file or want to experiment with different models in the same sitting. When you use pickle.dumps (with an s), the model is serialized and returned as a bytes object.
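Here is a minimal sketch of pickle.dumps. To keep it runnable without any other setup, a plain dictionary stands in for the trained model (an assumption for illustration). A fitted scikit-learn estimator is passed to the same call in the same way.

```python
import pickle

# Stand-in for a trained model (assumption): any picklable Python object
# goes through pickle.dumps the same way a fitted estimator would.
model = {"algorithm": "LogisticRegression", "random_state": 3493}

# dumps (with an s) returns the serialized model as a bytes object.
model_bytes = pickle.dumps(model)

print(type(model_bytes))
print(len(model_bytes), "bytes")
```

Because the result lives only in memory, it disappears when the Python session ends.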
The command to create a pickle file is pickle.dump (without the s), which converts the model into a pickle and writes it to a file. First, we’ll open the file with write access in binary mode, write our pickle into it, and then close it.
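The open, write, and close steps can be sketched as follows. The dictionary is again a stand-in for the trained model, and the temporary folder is an assumption made so the example doesn’t touch your project directory. The file name save.p matches the one used in this series.

```python
import os
import pickle
import tempfile

# Stand-in for a trained model (assumption for illustration).
model = {"algorithm": "LogisticRegression", "random_state": 3493}

# Hypothetical location: a fresh temporary folder holding save.p.
path = os.path.join(tempfile.mkdtemp(), "save.p")

# "wb" opens the file for writing in binary mode, which pickle requires.
# The with block closes the file automatically once the write finishes.
with open(path, "wb") as f:
    pickle.dump(model, f)

print("Saved", os.path.getsize(path), "bytes to", path)
```

Using a with block is a design choice: it covers the closing step even if the write raises an error partway through.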
Similar to saving the model, we’ll use pickle again to load it, this time with the load and loads functions. Once you have a pickle file, you can open it up to retrieve the original object using pickle.load. Likewise, pickle.loads does the same for a bytes object.
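Both loading routes are sketched below. The snippet first creates a pickle file and a bytes object so there is something to load. The dictionary stand-in and the temporary path are assumptions for illustration.

```python
import os
import pickle
import tempfile

# Stand-in for a trained model (assumption for illustration).
model = {"algorithm": "LogisticRegression", "random_state": 3493}

# Setup: create both kinds of pickle so there is something to load.
path = os.path.join(tempfile.mkdtemp(), "save.p")
with open(path, "wb") as f:
    pickle.dump(model, f)
model_bytes = pickle.dumps(model)

# load reads from a file opened in binary read mode ("rb").
with open(path, "rb") as f:
    model_from_file = pickle.load(f)

# loads (with an s) reads from a bytes object already in memory.
model_from_bytes = pickle.loads(model_bytes)

print(model_from_file == model, model_from_bytes == model)
```

Since unpickling can run arbitrary code, it’s best to load only pickles that come from people you trust, such as your own team.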
Next, we’ll evaluate the loaded model on the same split of X_train and X_test and compare the scores. A loaded model doesn’t need to be retrained, and its results match those of the model that was saved, since it’s the same model scored on the same data.
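The sketch below checks that claim end to end. It uses a synthetic dataset from make_classification in place of the campaign data, and the sample counts, test_size, and max_iter values are assumptions chosen only to keep the example small.

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the campaign data (assumption for illustration).
X, y = make_classification(n_samples=200, n_features=5, random_state=3493)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=3493
)

# Train once and record the accuracy on the held-out test set.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
original_score = model.score(X_test, y_test)

# Round-trip through pickle, then score again without retraining.
loaded_model = pickle.loads(pickle.dumps(model))
loaded_score = loaded_model.score(X_test, y_test)

print("Original:", original_score, "Loaded:", loaded_score)
```

The two scores are equal because the pickle stores the fitted coefficients, so the loaded model makes exactly the same predictions on the same test rows.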
Thus ends this segment on saving and loading your machine learning model. We hope that you’ll remember pickle and the process of pickling, and that you’ll pickle and share your own machine learning models. In the next series, we’ll export our results stored as save.p to evaluate our model metrics more thoroughly.