Apply feature engineering by converting time series data to numerical values for training machine learning models.
- Before we begin
- The datetime data type
- Converting to date
- What’s next?
In our series so far, we’ve gone over scaling data to prepare for model training. We started with a dataset filled with categorical and numerical values and scaled them so that a computer could understand them. With most of the dataset prepared, we’re almost ready to begin model training; we just need to convert our dates into numerical values.
In this section, we’ll be revisiting the datatypes of numerical and categorical values. Please read part 1 and part 2 before proceeding if you’re unfamiliar with those terms. We’ll be using the same big_data dataset used throughout the model training guides.
When collecting data to feed into machine learning models, it’s common to have data on when a user signed up. The model can use this information to find hidden correlations between users. Maybe there was a sign-up bonus or event for users when creating an account. The sign-up dates would then reflect how well that promotion worked, which is worth keeping in mind when reviewing the model.
Dates matter even more when collaborating across different locations or countries. They can be written in many ways and across multiple time zones, so an international standard, ISO 8601 (last updated in 2019), defines a common way to write them. It represents dates and times with numerical values in a fixed order, from the largest unit to the smallest, and it is the basis of what’s commonly known as the datetime format.
Our dates are formatted like 2021-11-30, for example, which follows a year, month, day order. But when you think about what data type it is, it’s hard to say for sure. A computer reads it as an object or string at first. But when humans look at it, it’s obviously a set of numbers. So what is the actual data type?
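To make the string-versus-date question concrete, here is a minimal sketch using only Python’s standard library. The variable names are ours, and the value is the example date from the paragraph above.

```python
from datetime import date

raw = "2021-11-30"                 # as read from a file: plain text
parsed = date.fromisoformat(raw)   # ISO 8601 year-month-day string -> date object

print(type(raw).__name__)      # str
print(type(parsed).__name__)   # date
print(parsed.year, parsed.month, parsed.day)  # 2021 11 30
```

Until it is parsed, the value is just text; once parsed, its year, month, and day are available as integers.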
In Pandas, the to_datetime function converts a column to the datetime data type. It usually takes a format string that specifies how to parse the input by year, month, day, day of week, month name, hour, minute, and second, and it can even account for 12-hour time or time zones. Format strings in Pandas use the strftime directives from the C standard library, familiar from UNIX.
In our current dataset we have one datetime column, Dt_Customer, logged when a user first signs up for an account. Upon inspection, it’s a string or object data type.
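As a rough illustration of how a format string maps onto the input, the sketch below parses two made-up timestamps that use a 12-hour clock. The sample values are hypothetical and are not taken from the big_data dataset.

```python
import pandas as pd

# Hypothetical timestamp strings, for illustration only.
raw = pd.Series(["30/11/2021 02:15 PM", "01/12/2021 09:05 AM"])

# %d day, %m month, %Y four-digit year, %I hour on a 12-hour clock,
# %M minute, %p AM/PM marker
parsed = pd.to_datetime(raw, format="%d/%m/%Y %I:%M %p")

print(parsed)
```

Each directive in the format string lines up with one piece of the text, including the separators between them.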
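One way to inspect the column is sketched below. Since the real big_data file isn’t reproduced here, the DataFrame is a hypothetical stand-in with a few made-up rows; only the Dt_Customer column name comes from the guide.

```python
import pandas as pd

# Hypothetical stand-in for big_data with made-up sign-up dates.
df = pd.DataFrame({"Dt_Customer": ["21-08-2021", "05-03-2020", "30-11-2021"]})

print(df["Dt_Customer"].head())

# The column has not been parsed yet, so it is not a datetime column,
# and each individual value is still a Python string.
print(pd.api.types.is_datetime64_any_dtype(df["Dt_Customer"]))
print(type(df["Dt_Customer"].iloc[0]).__name__)
```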
Looking at the output, we see 21-08-2021. Since 21 cannot be a month, the value is in day, month, year order. Comparing with the strftime cheatsheet, the matching format string is %d-%m-%Y.
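A minimal sketch of the conversion, assuming the column holds day-month-year strings like the one above. The sample rows are hypothetical stand-ins for the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for big_data with made-up sign-up dates.
df = pd.DataFrame({"Dt_Customer": ["21-08-2021", "05-03-2020", "30-11-2021"]})

# %d = day, %m = month, %Y = four-digit year, separated by hyphens
df["Dt_Customer"] = pd.to_datetime(df["Dt_Customer"], format="%d-%m-%Y")

print(df["Dt_Customer"])
```

Passing the format explicitly avoids any guessing about whether 05-03-2020 means March 5 or May 3.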
But we aren’t done yet. Even though the column is now in datetime format, a model still cannot use it directly. To finish off the conversion, we’ll break the datetime down into separate columns for year, month, and day.
Once a column has the datetime type, Pandas exposes accessors that pull out specific portions of each value. We’ll be using the dt.year, dt.month, and dt.day attributes.
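The split can be sketched as follows. The new column names Year, Month, and Day are our own choice for illustration, and the sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for big_data, already converted to datetime.
df = pd.DataFrame({"Dt_Customer": ["21-08-2021", "05-03-2020", "30-11-2021"]})
df["Dt_Customer"] = pd.to_datetime(df["Dt_Customer"], format="%d-%m-%Y")

# Each .dt attribute returns an integer column with one part of the date.
df["Year"] = df["Dt_Customer"].dt.year
df["Month"] = df["Dt_Customer"].dt.month
df["Day"] = df["Dt_Customer"].dt.day

print(df)
```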
Once we are sure that the new columns match the original values, let’s remove the original column so the dataset contains only machine-readable values.
Now all of our data has been simplified to the point that a computer can understand it and generate models from it. Throughout the series we’ve covered scaling data, filling in missing values, and now converting datetimes. For our finale, we’ll take all of our finished datasets from parts 1 through 4 and combine them to begin training a classification model for remarketing, which predicts whether or not we should send another email to our customers.
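One possible way to check the match before dropping the column is to rebuild the dates from the three parts and compare them with the original. This is a sketch under the same assumptions as before: hypothetical sample rows and our own Year, Month, and Day column names.

```python
import pandas as pd

# Hypothetical stand-in for big_data, with the date already split into parts.
df = pd.DataFrame({"Dt_Customer": ["21-08-2021", "05-03-2020", "30-11-2021"]})
df["Dt_Customer"] = pd.to_datetime(df["Dt_Customer"], format="%d-%m-%Y")
df["Year"] = df["Dt_Customer"].dt.year
df["Month"] = df["Dt_Customer"].dt.month
df["Day"] = df["Dt_Customer"].dt.day

# to_datetime can assemble dates from columns named year, month, and day.
parts = df[["Year", "Month", "Day"]].rename(columns=str.lower)
rebuilt = pd.to_datetime(parts)
values_match = bool((rebuilt == df["Dt_Customer"]).all())
print(values_match)

# Drop the original column only if the parts reproduce it exactly.
if values_match:
    df = df.drop(columns=["Dt_Customer"])

print(df.columns.tolist())
```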