DEV Community

Santiago Beroch

Feature Engineering in Machine Learning

Why Feature Engineering?

Creating features that let a model predict well is often the most important step in a machine learning project. In most projects, the features we start with are few compared with the ones we can derive from the dataset.

Keep in mind that feature engineering is a creative process, and the way to validate whether a feature is useful is to test the model with that feature.

Here are some techniques to help you with feature engineering:

New features

Say you have a dataset that includes a timestamp. You may enumerate all the properties of a timestamp and consider what might be useful for your problem, for example:

  • Weekend or not
  • Night or not
  • Business quarter of the year
  • Public holiday or not
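As a minimal sketch of the list above, the following uses only Python's standard library. The holiday set, the definition of "night" (22:00 to 06:00), and the function name `timestamp_features` are assumptions made for illustration; a real project would plug in its own calendar and cut-offs.

```python
from datetime import date, datetime

# Hypothetical holiday calendar; replace with the dates relevant to your data.
PUBLIC_HOLIDAYS = {date(2021, 1, 1), date(2021, 12, 25)}


def timestamp_features(ts: datetime) -> dict:
    """Derive simple calendar features from a single timestamp."""
    return {
        # weekday() returns 0 for Monday, so 5 and 6 are Saturday and Sunday
        "is_weekend": ts.weekday() >= 5,
        # Assumed definition of "night": from 22:00 up to (not including) 06:00
        "is_night": ts.hour >= 22 or ts.hour < 6,
        # Calendar quarter (1 to 4); a fiscal quarter may be offset from this
        "quarter": (ts.month - 1) // 3 + 1,
        "is_public_holiday": ts.date() in PUBLIC_HOLIDAYS,
    }


features = timestamp_features(datetime(2021, 12, 25, 14, 30))
```

Each key can then become its own column in the dataset.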

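You could also include statistical features for a numeric column in your dataset: mean, median, standard deviation, max, min. These are usually computed per group or over a window, since a single statistic over the whole column would be the same on every row.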
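One possible way to build such statistical features is a per-group aggregate with pandas. The column names `customer` and `amount` and the toy values are assumptions for illustration, not part of any real dataset.

```python
import pandas as pd

# Toy data; "customer" and "amount" are hypothetical column names.
df = pd.DataFrame({
    "customer": ["a", "a", "a", "b", "b"],
    "amount": [10.0, 20.0, 30.0, 5.0, 15.0],
})

for stat in ["mean", "median", "std", "max", "min"]:
    # transform() returns one value per row, aligned with the original index,
    # so every row receives the statistic of its own group.
    df[f"amount_{stat}"] = df.groupby("customer")["amount"].transform(stat)
```

Every row of customer `a` now carries that customer's mean, median, and so on, which a model can compare against the row's own `amount`.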

Transform features

Transformation of a feature is the application of a deterministic mathematical function to each point in a dataset. Say you have a numeric column of data X. You could apply one of the following transformations:

  • Log(X)
  • 1/X
  • X^(1/2)
  • Normalizing X to the range [-1, 1]

The motivation behind transformations is better visualization and interpretability of the data.
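The four transformations listed above could be sketched with NumPy as follows. The toy array is an assumption chosen to be strictly positive, because the logarithm and the reciprocal are undefined at zero; the min-max formula is one common way to scale to [-1, 1].

```python
import numpy as np

x = np.array([1.0, 4.0, 9.0, 16.0])  # strictly positive toy data

log_x = np.log(x)    # undefined for x <= 0
inv_x = 1.0 / x      # undefined at x == 0
sqrt_x = np.sqrt(x)  # X^(1/2), requires x >= 0

# Min-max scaling to [-1, 1]; assumes x is not constant (max > min).
scaled_x = 2.0 * (x - x.min()) / (x.max() - x.min()) - 1.0
```

If the column can contain zeros or negative values, it needs shifting or a different transformation before the logarithm or reciprocal is applied.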

Encoding

Encoding is the process of transforming a categorical variable into a numeric representation so that it can be used in the model. There are many encoding options, for example, one-hot encoding:

  • In this method, we map each category to a vector of 1s and 0s denoting whether that category is present. The number of resulting columns depends on how many categories we want to keep.
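A minimal one-hot encoding sketch using `pandas.get_dummies`; the `color` column and its values are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})  # hypothetical column

# One 0/1 column per category; dtype=int keeps the output numeric rather than boolean.
one_hot = pd.get_dummies(df["color"], prefix="color", dtype=int)

# Attach the encoded columns next to the original data.
df = pd.concat([df, one_hot], axis=1)
```

Each row has exactly one 1 across the new columns. To keep only some categories, rare values can be grouped into an "other" bucket before encoding.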

If you think encoding could help, also check out other encoding techniques: Mean Encoding, Label Encoding, Target Guided Ordinal Encoding.

Interaction between features

Getting a little more creative, you could sum, multiply, or concatenate the data of two columns.
The number of possible interactions can be large, so my recommendation for finding useful ones is to generate them in a loop and keep the ones with the best results.
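The loop idea might look like the sketch below, which assumes scikit-learn is available. The synthetic data, the linear model, and 5-fold cross-validated R^2 as the score are all assumptions chosen to keep the example small; the target is deliberately built from `a * b` so that one interaction clearly wins. Only sums and products are covered here, since concatenation applies to categorical or string columns.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])
y = X["a"] * X["b"] + 0.1 * rng.normal(size=200)  # target built to depend on a*b


def cv_score(features: pd.DataFrame) -> float:
    """Mean 5-fold cross-validated R^2 of a linear model on the given features."""
    return cross_val_score(LinearRegression(), features, y, cv=5).mean()


baseline = cv_score(X)  # score without any interaction feature

results = {}
for left, right in combinations(X.columns, 2):
    candidates = {
        f"{left}_plus_{right}": X[left] + X[right],
        f"{left}_times_{right}": X[left] * X[right],
    }
    for name, values in candidates.items():
        # Score the original features plus exactly one candidate interaction.
        results[name] = cv_score(X.assign(**{name: values}))

best = max(results, key=results.get)
```

Comparing each entry of `results` against `baseline` shows which candidates are worth keeping. Scoring on held-out folds rather than on the training data reduces the risk of keeping interactions that only fit noise.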

Finally, how to test my features?

As explained at the beginning, the way to test your features is to place them in your model and make predictions with it.

  • A cool tip! Place them in a Random Forest model. After you train it, you can get each feature's importance.
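As a sketch of this tip, assuming scikit-learn: the fitted forest exposes `feature_importances_`. The synthetic data and the column names `signal`, `weak`, and `noise` are assumptions, constructed so that the expected ranking is known in advance.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(300, 3)), columns=["signal", "weak", "noise"])
# The target depends strongly on "signal", slightly on "weak", not at all on "noise".
y = 3.0 * X["signal"] + 0.5 * X["weak"] + 0.1 * rng.normal(size=300)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X, y)

# Impurity-based importances; they are normalized to sum to 1.
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
```

These impurity-based importances can favor features with many distinct values, so it may be worth cross-checking with permutation importance on held-out data.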
