General confusion related to Feature Selection

#featureselection #machinelearning #python #scikitlearn

Should I do Feature Selection on the entire dataset?

The answer is NO.

The reason is that it introduces bias and data leakage. The test data must stay completely unseen and be used only to assess the performance of the machine learning model. If feature selection is performed on the entire dataset, that is no longer true.

The model gets an unfair advantage, because the features were selected using all the samples, including the ones it is later tested on.
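To make the leakage concrete, here is a minimal sketch on purely random data, where no feature is actually related to the target. The dataset sizes, the choice of `f_classif` with `k=20`, and the `LogisticRegression` classifier are all assumptions picked for illustration. Scores are averaged over several held-out splits so the comparison is less noisy.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Pure noise: no feature carries any real information about y.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5000))
y = np.repeat([0, 1], 50)

# Leaky: features are chosen using every sample, then the model is evaluated.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky_score = cross_val_score(
    LogisticRegression(max_iter=1000), X_leaky, y, cv=5
).mean()

# Honest: the selector is refit on the training part of each split only.
pipe = make_pipeline(
    SelectKBest(f_classif, k=20), LogisticRegression(max_iter=1000)
)
honest_score = cross_val_score(pipe, X, y, cv=5).mean()

print("Selection on all data:     ", leaky_score)
print("Selection on training only:", honest_score)
```

Since the data is noise, an honest estimate should sit near chance level. The leaky estimate tends to look much better only because the selected features were picked with the held-out samples in view.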

When should we do the feature selection?

First, split your data into train and test sets.
Then, do the feature selection on the training data only.
Once feature selection is done on the training data, train your model on the selected features.
Finally, select the same features from the test data and make the predictions.
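The steps above can be sketched as follows. The synthetic data from `make_regression`, the `LinearRegression` model, and `k=3` are assumptions chosen only to keep the example self-contained.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): 10 features, 3 of them informative.
X, y = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=5.0, random_state=0
)

# 1. Split first.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# 2. Fit the selector on the training data only.
selector = SelectKBest(f_regression, k=3)
X_train_sel = selector.fit_transform(X_train, y_train)

# 3. Train the model on the selected training features.
model = LinearRegression().fit(X_train_sel, y_train)

# 4. Apply the already fitted selector to the test data, then predict.
X_test_sel = selector.transform(X_test)
y_pred = model.predict(X_test_sel)
print("Test R^2:", model.score(X_test_sel, y_test))
```

Note that the test data only ever goes through `transform`, never `fit`, so it has no influence on which features are kept.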

How is feature selection affected when K-Fold Cross Validation is used?

The order remains the same: first split, then do the feature selection. With K folds, this means the selection is repeated inside every fold, using only that fold's training portion.

"CV methods are proven to be unbiased only if all the various aspects of classifier training takes place inside the CV loop. This means that all aspects of training a classifier e.g. feature selection, classifier type selection and classifier parameter tuning takes place on the data not left out during each CV loop. It has been shown that violating this principle in some ways can result in very biased estimates of the true error. "

The right way to Cross Validate with feature selection

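import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import KFold

# Assumes x (2-D NumPy array), y (1-D NumPy array) and clf (any
# scikit-learn regressor) are already defined.
scores = []

for train, test in KFold(n_splits=5).split(x):
    xtrain, xtest, ytrain, ytest = x[train], x[test], y[train], y[test]

    # Fit the selector on the training fold only, then apply it to both folds
    b = SelectKBest(f_regression, k=2)
    b.fit(xtrain, ytrain)
    xtrain = xtrain[:, b.get_support()]
    xtest = xtest[:, b.get_support()]

    # Train and score on the selected features
    clf.fit(xtrain, ytrain)
    scores.append(clf.score(xtest, ytest))

    # Plot predictions against observed values for this fold
    yp = clf.predict(xtest)
    plt.plot(yp, ytest, 'o')
    plt.plot(ytest, ytest, 'r-')

plt.xlabel("Predicted")
plt.ylabel("Observed")

print("CV Score is ", np.mean(scores))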
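The same loop can be written more compactly by wrapping the selector and the model in a scikit-learn `Pipeline` and passing it to `cross_val_score`, which refits every step on the training part of each fold. This is a sketch, not the original author's code: the synthetic `x` and `y` and the `LinearRegression` model are assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic data (assumption) standing in for x and y.
x, y = make_regression(
    n_samples=100, n_features=8, n_informative=2, noise=10.0, random_state=0
)

# Selection and model are one estimator, so both are fit per fold.
pipe = Pipeline([
    ("select", SelectKBest(f_regression, k=2)),
    ("model", LinearRegression()),
])

scores = cross_val_score(pipe, x, y, cv=KFold(n_splits=5))
print("CV Score is ", np.mean(scores))
```

Keeping the selector inside the pipeline makes it hard to accidentally fit it on the left-out fold.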

Should I do Feature encoding such as One hot or Ordinal encoding before or after the Feature Selection?

Feature encoding should be done before feature selection. The intuition is that the model will consume the encoded features, so their importance should be measured in the same form in which the model will use them.
