DEV Community

daud99

General confusion related to Feature Selection

Should I do Feature Selection on the entire dataset?

The answer is NO.

Doing so introduces bias and data leakage. The test data must stay completely unseen while the model is being built; its only purpose is to assess the performance of the trained model. If feature selection is performed on the entire dataset, that no longer holds, because the test samples have already influenced which features were kept.

The model gets an unfair advantage because the features were selected using all the samples, including the ones it will later be evaluated on.
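
A small experiment makes this unfair advantage concrete. The sketch below is not from the original sources: it assumes scikit-learn is installed, uses purely random data, and picks logistic regression and `k=20` arbitrarily. Since the labels are unrelated to the features, an honest accuracy estimate should sit near 0.5.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5000))   # pure noise features
y = rng.integers(0, 2, size=100)   # random labels, unrelated to X

# Leaky: select features using every sample, then cross-validate.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky_score = cross_val_score(
    LogisticRegression(max_iter=1000), X_leaky, y, cv=5
).mean()

# Honest: the selector is refit inside each training fold.
honest_model = make_pipeline(
    SelectKBest(f_classif, k=20), LogisticRegression(max_iter=1000)
)
honest_score = cross_val_score(honest_model, X, y, cv=5).mean()

print(f"selection on all data:      {leaky_score:.2f}")
print(f"selection inside each fold: {honest_score:.2f}")
```

The first number typically lands well above chance even though there is nothing to learn, while the second stays close to it.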

When should we do the feature selection?

  1. First, split your data into training and test sets.
  2. Then perform feature selection on the training data only.
  3. Once feature selection is done, train your model on the selected training features.
  4. Finally, select the same features from the test data and make predictions.
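
The four steps above can be sketched as follows. This is a minimal illustration assuming scikit-learn; the synthetic dataset, `k=5`, and the logistic regression model are placeholder choices for the example.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 20 features, 5 of them informative.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)

# 1. Split first, so the test set stays unseen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# 2. Fit the selector on the training data only.
selector = SelectKBest(f_classif, k=5).fit(X_train, y_train)

# 3. Train the model on the selected training features.
model = LogisticRegression(max_iter=1000).fit(selector.transform(X_train), y_train)

# 4. Keep the same columns in the test data, then predict.
X_test_selected = selector.transform(X_test)
y_pred = model.predict(X_test_selected)
print("test accuracy:", model.score(X_test_selected, y_test))
```

Note that the selector is never fit on `X_test`; it only reuses the column mask learned from the training data.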

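How is feature selection affected when K-fold cross-validation is used?

The order remains the same: split first, then do the feature selection. With K-fold cross-validation this means the selection is repeated inside every fold, using only that fold's training portion.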
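
The same ordering applies to every step that learns from the data, including parameter tuning. One way to arrange this, sketched here under the assumption that scikit-learn is available, is to wrap selection and model in a `Pipeline`, tune it with `GridSearchCV`, and score the whole search with an outer cross-validation loop. The dataset, the `Ridge` model, and the grid values are illustrative only.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_regression(
    n_samples=200, n_features=30, n_informative=5, noise=10.0, random_state=0
)

pipe = Pipeline([
    ("select", SelectKBest(f_regression)),
    ("model", Ridge()),
])

# Inner loop: choose k and alpha using only the data of each outer training fold.
param_grid = {"select__k": [2, 5, 10], "model__alpha": [0.1, 1.0, 10.0]}
inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)
search = GridSearchCV(pipe, param_grid, cv=inner_cv)

# Outer loop: estimate the error of the whole procedure
# (selection + tuning + fitting), not of one fixed model.
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)
outer_scores = cross_val_score(search, X, y, cv=outer_cv)
print("nested CV R^2:", outer_scores.mean())
```

Because the selector lives inside the pipeline, it is refit on each training split automatically, so no held-out sample ever influences which features are kept.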

"CV methods are proven to be unbiased only if all the various aspects of classifier training takes place inside the CV loop. This means that all aspects of training a classifier e.g. feature selection, classifier type selection and classifier parameter tuning takes place on the data not left out during each CV loop. It has been shown that violating this principle in some ways can result in very biased estimates of the true error. "

The right way to Cross Validate with feature selection

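import matplotlib.pyplot as plt
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import KFold

# Assumed to exist already: x (2-D NumPy array of features),
# y (1-D NumPy array of targets) and clf (any scikit-learn regressor).
scores = []

for train, test in KFold(n_splits=5).split(x):
    xtrain, xtest, ytrain, ytest = x[train], x[test], y[train], y[test]

    # Fit the selector on the training fold only,
    # then apply the same column mask to both folds.
    b = SelectKBest(f_regression, k=2)
    b.fit(xtrain, ytrain)
    xtrain = xtrain[:, b.get_support()]
    xtest = xtest[:, b.get_support()]

    clf.fit(xtrain, ytrain)
    scores.append(clf.score(xtest, ytest))

    # Plot held-out predictions against observed values for this fold.
    yp = clf.predict(xtest)
    plt.plot(yp, ytest, 'o')
    plt.plot(ytest, ytest, 'r-')

plt.xlabel("Predicted")
plt.ylabel("Observed")

print("CV Score is ", np.mean(scores))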

Should I do feature encoding, such as one-hot or ordinal encoding, before or after feature selection?

Feature encoding should come before feature selection. The intuition is that the model will consume the encoded features, so their importance should be measured in that same encoded form.
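
A minimal sketch of that ordering, assuming scikit-learn: the encoder runs first and the selector then scores the encoded columns. The data is made up for illustration (a nominal feature stored as integer codes plus one numeric noise column), and `f_classif` with `k=2` is an arbitrary choice.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
n = 400
color = rng.integers(0, 3, size=n)   # hypothetical codes: 0=blue, 1=green, 2=red
size = rng.normal(size=n)            # numeric noise feature
y = (color == 2).astype(int)         # label depends on "red" only...
flip = rng.random(n) < 0.1
y[flip] = 1 - y[flip]                # ...with roughly 10% label noise
X = np.column_stack([color, size])

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([
    # Encode first: column 0 becomes three one-hot columns, column 1 passes through.
    ("encode", ColumnTransformer(
        [("cat", OneHotEncoder(handle_unknown="ignore"), [0]),
         ("num", "passthrough", [1])],
        sparse_threshold=0,  # always return a dense array
    )),
    # Then select among the encoded columns, the form the model actually sees.
    ("select", SelectKBest(f_classif, k=2)),
    ("model", LogisticRegression()),
])
pipe.fit(X_train, y_train)

support = pipe.named_steps["select"].get_support()
print("kept among [blue, green, red, size]:", support)
print("test accuracy:", pipe.score(X_test, y_test))
```

Selecting on the encoded columns lets the selector keep a single informative category (here the "red" indicator) rather than judging the raw column as a whole. Keeping both steps in one pipeline also means they are fit on the training data only.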

References

  1. https://stackoverflow.com/questions/56308116/should-feature-selection-be-done-before-train-test-split-or-after
  2. https://www.nodalpoint.com/not-perform-feature-selection/
  3. https://nbviewer.org/github/cs109/content/blob/master/lec_10_cross_val.ipynb
  4. https://stats.stackexchange.com/questions/64825/should-feature-selection-be-performed-only-on-training-data-or-all-data
  5. https://followthedata.wordpress.com/2013/10/30/the-importance-of-proper-cross-validation-and-experimental-design/
  6. https://datascience.stackexchange.com/questions/95071/should-i-do-one-hot-encoding-before-feature-selection-and-how-should-i-perform-f
  7. https://stats.stackexchange.com/questions/440372/feature-selection-before-or-after-encoding
