Variable selection before or after cross validation in Python



I have a high-dimensional dataset for classification: 1500 features and 45000 data points.
My initial approach for modeling was:

  1. Divide the dataset into training and testing sets
  2. Perform variable selection on the training set
  3. Create a new dataset with only the relevant features and perform cross validation on it
  4. Validate the model against the testing set

I am not sure if my approach is correct. I read online that variable selection should not be performed before cross validation, but performing cross validation on a dataset with 1500 features takes a lot of time.

I would really appreciate any input on this!


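Try PCA to reduce your feature count, for example by keeping only the top 20 components (note that PCA builds new components out of all the features rather than picking 20 of the original ones)…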
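A rough sketch of the PCA suggestion, assuming scikit-learn is available. The data here is synthetic and much smaller than the real 45000 x 1500 set, and the 20 components are just the number mentioned above, not a tuned value. The scaler and PCA are fitted on the training split only, so the test split does not influence the projection.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real dataset (assumption: 500 rows, 100 features).
X, y = make_classification(n_samples=500, n_features=100, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Fit scaling and PCA on the training split only.
scaler = StandardScaler().fit(X_train)
pca = PCA(n_components=20).fit(scaler.transform(X_train))

# Apply the already-fitted transforms to both splits.
X_train_reduced = pca.transform(scaler.transform(X_train))
X_test_reduced = pca.transform(scaler.transform(X_test))

# Each of the 20 components is a weighted mix of all 100 original features.
print(X_train_reduced.shape, X_test_reduced.shape, pca.components_.shape)
```

Because every component mixes all the original columns, this reduces dimensionality but does not tell you which 20 original variables matter.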


Hi @psnh

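Yeah, your approach is workable. Variable selection should be performed on the training data before cross validation, and afterwards you should be left with far fewer than the original 1500 features. I think you are keeping all the features even after feature selection; keep only the selected features and apply cross validation on those alone. Keep in mind that cross-validation scores obtained this way tend to be optimistic, because the selection step has already seen every fold, so treat the held-out test set from your step 4 as the honest estimate.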
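If you want the cross-validation scores themselves to be free of selection leakage, one option is to wrap the selection step and the model in a scikit-learn Pipeline, so the selector is refitted on the training part of each fold. This is only a minimal sketch: the synthetic data, SelectKBest with k=20, and logistic regression are placeholder assumptions, not something taken from your setup.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real dataset (assumption: 600 rows, 100 features).
X, y = make_classification(
    n_samples=600, n_features=100, n_informative=10, random_state=0
)

# Step 1: hold out a test set that is never used for selection or tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Steps 2 and 3: selection lives inside the pipeline, so each CV fold
# selects features using only that fold's training portion.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif, k=20)),
    ("clf", LogisticRegression(max_iter=1000)),
])
cv_scores = cross_val_score(pipe, X_train, y_train, cv=5)

# Step 4: refit on the full training set and check against the test set.
pipe.fit(X_train, y_train)
test_score = pipe.score(X_test, y_test)
n_selected = int(pipe.named_steps["select"].get_support().sum())

print("CV accuracy per fold:", cv_scores)
print("Test accuracy:", test_score, "features kept:", n_selected)
```

A univariate filter like this is cheap even with 1500 columns, so running it inside every fold should add little to the cross-validation time compared with fitting the model itself.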


You need to reduce the dimensionality first, then do variable selection and cross validation. After that, build the training and test sets from the reduced data.