I have a high-dimensional dataset for classification: 1500 features and 45000 data points.
My initial approach for modeling was:
- Divide the dataset into training and testing sets
- Perform variable selection on the training dataset
- Create a new dataset with only the relevant features and perform cross validation on it
- Validate the model against the testing dataset
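
To make the steps above concrete, here is a rough sketch of that workflow, assuming scikit-learn. The synthetic data, `SelectKBest` with `k=20`, and `LogisticRegression` are placeholders chosen only for illustration, and the data is much smaller than the real dataset so that it runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the real data (assumption: far smaller than 45000 x 1500).
X, y = make_classification(
    n_samples=600, n_features=100, n_informative=10, random_state=0
)

# Step 1: divide the dataset into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Step 2: variable selection fitted on the training set only.
# SelectKBest with an ANOVA F-test is a placeholder selection method.
selector = SelectKBest(score_func=f_classif, k=20).fit(X_train, y_train)

# Step 3: reduced dataset, then cross validation on it.
# Note that the selector has already seen every training fold at this point.
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)
model = LogisticRegression(max_iter=1000)
cv_scores = cross_val_score(model, X_train_sel, y_train, cv=5)

# Step 4: validate the final model against the held-out testing set.
model.fit(X_train_sel, y_train)
test_accuracy = model.score(X_test_sel, y_test)

print("CV accuracy per fold:", cv_scores)
print("Test accuracy:", test_accuracy)
```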
I am not sure if my approach is correct. I read online that variable selection should not be performed before cross validation, but performing cross validation on a dataset with 1500 features takes a lot of time.
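
As far as I understand it, the advice I read amounts to refitting the selection step inside every cross validation fold. A minimal sketch of that, assuming scikit-learn's `Pipeline`, could look like the following. The selector, the classifier, `k=20`, and the synthetic data are again illustrative placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the real data (assumption: far smaller than 45000 x 1500).
X, y = make_classification(
    n_samples=600, n_features=100, n_informative=10, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Selection and classifier are bundled, so cross_val_score refits the
# selector on the training part of each fold and never on the held-out part.
pipe = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=20)),
    ("clf", LogisticRegression(max_iter=1000)),
])
cv_scores = cross_val_score(pipe, X_train, y_train, cv=5)

# Refit on the full training set, then score once on the testing set.
pipe.fit(X_train, y_train)
test_accuracy = pipe.score(X_test, y_test)

print("CV accuracy per fold:", cv_scores)
print("Test accuracy:", test_accuracy)
```

The cost is that the selection step runs once per fold instead of once overall, which is where my concern about running time comes from.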
I would really appreciate any input on this!