My dataset has predefined training, validation and test splits, with 100,000 features. Can I reduce the features by using decision trees on the combined (training + validation) data? Or is that strictly not allowed?
You can use a decision tree for feature selection. However, a random forest or xgboost would be a better choice, as their variable importances are more robust: bagged and boosted ensembles are expected to be more stable than a single tree.
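As a minimal sketch of the idea (assuming scikit-learn; the synthetic data, ensemble size, and median threshold are illustrative placeholders, not a recommendation for the 100,000-feature case):

```python
# Feature selection via random forest variable importance,
# using SelectFromModel to keep the higher-importance features.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for a wide binary-classification dataset.
X, y = make_classification(n_samples=500, n_features=200,
                           n_informative=10, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0)

# Keep features whose importance is at or above the median importance.
selector = SelectFromModel(rf, threshold="median")
X_reduced = selector.fit_transform(X, y)

print(X_reduced.shape)  # fewer columns than the original 200
```

The same pattern works with an xgboost model in place of the random forest, since `SelectFromModel` only needs a fitted estimator exposing feature importances.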
Thanks for your response.
I will set up some experiments with random forest or xgboost.
This pre-processing feeds into a subsequent binary classification task.
Help me understand one thing: after reducing the features on the training+validation data, is the dataset somehow compromised for the subsequent classification task (since the validation data has been seen by the random forest/xgboost/decision tree)?
I have gotten nice results using this technique, and hence I am second-guessing myself now … did I do something wrong when I used training+validation data for feature selection? Should I have used only the training data?
You get my drift.