I am currently working on an ML practice problem where I need to predict “item_sales” (a continuous variable). The feature variables are a mix of continuous and categorical variables. I am following these steps:
- Taking all the feature variables
- Imputing missing values in continuous variables with the mean and in categorical variables with the mode
- One-hot encoding the categorical variables
- Fitting a decision tree regressor and getting predictions and the R² score
- Since decision trees often overfit, pruning the tree by tuning hyperparameters such as max_depth and min_samples_split with GridSearchCV
- Getting an improved and more robust model
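The steps above can be sketched with scikit-learn roughly as follows. This is only an illustrative sketch on synthetic data: apart from `item_mrp` and `item_sales`, the column names (`item_weight`, `outlet_type`), the generated data, and the parameter grid values are assumptions, not taken from the actual problem.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (hypothetical): only "item_mrp" and "item_sales"
# are names from the question; "item_weight" and "outlet_type" are made up.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "item_mrp": rng.uniform(30, 270, n),
    "item_weight": rng.uniform(5, 20, n),
    "outlet_type": rng.choice(["grocery", "supermarket"], n),
})
df["item_sales"] = (
    df["item_mrp"] * np.where(df["outlet_type"] == "supermarket", 12.0, 3.0)
    + rng.normal(0, 100, n)
)
# Knock out some values so the imputers have something to fill in.
df.loc[rng.choice(n, 40, replace=False), "item_weight"] = np.nan
df.loc[rng.choice(n, 40, replace=False), "outlet_type"] = np.nan

numeric_cols = ["item_mrp", "item_weight"]
categorical_cols = ["outlet_type"]

# Mean imputation for continuous columns; mode imputation followed by
# one-hot encoding for categorical columns.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="mean"), numeric_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeRegressor(random_state=0)),
])

X = df[numeric_cols + categorical_cols]
y = df["item_sales"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Grid values are arbitrary examples, not recommendations.
param_grid = {
    "tree__max_depth": [3, 5, 8, None],
    "tree__min_samples_split": [2, 10, 50],
    "tree__min_samples_leaf": [1, 5, 20],
}
search = GridSearchCV(model, param_grid, cv=5, scoring="r2")
search.fit(X_train, y_train)

# A large gap between these two scores is a sign of overfitting.
train_r2 = search.score(X_train, y_train)
test_r2 = search.score(X_test, y_test)
print(search.best_params_)
print(f"train R^2 = {train_r2:.3f}, test R^2 = {test_r2:.3f}")
```

Putting the imputers and the encoder inside the pipeline means they are refit on each cross-validation training fold, so the validation folds do not leak into the preprocessing.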
Here are my observations and concerns:
Q1. A continuous variable, “item_mrp”, is getting a very high relative feature importance compared to the others. Why is that?
Q2. Does one-hot encoding make categorical variables less relevant compared to continuous variables?
Q3. Should I consider dimensionality reduction to improve robustness and reduce overfitting? (My data does not have many features, even after one-hot encoding.)
Q4. What can we do to build decision trees that give a high R² score but are also robust (that is, do not overfit and perform well on unseen data)?
This question is specifically about decision trees, so please answer accordingly. Any help is much appreciated.