Decision Tree Pruning and other related queries

Hi All,

I am currently working on an ML practice problem where I need to predict “item_sales” (a continuous variable). The feature variables are a mix of continuous and categorical variables. I am following these steps:

  1. Taking all the feature variables
  2. Imputing missing data in continuous variables with the mean and in categorical variables with the mode
  3. One hot encoding categorical variables
  4. Fitting a decision tree regressor and getting predictions and the R² score
  5. Since decision trees often overfit, pruning the tree by tuning hyperparameters such as max_depth and min_samples_split using GridSearchCV
  6. Getting an improved and robust model
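The steps above can be sketched with scikit-learn. This is a minimal sketch only; the DataFrame `df`, the target `y`, and the column names (`item_mrp`, `item_type`) are assumptions, not from the problem itself:

```python
# Minimal sketch of steps 1-5, assuming a DataFrame `df` and target `y`;
# the column names below are placeholders for the real ones.
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor

num_cols = ["item_mrp"]    # continuous features (assumed names)
cat_cols = ["item_type"]   # categorical features (assumed names)

preprocess = ColumnTransformer([
    # Step 2: mean imputation for continuous, mode for categorical
    ("num", SimpleImputer(strategy="mean"), num_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        # Step 3: one hot encoding
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), cat_cols),
])

model = Pipeline([("prep", preprocess),
                  ("tree", DecisionTreeRegressor(random_state=0))])

# Step 5: prune via hyperparameter tuning with GridSearchCV
param_grid = {"tree__max_depth": [3, 5, 8, None],
              "tree__min_samples_split": [2, 10, 50]}
search = GridSearchCV(model, param_grid, cv=5, scoring="r2")
# search.fit(df[num_cols + cat_cols], y); search.best_score_ gives the CV R²
```

Putting the imputation and encoding inside the pipeline keeps them inside the cross-validation folds, which avoids leaking validation data into the preprocessing.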

Here are my observations and concerns:

Q1. The continuous variable “item_mrp” is getting a very high relative feature importance compared to the others. Why is that?

Q2. Does one hot encoding make categorical variables less relevant compared to continuous variables?

Q3. Should I consider dimensionality reduction to improve robustness and reduce overfitting? (My data does not have many features, even after one hot encoding.)

Q4. What can we do to build decision trees that give a high R² score but are also robust (do not overfit and perform well on unseen data)?

This question is regarding decision trees, so please answer accordingly. Help is much appreciated.

Hi @ismail18,

Let me take up the questions one by one -

That sounds about right: sales likely depend heavily on the MRP of the product, so this feature naturally ends up with high importance. Additionally, if two variables in your dataset are highly collinear, the feature importance may be split between them; when you remove either one, you will notice an increase in the feature importance of the other variable.
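This splitting effect can be demonstrated with a small synthetic example (the data and feature names here are invented purely for illustration):

```python
# Synthetic illustration: x2 is a near-copy of x1, so the two can share
# the importance that x1 alone would otherwise receive.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
x1 = rng.uniform(0, 100, 500)
x2 = x1 + rng.normal(0, 0.1, 500)   # highly collinear with x1
y = 3 * x1 + rng.normal(0, 1, 500)

both = DecisionTreeRegressor(max_depth=5, random_state=0).fit(
    np.c_[x1, x2], y)
alone = DecisionTreeRegressor(max_depth=5, random_state=0).fit(
    x1.reshape(-1, 1), y)

print(both.feature_importances_)   # importance may be split across x1, x2
print(alone.feature_importances_)  # [1.] -- the single feature gets it all
```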

The following article covers this concept; skip directly to the section “feature importance”.

After one hot encoding, the categories of a feature are treated as independent variables, so a particular category may receive high importance while the overall importance of the original feature is no longer directly visible.
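If you want an overall importance for an encoded feature, one common workaround is to sum the importances of its dummy columns back to the parent feature. A sketch with synthetic data; the feature names and the `item_type_` prefix convention are assumptions:

```python
# Sum per-dummy importances back to the parent categorical feature.
# Data, names, and the prefix convention are all assumed for illustration.
import numpy as np
from collections import defaultdict
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = np.c_[rng.uniform(10, 300, 400),       # item_mrp (continuous)
          rng.integers(0, 2, (400, 3))]    # three item_type dummy columns
y = 2 * X[:, 0] + 20 * X[:, 1] + rng.normal(0, 5, 400)

feature_names = ["item_mrp", "item_type_a", "item_type_b", "item_type_c"]
tree = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X, y)

totals = defaultdict(float)
for name, imp in zip(feature_names, tree.feature_importances_):
    # Map each dummy column back to its parent feature name
    parent = name.rsplit("_", 1)[0] if name.startswith("item_type_") else name
    totals[parent] += imp

print(dict(totals))  # aggregated importance per original feature
```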

Since you have only a small number of features, I would say there is no need for dimensionality reduction.

You can reduce the max depth of the tree; a fully grown tree has a very high chance of overfitting. Furthermore, you should use cross-validation and make sure that the R² is roughly the same on both the training and validation sets. If it is very good on the training set but much worse on the validation set, that is a clear sign the model is overfitting.
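That train-versus-validation check can be sketched with scikit-learn's `validation_curve`; the data here is a synthetic stand-in and the depth values are illustrative:

```python
# Compare train vs validation R² across max_depth values to spot
# overfitting; X and y are synthetic stand-ins for the real data.
import numpy as np
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.uniform(0, 100, (300, 4))
y = X[:, 0] * 2 + rng.normal(0, 10, 300)

depths = [2, 4, 8, None]  # None = fully grown tree
train_scores, val_scores = validation_curve(
    DecisionTreeRegressor(random_state=0), X, y,
    param_name="max_depth", param_range=depths, cv=5, scoring="r2")

for d, tr, va in zip(depths, train_scores.mean(1), val_scores.mean(1)):
    print(f"max_depth={d}: train R2={tr:.2f}, validation R2={va:.2f}")
```

A growing gap between the two scores as depth increases is the overfitting signal described above; pick the depth where validation R² peaks.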


Hi @AishwaryaSingh,

Thanks for your reply. Your reply to my queries adds much to my understanding.


© Copyright 2013-2019 Analytics Vidhya