How to find the importance of a single categorical variable in a set of all independent categorical variables?




Does summing up the individual importances of a categorical variable make sense when there are only categorical variables in my set of independent variables? I was not convinced by the replies in the query above.



Try Boruta


Thanks for the suggestion. I would definitely try Boruta, but is there something in Python, since I have built the model in Python? Calculating in R is not a big deal, but it will take some time. Which language has more alternatives for solving the problem I have?



Hi @akshay.kotha

There are a few ways in which you can try reducing the features to make the problem less complex.

  1. Find the features that are highly correlated, and drop some of them.
  2. If the categorical variables have an inherent order, you can use label encoding instead of creating dummies.
  3. Create a new feature by combining multiple existing features, then drop the originals.
  4. Use lasso regression and then find the feature importance.
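Point 4 can be sketched as follows (synthetic data; the alpha value and the data-generating coefficients are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso

# synthetic dummy-encoded predictors: 6 binary columns, 200 rows
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 6)).astype(float)
# only the first two features actually drive the target
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.1, 200)

# the L1 penalty shrinks uninformative coefficients towards zero
model = Lasso(alpha=0.1).fit(X, y)
importance = np.abs(model.coef_)
print(importance)  # large values for the first two features, near zero elsewhere
```

Features whose coefficients are driven to (or near) zero are candidates for dropping.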

Is there a particular dataset you’re working on where you faced this problem? If so, could you share it?


Thanks for taking the time to respond, Aishwarya!

A few more questions:

  1. How do I find the correlation? Any suggested methods?
  2. What do you mean by order in categorical variables? As I understand it, a categorical variable itself has discrete categories. #noob
  3. When you talk about combining features, do you mean the levels within a particular categorical variable, or different categorical variables altogether?



Hi @akshay.kotha

  1. In Python (pandas), you have the corr() function to find the correlation between continuous variables. For categorical variables, you can use the chi-square test.
  2. Suppose you have a categorical variable “Quality” with levels Excellent, Good, Fair, Average and Poor. You can replace these with numbers, e.g. Excellent is 5, Good is 4, and so on.
  3. You can refer to the following discussion thread:
    Modelling technique for categorical predictor and continuous target
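The “Quality” example in point 2 can be sketched in plain Python (the numeric codes are just one possible choice of scale):

```python
# map an ordered categorical to integers that preserve its order
quality_order = {"Poor": 1, "Average": 2, "Fair": 3, "Good": 4, "Excellent": 5}

ratings = ["Good", "Excellent", "Poor", "Fair"]
encoded = [quality_order[r] for r in ratings]
print(encoded)  # [4, 5, 1, 3]
```

With pandas you would do the same in one line via `df["Quality"].map(quality_order)`.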

Happy Learning!



Hey @AishwaryaSingh,

Sorry for the delayed reply.
Unlike @cachu’s problem, mine has a binary target. I am currently figuring out the correlation between 8 categorical variables split into >50 dummy feature levels. I hope to drop some features once I have the correlations, and then combine some.
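A sketch of the “drop highly correlated dummies” step with pandas (the toy frame and the 0.9 threshold are assumptions; tune the threshold for your data):

```python
import numpy as np
import pandas as pd

# toy stand-in for the >50 dummy-encoded columns
df = pd.DataFrame({
    "cat_a": [1, 0, 1, 0, 1, 0],
    "cat_b": [1, 0, 1, 0, 1, 0],  # duplicates cat_a exactly
    "cat_c": [0, 1, 1, 0, 0, 1],
})

corr = df.corr().abs()
# keep only the upper triangle so each pair is checked once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
reduced = df.drop(columns=to_drop)
print(to_drop)  # cat_b is redundant with cat_a
```

Note that Pearson correlation on 0/1 dummies is the phi coefficient, so this is a reasonable screen for redundant dummy columns.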

Anything you would like to add?



Hi @akshay.kotha

Yes, you can drop the highly correlated variables, irrespective of the type of the target variable.
Your approach seems fine to me.

Happy Learning!


Hi @AishwaryaSingh,

Regarding the chi-square test you mentioned: it failed due to lack of data, which led to zero-frequency rows in the contingency tables. I tried Fisher’s exact test, but it assumes fixed marginal sums, which is not my case.

  1. What else can be tried to find highly correlated variables, as mentioned in the previous reply?
  2. How do I find the value of the correlation parameter?

Happily learning!
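A hedged sketch of one option for question 2: Cramér’s V gives a single association value in [0, 1] for a pair of categorical variables, computed from the chi-square statistic (plain-Python sketch; empty levels should be merged or dropped beforehand, since the expected counts must be nonzero):

```python
import math
from collections import Counter

def cramers_v(x, y):
    """Cramér's V: association between two categorical sequences, in [0, 1]."""
    n = len(x)
    obs = Counter(zip(x, y))   # observed contingency counts
    row = Counter(x)
    col = Counter(y)
    # chi-square statistic over the full contingency table
    chi2 = 0.0
    for a in row:
        for b in col:
            expected = row[a] * col[b] / n
            chi2 += (obs[(a, b)] - expected) ** 2 / expected
    k = min(len(row), len(col)) - 1  # degrees-of-freedom normaliser
    return math.sqrt(chi2 / (n * k)) if k else 0.0

print(cramers_v(["a", "a", "b", "b"], ["p", "p", "q", "q"]))  # 1.0 (perfect)
print(cramers_v(["a", "a", "b", "b"], ["p", "q", "p", "q"]))  # 0.0 (independent)
```

Computing this for every pair of categoricals gives a matrix you can threshold the same way as a numeric correlation matrix.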