Multiple categorical IVs, single continuous DV


Hi, I have the following data, which has multiple categorical independent variables and a single continuous dependent variable. I am a little confused about which kind of model should be considered here and what method should be applied for variable reduction.

iv0 iv1 iv2 iv3 iv4 iv5 iv6 dv0
8 32 12 8 9 4 10 71.93
8 32 12 8 8 4 10 71.53
8 8 12 8 9 4 10 71.36
8 32 12 8 3 4 8 70.78
4 32 12 8 9 4 10 70.72

For variable reduction, I have only considered a correlation matrix so far.



Since the levels/classes of your independent variables are already coded as numbers, you can keep them as they are and try tree-based models. Linear models might not give you good results in this case.

I'd suggest going for decision trees, random forest, and GBM.

Hope this helps.


Thanks. I have yet to go through tree-based modelling; I will look into it. Are there any regression or ANOVA methods I could use in the meantime?


I think a correlation matrix is not a good idea, because these independent variables are categorical. It is better to use a random forest to identify the important independent variables. One way is to use the variable importance function built into RF. Another is to randomly shuffle the values of one independent variable and check whether the accuracy is affected: for example, for iv0, reorder the values like c(4,8,8,…) and check whether that has any effect on the predictions. If the accuracy decreases when the values within a variable are shuffled, consider it an important variable.
For modelling purposes, I think it is better to use boosting/bagging or decision trees.
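
The shuffling check can be sketched roughly as follows. This is only a minimal sketch in Python on made-up data, not your data: a simple lookup table of group means stands in for the random forest, and fit_lookup, mse, and permutation_importance are hypothetical helper names, not functions from any library. In the synthetic data, dv is assumed to depend on iv0 only, while iv1 is pure noise.

```python
import random
from collections import defaultdict
from statistics import mean

def fit_lookup(rows, y):
    """Toy stand-in for a tree model: mean dv for each combination of levels."""
    groups = defaultdict(list)
    for r, target in zip(rows, y):
        groups[tuple(r)].append(target)
    grand = mean(y)
    table = {k: mean(v) for k, v in groups.items()}
    # Unseen combinations fall back to the overall mean.
    return lambda r: table.get(tuple(r), grand)

def mse(predict, rows, y):
    """Mean squared error of the predictions."""
    return mean((predict(r) - t) ** 2 for r, t in zip(rows, y))

def permutation_importance(predict, rows, y, col, rng, n_repeats=20):
    """Average increase in MSE after shuffling the values of one column."""
    base = mse(predict, rows, y)
    increases = []
    for _ in range(n_repeats):
        shuffled = [r[col] for r in rows]
        rng.shuffle(shuffled)
        permuted = [r[:col] + (v,) + r[col + 1:] for r, v in zip(rows, shuffled)]
        increases.append(mse(predict, permuted, y) - base)
    return mean(increases)

rng = random.Random(0)
# Synthetic data (made up for illustration): dv depends on iv0 only.
effect = {4: 70.0, 8: 71.5, 16: 69.0}
rows = [(rng.choice([4, 8, 16]), rng.choice([8, 32])) for _ in range(300)]
y = [effect[a] + rng.gauss(0, 0.1) for a, _ in rows]

model = fit_lookup(rows, y)
importance = {name: permutation_importance(model, rows, y, col, rng)
              for col, name in enumerate(["iv0", "iv1"])}
print(importance)
```

A variable whose shuffling raises the error a lot (iv0 here) is worth keeping; one whose shuffling changes almost nothing (iv1 here) is a candidate for removal. With a real model you would ideally measure the error on held-out rows rather than on the training rows.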


The reason I did not suggest regression is that I don't feel the order of the numerical classes in your categorical independent variables conveys any real meaning.
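
If you still want an ANOVA-style screen in the meantime, one option that does not rely on the order of the level codes is a one-way F statistic computed separately for each variable, treating every level as an unordered label. Below is a minimal sketch in plain Python on a tiny made-up data set; one_way_f is a hypothetical helper name, and the numbers are invented for illustration only.

```python
from collections import defaultdict
from statistics import mean

def one_way_f(levels, y):
    """One-way ANOVA F statistic, treating each level as an unordered label."""
    groups = defaultdict(list)
    for lev, target in zip(levels, y):
        groups[lev].append(target)
    n, k = len(y), len(groups)
    grand = mean(y)
    ss_between = 0.0
    ss_within = 0.0
    for g in groups.values():
        m = mean(g)
        ss_between += len(g) * (m - grand) ** 2   # spread of group means
        ss_within += sum((t - m) ** 2 for t in g)  # spread inside each group
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Made-up example: dv differs clearly across iv0 levels, hardly across iv1.
iv0 = [4, 4, 8, 8, 16, 16]
iv1 = [8, 32, 8, 32, 8, 32]
dv = [70.0, 70.2, 71.4, 71.6, 69.0, 69.2]

f_iv0 = one_way_f(iv0, dv)
f_iv1 = one_way_f(iv1, dv)
print(f_iv0, f_iv1)
```

A large F for a variable means the dv means differ across its levels by much more than the spread within each level. This looks at one variable at a time, so it ignores interactions, which the tree-based models can pick up.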