Variable selection and EDA




I have a dataset with around 120 features, of which around 70 are categorical and the rest are numerical. I'm looking to perform EDA and select the variables that seem to have enough predictive power. Each categorical variable has around 10 levels on average. The dataset contains a binary target variable that I have to predict.


  1. How should I proceed: Select variables and then look at their characteristics? Wouldn’t that make the entire process biased?

  2. Would it be a good approach to separate the numerical and categorical variables and then run a separate variable selection algorithm on each group?

  3. In general, when there are a lot of variables, how is the data explored to gain insights about it?


You can use Generalised Linear Models. Your response (dependent) variable is categorical (binary here, so a logistic model applies), and your independent variables can be both categorical and numerical.

You will get the effect of each independent variable on the dependent variable separately, and by grouping some variables together you can estimate their combined effects.
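
As a rough illustration of the idea, here is a dependency-free sketch of a logistic model (a GLM for a binary response) with one categorical and one numeric predictor. The data, the column names `plan` and `usage`, and the helper `fit_logistic` are all made up for the example, and plain gradient descent is used only to keep it self-contained; in practice a library such as statsmodels or scikit-learn would normally do the fitting.

```python
import math

# Hypothetical rows: (plan, usage, target).
# "plan" is categorical, "usage" is numeric, "target" is the binary response.
rows = [
    ("basic", 0.1, 0), ("basic", 0.3, 0), ("basic", 0.5, 0), ("basic", 0.6, 0),
    ("basic", 0.7, 1), ("basic", 0.8, 0), ("basic", 0.9, 1),
    ("premium", 0.1, 0), ("premium", 0.2, 1), ("premium", 0.3, 1),
    ("premium", 0.5, 1), ("premium", 0.7, 1), ("premium", 0.9, 1),
]

# Dummy-code the categorical variable ("basic" is the reference level).
X = [[1.0 if plan == "premium" else 0.0, usage] for plan, usage, _ in rows]
y = [target for _, _, target in rows]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, x):
    """Predicted probability of target == 1; w[0] is the intercept."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))


def fit_logistic(X, y, lr=0.5, epochs=5000):
    """Fit logistic regression weights by batch gradient descent on the log-loss."""
    w = [0.0] * (len(X[0]) + 1)
    n = len(X)
    for _ in range(epochs):
        errors = [predict(w, xi) - yi for xi, yi in zip(X, y)]
        grad = [sum(errors)] + [
            sum(e * xi[j] for e, xi in zip(errors, X)) for j in range(len(X[0]))
        ]
        w = [wj - lr * g / n for wj, g in zip(w, grad)]
    return w


weights = fit_logistic(X, y)
intercept, coef_premium, coef_usage = weights
print("premium vs basic (log-odds):", round(coef_premium, 2))
print("usage (log-odds per unit):", round(coef_usage, 2))
```

Each coefficient is the effect of that variable on the log-odds of the target with the other variable held fixed. A combined effect could be examined by adding a product column such as `premium * usage` to `X` and refitting.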


When there are many variables, it is advisable to use PCA or Factor Analysis to identify the influential variables. Once you have identified them, you can choose which method to use.
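
To make this a bit more concrete, below is a minimal sketch of the first principal component of a few numeric variables, computed from their correlation matrix by power iteration. The variables `x1`, `x2`, `x3` and their values are invented for illustration (`x2` is deliberately an exact multiple of `x1`, and `x3` is uncorrelated with both), and the helper names are my own, not from any library.

```python
import math

# Hypothetical numeric features, one list per variable.
data = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "x2": [2.0, 4.0, 6.0, 8.0, 10.0, 12.0],
    "x3": [2.0, 1.0, 0.0, 0.0, 1.0, 2.0],
}


def standardize(values):
    """Centre to mean 0 and scale to unit (population) standard deviation."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / sd for v in values]


names = list(data)
Z = [standardize(data[name]) for name in names]
n = len(Z[0])

# Correlation matrix of the standardized variables.
corr = [[sum(a * b for a, b in zip(zi, zj)) / n for zj in Z] for zi in Z]


def leading_component(matrix, iterations=200):
    """Largest eigenvalue and its unit eigenvector, found by power iteration."""
    v = [1.0] * len(matrix)
    for _ in range(iterations):
        w = [sum(m * x for m, x in zip(row, v)) for row in matrix]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(
        vi * sum(m * x for m, x in zip(row, v)) for vi, row in zip(v, matrix)
    )
    return eigenvalue, v


eigenvalue, loadings = leading_component(corr)
explained_share = eigenvalue / len(names)  # trace of a correlation matrix = number of variables

for name, loading in zip(names, loadings):
    print(name, round(loading, 3))
print("share of variance explained:", round(explained_share, 3))
```

With this toy data the first component loads equally on `x1` and `x2` and not at all on `x3`, and it accounts for two thirds of the total variance, which is the sort of pattern one looks for when deciding which variables move together.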


I have used PCA before, so I know it can only be used on numeric variables. Can Exploratory Factor Analysis be used on categorical variables?


Yes, PCA can also be used on categorical variables.


Hi @krishnamurthypranesh

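For categorical variables we use Correspondence Analysis. Running PCA on purely categorical data means encoding the levels numerically first: integer codes impose an order on the levels, and even one-hot columns are treated as ordinary numeric variables, which I think is not what you want.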
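
For what it's worth, here is a small sketch of simple correspondence analysis on a single two-way table (one categorical feature against the binary target). The counts, the level names and the function `correspondence_first_axis` are hypothetical, and only the first axis is extracted, by power iteration, to avoid any dependencies; with many categorical variables one would more likely reach for multiple correspondence analysis in a dedicated package.

```python
import math

# Hypothetical contingency table: rows are levels of one categorical
# feature, columns are the binary target (0, 1).
levels = ["A", "B", "C"]
table = [[30, 10], [20, 20], [10, 30]]


def correspondence_first_axis(table, iterations=200):
    """Total inertia, first principal inertia and row coordinates on axis 1.

    Assumes every row and every column of the table has a non-zero total.
    """
    total = sum(sum(row) for row in table)
    P = [[count / total for count in row] for row in table]
    r = [sum(row) for row in P]          # row masses
    c = [sum(col) for col in zip(*P)]    # column masses
    # Standardized residuals from the independence model.
    S = [
        [(P[i][j] - r[i] * c[j]) / math.sqrt(r[i] * c[j]) for j in range(len(c))]
        for i in range(len(r))
    ]
    total_inertia = sum(s * s for row in S for s in row)  # equals chi-square / total

    # Power iteration on S^T S for the leading right singular vector.
    v = [float(j + 1) for j in range(len(c))]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        u = [sum(s * x for s, x in zip(row, v)) for row in S]
        w = [sum(S[i][j] * u[i] for i in range(len(r))) for j in range(len(c))]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # rows and columns are exactly independent
            break
        v = [x / norm for x in w]

    u = [sum(s * x for s, x in zip(row, v)) for row in S]  # singular value times left vector
    first_inertia = sum(x * x for x in u)
    row_coords = [u[i] / math.sqrt(r[i]) for i in range(len(r))]
    return total_inertia, first_inertia, row_coords


total_inertia, first_inertia, coords = correspondence_first_axis(table)
for level, coord in zip(levels, coords):
    print(level, round(coord, 3))
print("share of inertia on axis 1:", round(first_inertia / total_inertia, 3))
```

The levels are placed on the axis according to how their target distribution differs from the overall one, without assuming any order among them. Because the target has only two columns here, a single axis carries all of the inertia; the sign of the axis is arbitrary.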

If you use a linear model and want to reduce the number of variables, other methods could be used.

Best regards