How to find the most significant variables out of 500?

machine_learning
dimensionality

#1

Hi All,

I am working on a problem with 500 numerical variables and 10,000 observations. My usual approach is to start with hypothesis generation and then move on to univariate, bivariate and multivariate analysis, but with 500 variables that is not easy to do.

I have read about dimensionality reduction methods for this situation, and PCA has been suggested. Are there any disadvantages to using PCA, and is it easy to communicate to the end user that these are the most significant variables?

Please suggest good techniques for dimensionality reduction, and which method is effective in which situation.
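
For context, this is roughly the PCA step I have in mind: a minimal scikit-learn sketch, assuming the 500 numeric columns sit in an array or DataFrame `X` (the name is just a placeholder).

```python
# Minimal PCA sketch; `X` is assumed to be a DataFrame or array of the 500 numeric predictors.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(X)   # PCA is scale-sensitive, so standardise first

pca = PCA(n_components=0.95)                   # keep enough components for ~95% of the variance
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                              # (10000, k) with k typically far below 500
print(np.cumsum(pca.explained_variance_ratio_))     # cumulative variance explained per component
```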

Regards,
Imran


#2

@Imran - You can also try the feature selection methods available with most algorithms.

You can try the backward selection technique if you are using regression, or the variable importance measures (and plots) that come with various decision tree techniques. I have seen people use random forest specifically for feature selection.

Checking for multicollinearity also helps with dimension reduction: it tells you which variables are highly correlated with each other and therefore redundant.
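
As a rough sketch of the random forest route (scikit-learn; `X` and `y` are just placeholder names for your feature matrix and target):

```python
# Rough sketch: ranking features by random forest importance.
# `X` is a DataFrame of predictors and `y` the target; both names are placeholders.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=500, random_state=42, n_jobs=-1)
rf.fit(X, y)

importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances.head(20))   # top 20 candidate variables to keep
```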


#3

Hi Imran,
In many analyses you start with more than 1,000 variables. The first step we generally do for an initial shortlist is to compute the information value (IV) of each variable. Variables with a high IV are retained; this usually cuts the variable list by 70-80% and removes near-constant variables as well as variables that are independent of the target. Once this is done, we run a principal component analysis, which collapses groups of similar variables. Finally, we remove variables stepwise using a chi-squared test.
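
A rough sketch of the IV step, assuming a binary target; the quantile binning, the smoothing constant and the cut-offs mentioned in the comment are common rules of thumb rather than fixed rules:

```python
# Rough sketch: information value (IV) of one numeric predictor against a binary target.
# `x` and `y` are placeholder pandas Series.
import numpy as np
import pandas as pd

def information_value(x, y, n_bins=10, eps=1e-6):
    df = pd.DataFrame({"bin": pd.qcut(x, q=n_bins, duplicates="drop"), "y": y})
    grouped = df.groupby("bin", observed=True)["y"].agg(events="sum", total="count")
    grouped["non_events"] = grouped["total"] - grouped["events"]

    dist_events = grouped["events"] / grouped["events"].sum()
    dist_non_events = grouped["non_events"] / grouped["non_events"].sum()

    woe = np.log((dist_events + eps) / (dist_non_events + eps))   # weight of evidence per bin
    return ((dist_events - dist_non_events) * woe).sum()

# e.g. keep variables whose IV exceeds a threshold such as 0.02 (weak) or 0.1 (medium):
# ivs = {col: information_value(df[col], df["target"]) for col in predictor_columns}
```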

Hope this helps.
Tavish


#4

Hi @tavish_srivastava,

I have two questions.

  1. When should I ideally use an algorithm like random forest to determine the importance of the variables? Is it right to use such an algorithm to identify the important variables instead of following the IV, PCA and chi-square route?

  2. If I want to check for multicollinearity, where in the sequence IV > PCA > chi-square is the best place to do it?

Thank you.


#5

Hi,

  1. Random forest variable importance estimates come in handy because they take interactions between variables into account. But if you are building a logistic regression or any other kind of linear classifier, random forest is not the best way to judge variable importance.
  2. We use PCA / factor analysis / Varclus to remove multicollinearity, so ideally there is minimal multicollinearity left after that step. What really matters is whether any multicollinearity remains after the stepwise selection, so it is fine as long as the VIF of every final variable is low (see the sketch below).
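
A rough sketch of that final VIF check with statsmodels; `X_final` is a placeholder for the DataFrame holding only the surviving variables:

```python
# Rough sketch: VIF check on the final shortlist of variables.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X_const = sm.add_constant(X_final)   # VIF needs an intercept column
vif = pd.Series(
    [variance_inflation_factor(X_const.values, i) for i in range(1, X_const.shape[1])],
    index=X_final.columns,
)
print(vif.sort_values(ascending=False))  # values above roughly 5-10 suggest lingering multicollinearity
```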
Hope this answers your questions.
Tavish

#6

Dear @tavish_srivastava, thanks for the reply. How important are the pre-processing steps before performing feature selection? I am using random forest for variable importance. Is it necessary to do pre-processing steps such as data type conversion and converting categories into buckets?
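
For example, is something along these lines enough before fitting the forest? (A sketch only; the DataFrame and column names are made up.)

```python
# Sketch of the kind of pre-processing meant here, before running random forest importance.
# `df` and the column names ("income", "city", "segment", "target") are placeholders.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

df = df.fillna({"income": df["income"].median()})      # simple imputation for a numeric column
df = pd.get_dummies(df, columns=["city", "segment"])   # one-hot encode categorical columns

X = df.drop(columns=["target"])
y = df["target"]
rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)
```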

Kindly advise.

Thanks,
Regards,
Karthikeyan P


#7

Hey,

Won’t PCA reduce the variables to components that can’t easily be explained? How will one find and describe the important original variables after doing PCA?
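
For instance, after fitting PCA each component is just a weighted mix of the original columns, and the loadings show that mix (sketch below assumes a fitted scikit-learn `pca` object and the list of original `columns`), but is that really something an end user can act on?

```python
# Each component is a weighted combination of the original variables; the loadings
# give those weights.  `pca` and `columns` are placeholders for a fitted
# sklearn.decomposition.PCA object and the original column names.
import pandas as pd

loadings = pd.DataFrame(
    pca.components_.T,
    index=columns,
    columns=[f"PC{i + 1}" for i in range(pca.n_components_)],
)
print(loadings["PC1"].abs().sort_values(ascending=False).head(10))  # top contributors to PC1
```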


#8

Is there any solution? I am planning to use a decision tree with standard deviation (reduction) in place of Shannon entropy as the split criterion.
Not sure whether it will work…