Multicollinearity?

data_science

#1

Hi All,

Without variable transformation, how do we remove multicollinearity?

Insights appreciated.

tony


#2

Multicollinearity can be detected by checking the VIF values and then removing the predictors with a high VIF.
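A minimal sketch of that check in R, assuming a hypothetical data frame df with a response y and numeric predictors:

library(car)                 # provides vif()
fit <- lm(y ~ ., data = df)  # fit the full linear model
vif(fit)                     # VIFs above roughly 5-10 flag problematic predictors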


#3

Hi @tillutony

The motive to remove multicollinearity is to make sure that two or more features don’t exhibit the same relationship with the outcome feature.

The ideal way to remove multicollinearity is by dropping the features or using a simple or weighted combination of these features instead. I don’t exactly understand the need for transformation in the first place.
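As a minimal sketch of the combination idea in R (assuming hypothetical correlated columns x1 and x2 in a data frame df):

# replace two correlated features by the average of their standardised values
df$x_combined <- as.numeric((scale(df$x1) + scale(df$x2)) / 2)
df <- df[, !(names(df) %in% c("x1", "x2"))]   # drop the original features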

Best,
Saurav.


#4

Hi All

Thanks for your inputs.

Do we have any solution or algorithm to handle multicollinearity without dropping the features? By transforming I mean feature extraction (PCA).

Insights much appreciated.

Regards,
Tony


#5

@tillutony

Hi Tony,

i) You could try a log transformation of the variables.
ii) Use Partial Least Squares (PLS) regression or Principal Components Regression, methods that reduce the predictors to a smaller set of uncorrelated components (see the sketch after this list).
iii) Remove highly correlated predictors from the model.
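A minimal sketch of (ii) with the pls package, assuming a hypothetical data frame df with a response y:

library(pls)
pls_fit <- plsr(y ~ ., data = df, ncomp = 3, validation = "CV")  # partial least squares regression
pcr_fit <- pcr(y ~ ., data = df, ncomp = 3, validation = "CV")   # principal components regression
summary(pls_fit)  # the extracted components are uncorrelated by construction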

Thanks,
Abhishek Das


#6

Hi All,

Multicollinearity refers to predictors that are correlated with other predictors in the model.

It can increase the variance of the coefficient estimates and make them very sensitive to minor changes in the model. As a result, the coefficient estimates become unstable and difficult to interpret; coefficients might even switch signs or turn out insignificant.

VIF is one of the effective measures to identify multicollinearity among the predictor variables. The model needs to be iterated as many times as needed until all predictor variables come out significant and the respective VIFs come down to an acceptable range (< 5).
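As a minimal sketch of that iteration in R (assuming a hypothetical data frame df with a response y and numeric predictors):

library(car)
predictors <- setdiff(names(df), "y")
repeat {
  fit  <- lm(reformulate(predictors, response = "y"), data = df)
  vifs <- vif(fit)
  if (max(vifs) < 5 || length(predictors) < 3) break          # stop once all VIFs are acceptable
  predictors <- setdiff(predictors, names(which.max(vifs)))   # drop the worst offender and refit
}
summary(fit)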

Now coming to variable transformation: as per my understanding, one needs to transform all the categorical variables into dummy or indicator variables and then check for multicollinearity before running the model.

Any feedback is appreciated.

BR,
Subhau


#7

Hi Experts,

Thanks for your inputs again. I am aware of the above conclusions.

If the client wants the features to remain in his model and we observe collinearity between the predictors,
how do we troubleshoot this situation? Without removing, dropping, transforming, or extracting (PCA) the features is what I would like to know.
Do we have any algorithm which solves the above issue?

Insights much appreciated.

Regards,
Tony


#8

Hi,

I am not sure why a customer would want a certain feature to remain in the model. It’s your model that needs to decide which features should be there, i.e. those that come out statistically significant and best explain the response variable; otherwise the model will itself be biased and will not be able to exhibit the best possible results.

I am not aware of any algorithm which can deal with the above-stated scenario (there might be one, or a combination of a few, like an ensemble method, which can handle it without compromising the data quality).

Could you please state a real-life example which best explains the scenario you mentioned?

BR,
Subhau


#9

Hi Subhau,

I faced the above question recently in my last interview and was unable to respond, so I posted it for discussion.
I had mentioned all the above solutions for handling collinearity.

Regards,
Tony


#10

Hi Tony,

My attempt at answering this situation would be to understand the rationale behind including the feature in the model and how important it is in terms of handling the business problem at hand.

Once we have insights about the feature, one needs to see how the model behaves in explaining the dependent variable with and without that specific feature.

Ultimately, we are trying to predict something based on the data available at hand. If any specific feature is mandated from a business perspective, then the data set needs to be enriched accordingly, and the statistical significance of that feature should be checked even before trying to build the model.

Any feedback is highly appreciated.

Thanks,
Subhau


#11

Hi SubhayuNath,

I agree with your comments, but he was specifically asking me which algorithm does this.

I have found the below option for random forests:

varimp(cf1, conditional = TRUE)  # conditional = TRUE adjusts for correlations between predictors
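A fuller sketch of that call with the party package (readingSkills is a small example data set shipped with party whose predictors are correlated):

library(party)
data("readingSkills", package = "party")
cf1 <- cforest(score ~ ., data = readingSkills,
               controls = cforest_unbiased(mtry = 2, ntree = 50))
varimp(cf1, conditional = TRUE)  # importance adjusted for correlations between predictors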

Regards,
tony


#12

Hi @tillutony

When you face multicollinearity you should consider the algorithm you will use for prediction. Random forest is certainly not the one to take, as the trees are built on a random selection of variables: one time a variable is picked, another time its correlated counterpart is taken instead. Boosting will do better as it considers all the variables (if you use xgboost, set the column subsampling to 1); how it builds the trees is then based on the gain. On the whole, boosting can handle multicollinearity quite well, as the main variable will capture most of the gain (still check the gain values). With xgboost you can verify this by printing the variable importance: if a variable is not present, it was not used. GBM is not as strong, I think; I should check again.
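As a minimal sketch of that check (assuming a hypothetical numeric feature matrix X, with column names, and a response vector y):

library(xgboost)
bst <- xgboost(data = as.matrix(X), label = y, nrounds = 50,
               objective = "reg:squarederror",
               colsample_bytree = 1)   # every tree is allowed to see all columns
xgb.importance(model = bst)           # gain per feature; features that never appear were not used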
Now with a linear model it is different: multicollinearity will inflate the confidence intervals of your parameters. Use the VIF (variance inflation factor) to check the effect. What to accept? As a rule of thumb, below 2 is OK.
If you use PCA, great: there is no correlation between the components. Your problem then becomes explaining the contribution of each variable to the principal components, for which you need the projections. I would advise using FactoMineR (and its plug-in); it will give you the contribution of each variable to the components, expressed in terms of the angle of projection, so you can do the conversion and explain to your customer that the main component (if it is important) is made of X% of this variable plus Y% of that one. Similarly for PC2. Now, if your customer wants to know exactly what it is, you can print the table of projections and cross your fingers that he understands it.
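A minimal sketch of getting those contributions with FactoMineR (assuming a hypothetical numeric data frame X):

library(FactoMineR)
res <- PCA(X, scale.unit = TRUE, graph = FALSE)
round(res$var$contrib, 1)  # % contribution of each variable to each principal component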
Hope this helps you answer your customer.
Alain


#13

Hi lesaffrea,

Thanks a ton for your valuable inputs.

Thanks,
Tony