How to use Chi-square test to find multicollinearity?

multi_collinearity
chi-square-test

#1

How can the chi-square test be used to find multicollinearity between categorical variables?
I’m using the Ames housing dataset, which has 30 to 34 categorical variables with 3-6 classes each. I want to use the chi-square test the way we use a correlation function to find multicollinearity between continuous features.
Is there any better way to handle this problem?


#2

Hi @deva123 ,

The chi-square test checks for a statistically significant association between two categorical variables. For example, in the House Price dataset, you can apply the chi-square test to Street and SaleCondition:

import pandas as pd
from scipy.stats import chi2_contingency

# Build the contingency table of the two columns and run the test on it
chi2_contingency(pd.crosstab(train['Street'], train['SaleCondition']))

Running this command returns four values:

  • The test statistic - chi2
  • The p-value of the test - p
  • Degrees of freedom - dof
  • The expected frequencies, based on the marginal sums of the table - expected
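For readability, you can unpack the result into named variables (a small sketch using the same call as above):

chi2, p, dof, expected = chi2_contingency(pd.crosstab(train['Street'], train['SaleCondition']))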

The output would be:

(19.435557555934711,
 0.0015941181470109059,
 5,
 array([[4.15068493e-01, 1.64383562e-02, 4.93150685e-02, 8.21917808e-02, 4.92328767e+00, 5.13698630e-01],
        [1.00584932e+02, 3.98356164e+00, 1.19506849e+01, 1.99178082e+01, 1.19307671e+03, 1.24486301e+02]]))

Here, our p-value is 0.0015941181470109059.

Since the p-value is less than the significance level (0.05), we conclude that there is a relationship between the two variables Street and SaleCondition.


#3

Thank you for sharing this information.
Most of the time I use chi-square as you mentioned above, and if two features are highly correlated, or the relation between them is significant, we drop one of them (to avoid multicollinearity).
But I want to apply the chi-square test to all categorical variables simultaneously, to check how significant they are with each other, like we apply the cor() function to all numerical variables at once in R.
Basically, can we find multicollinearity between categorical variables all at once (in table format), when some of them are ordinal and some of them are nominal?


#4

Hi @deva123

If you want to compare p-values for all the variables simultaneously, you can use a for loop:
calculate the p-value for every variable against one fixed variable, then repeat the same for each of the others, as in the sketch below.
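A minimal sketch of that idea, assuming train is the dataset and that its object-typed columns are the categorical ones:

import pandas as pd
from scipy.stats import chi2_contingency

cat_cols = train.select_dtypes(include='object').columns
pvals = pd.DataFrame(index=cat_cols, columns=cat_cols, dtype=float)

# Run the chi-square test on every pair of categorical columns
for c1 in cat_cols:
    for c2 in cat_cols:
        _, p, _, _ = chi2_contingency(pd.crosstab(train[c1], train[c2]))
        pvals.loc[c1, c2] = p

print(pvals)  # a cor()-style table, but of pairwise p-values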


#5

I think that is the only way. I was wondering whether there is a direct function that can handle it if we pass the categorical columns.
Thank you for answering!


#6

Hi @AishwaryaSingh,
I had a doubt about your answer: “Calculate p-value for all variables corresponding to one variable, and repeat the same.”
Do you mean to say that we take every pair of categorical variables and do a chi-square test?


#7

Hi @ajayram198

My idea was: suppose we have 5 categorical variables (x1, x2, …), and we wish to calculate the p-value for each pair. I would fix one (let’s say x1), apply the chi-square test with the other four (x1-x2, x1-x3, x1-x4, x1-x5), then fix another variable (x2) and repeat the same.


#8

Yes, I mean taking every pair of categorical variables and doing a chi-square test, but at once, with all categorical variables, just like we do using cor() [the correlation function] in R with all numerical columns.
Is that possible with the chi-square test, or do I have to run the chi-square test in a for loop?


#9

Hi there,

This was helpful, but how do I get the decimal rounded off to more than one digit? For whichever variables I calculate the four values, the p-value turns out to be 0.0.


#10

Hi @akshay.kotha,

This might happen when the p-value is actually very small. You can try changing the number of significant digits displayed after the decimal.
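For example, with a hypothetical tiny value, scientific notation shows the digits that the default rounding hides:

p = 1.5941181470109059e-30  # a hypothetical very small p-value
print(round(p, 4))   # 0.0 -- ordinary rounding hides it
print(f"{p:.3e}")    # 1.594e-30
# If even this prints 0.000e+00, the value has underflowed to exactly zero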

Just for clarification, are you using python? or R?


#11

I am using Python. But in the above example, there are many digits after the decimal without adding any extra constraint.

Regards
Akshay


#12

Hi @akshay.kotha

Can you please share the code that you have used?


#13

Let’s assume we removed collinearity between predictors using the chi-square test and correlation functions, but after this preprocessing some collinearity remains, in the form of one or more predictors being functions of two or more of the other predictors (in the context of linear regression). To remove that, can we use the variance inflation factor, or do you suggest any other method?


#14
import pandas as pd
from scipy.stats import chi2_contingency

# Contingency table of the two categorical columns
ct = pd.crosstab(main_rf_ind['subject'], main_rf_ind['country'])
chi2_contingency(ct)

The output:
(27379.108729573338,
 0.0,
 56,
 array([[1.72628349e+03, 2.79542426e+02, 1.56001146e+01, 1.11041809e+03, 4.64815588e+03],
        [7.04048522e+02, 1.14008755e+02, 6.36236036e+00, 4.52873600e+02, 1.89570676e+03],
        [1.77509870e+02, 2.87447225e+01, 1.60412489e+00, 1.14181809e+02, 4.77959473e+02],
        [2.99547906e+02, 4.85067191e+01, 2.70696076e+00, 1.92681803e+02, 8.06556611e+02],
        [4.51762619e+02, 7.31553186e+01, 4.08249785e+00, 2.90592704e+02, 1.21640686e+03],
        [1.10943669e+02, 1.79654515e+01, 1.00257806e+00, 7.13636307e+01, 2.98724671e+02],
        [1.88626426e+03, 3.05448607e+02, 1.70458321e+01, 1.21332445e+03, 5.07891685e+03],
        [3.73880164e+02, 6.05435717e+01, 3.37868805e+00, 2.40495435e+02, 1.00670214e+03],
        [6.65662013e+01, 1.07792709e+01, 6.01546835e-01, 4.28181784e+01, 1.79234803e+02],
        [6.19021295e+03, 1.00240033e+03, 5.59398453e+01, 3.98180514e+03, 1.66676417e+04],
        [6.43473279e+02, 1.04199619e+02, 5.81495274e+00, 4.13909058e+02, 1.73260309e+03],
        [3.37335319e+03, 5.46257519e+02, 3.04843884e+01, 2.16988256e+03, 9.08302234e+03],
        [1.16890249e+03, 1.89283997e+02, 1.05631624e+01, 7.51887213e+02, 3.14736313e+03],
        [4.21364054e+02, 6.82327849e+01, 3.80779146e+00, 2.71039069e+02, 1.13455630e+03],
        [2.21887338e+02, 3.59309031e+01, 2.00515612e+00, 1.42727261e+02, 5.97449342e+02]]))

This is the piece of code. By the way, I figured out the problem: the chi-square test is not applicable when there are zero frequencies in the contingency table. Which test would be useful now? From whatever I have studied so far, it seems to be Fisher’s exact test.
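A quick sketch for spotting that situation, counting the empty cells in the same crosstab:

ct = pd.crosstab(main_rf_ind['subject'], main_rf_ind['country'])
print((ct == 0).sum().sum(), 'empty cells in the contingency table')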

Regards
Akshay


#15

Hi @deva123,

You can use the Variance Inflation Factor (VIF) to measure collinearity among predictor variables. If the VIF is high, you can treat the variables accordingly.
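A minimal sketch using statsmodels; the column names are just placeholder numeric predictors from the House Price data, and train is assumed to be the DataFrame:

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# VIF expects a design matrix with an intercept column
X = sm.add_constant(train[['LotArea', 'GrLivArea', 'OverallQual']])

vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif.drop('const'))  # values above roughly 5-10 usually signal collinearity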


#16

Hi @akshay.kotha

As you noticed, the chi-square test fails when you have zero frequencies, while Fisher’s Exact Test has no such condition. It is an alternative to the chi-square test that can be used when you have at least one value in each row and column.
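A minimal sketch on a hypothetical 2x2 table (note that scipy’s fisher_exact only supports 2x2 tables):

from scipy.stats import fisher_exact

# Hypothetical 2x2 contingency table of counts
table = [[8, 2],
         [1, 5]]
oddsratio, p = fisher_exact(table)
print(oddsratio, p)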


#17

Hi Aishwarya,

Thanks for following up.

This is the image of Fisher’s test formula for a 2x2 contingency table.

It seems that Fisher’s test is only applicable to a 2x2 contingency table. I have not found any sources explaining the application of Fisher’s test to an m×n table in Python (my case). Please share if you know any other sources or alternatives to Fisher’s test.

Regards
Akshay


#18

Hi @deva123
The way to do this is with a log-linear model, not chi-square, which only works between two categorical variables at a time, while you have 30. It is the categorical equivalent of correlation for continuous variables.
But I think your point is that you want to use these variables in a model that does not tolerate multicollinearity (dependence, in this case). One solution is to change the model; some models will tolerate it. A rough sketch of the log-linear idea is below.
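A rough sketch of that log-linear idea, fitted as a Poisson GLM on the cell counts (statsmodels is assumed, and the two columns are just the examples used earlier in the thread); a large residual deviance for the main-effects-only model indicates dependence:

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Cell counts of the two-way table, keeping empty cells
counts = (pd.crosstab(train['Street'], train['SaleCondition'])
            .stack().reset_index(name='n'))

# Independence model: main effects only, no interaction term
model = smf.glm('n ~ Street + SaleCondition', data=counts,
                family=sm.families.Poisson()).fit()

# Under independence, the residual deviance is roughly chi-square
# distributed with (rows-1)*(cols-1) degrees of freedom
print(model.deviance, model.df_resid)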
Best regards
Alain


#19

Thanks for the reply.


#20

Hi @Lesaffrea,

I think changing the model is the last option. On a similar problem, I have tried Fisher’s test and the chi-square test (which failed), and I am currently looking at the G-test. Still, I want to understand how the correlation values really help when used in any model.

Akshay