While studying statistics, I learned that the chi-square test is used to try to reject a null hypothesis. However, I have yet to understand how this test is applied during data exploration. And how important is it compared to other techniques that help us determine the correlation between two variables?
Here is one way to use the chi-square test. Assume that there are 2 features (X1, X2) involved in a supervised learning problem. Use the following code:
tab <- table(X1, X2)
chisq.test(tab)
If you get a small p-value (the null hypothesis of independence is rejected), it means that X1 and X2 are associated, so one of them may be redundant and you can consider eliminating it in modeling. If you get a large p-value (fail to reject the null hypothesis), there is no evidence that the variables are related, so use both variables in modeling.
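To make the two outcomes concrete, here is a minimal sketch on made-up data. The variables X1, X2, X3 and the 80% copy rate are assumptions for illustration only, not taken from any real dataset. X2 mostly copies X1, while X3 is drawn independently of X1.

```r
# Synthetic data, invented purely for illustration.
set.seed(42)
n <- 500
X1 <- factor(sample(c("a", "b", "c"), n, replace = TRUE))

# X2 copies X1 about 80% of the time, so the two are strongly associated.
X2 <- factor(ifelse(runif(n) < 0.8,
                    as.character(X1),
                    sample(c("a", "b", "c"), n, replace = TRUE)))

# X3 is generated without looking at X1 at all.
X3 <- factor(sample(c("u", "v"), n, replace = TRUE))

res_dep <- chisq.test(table(X1, X2))  # associated pair
res_ind <- chisq.test(table(X1, X3))  # independent pair

res_dep$p.value  # very small: independence is rejected
res_ind$p.value  # usually not small: no evidence of association
```

Keep in mind that with a very large number of rows even a weak association gives a tiny p-value, so the p-value alone does not say how redundant a feature is.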
Hope this helps
Thanks! It really helped.
Also, can the variables X1 and X2 be of different types, like categorical and continuous? Are there any such constraints on its use?
I have only seen it used with categorical variables. For continuous variables you can typically use standard correlation techniques, as they are numeric. But the cor function in R will not work for factor variables, so the chi-squared test is a good way to look for redundant features (dimensionality reduction) among categorical variables.
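A small sketch of the point above, on made-up data (the names group, value and value_bin are hypothetical). It shows cor failing on a factor. It then shows one possible workaround for a categorical/continuous pair that was not suggested in the thread: bin the continuous variable with cut and test the resulting table. Binning throws away information, so treat it as a rough check rather than a recommended method.

```r
# Synthetic data, invented purely for illustration.
set.seed(1)
n <- 300
group <- factor(sample(c("low", "high"), n, replace = TRUE))
# Continuous variable whose mean depends on the group.
value <- rnorm(n, mean = ifelse(group == "high", 1, 0))

# cor() needs numeric input; passing a factor raises an error.
cor_attempt <- tryCatch(cor(group, value), error = function(e) "error")

# Assumed workaround: cut the continuous variable into three
# quantile-based bins, then run the chi-squared test on the table.
value_bin <- cut(value,
                 breaks = quantile(value, probs = c(0, 1/3, 2/3, 1)),
                 include.lowest = TRUE,
                 labels = c("low", "mid", "high"))
res <- chisq.test(table(group, value_bin))
res$p.value  # small here, because value was built to depend on group
```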
Ok. Thanks again!
I used it on one of the Kaggle datasets:
data1 <- read.csv("San Francisco/train.csv")
chisq.test(table(data1$Category, data1$DayOfWeek))
data: table(data1$Category, data1$DayOfWeek)
X-squared = 6057.747, df = 228, p-value < 2.2e-16
chisq.test(table(data1$Category, data1$PdDistrict))
data: table(data1$Category, data1$PdDistrict)
X-squared = 125420.1, df = 342, p-value < 2.2e-16
However, when I tried the following:
chisq.test(table(data1$Category, data1$Category))
data: table(data1$Category, data1$Category)
X-squared = 33365862, df = 1444, p-value < 2.2e-16
I still see a very small p-value. Is this expected?
What is the probability that a table like this would arise if Category were independent of Category? It will be very low, right?
Chi-square tests are used to find associations between the levels of categorical variables: does age (say in bins of 25-30, 30-45, 45+) affect brand preference (Pepsi, Coke, Limca)?
But if you test Brand Preference against Brand Preference, the hypothesis that the two are independent is obviously false, so the p-value will be very low.
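The self-comparison can be reproduced on a toy example. This is a sketch with made-up brand data, not the Kaggle file. When a variable is tabulated against itself, every count sits on the diagonal, and the statistic works out to n * (k - 1) for n rows and k levels. The figure posted above fits that pattern: df = 1444 = 38 * 38 implies k = 39, and 33365862 = 878049 * 38.

```r
# Synthetic data, invented purely for illustration.
set.seed(7)
n <- 300
brand <- factor(sample(c("Pepsi", "Coke", "Limca"), n, replace = TRUE))

# Cross-tabulating a variable with itself: all off-diagonal cells are 0.
self_tab <- table(brand, brand)

res_self <- chisq.test(self_tab)
res_self$statistic  # n * (k - 1) = 300 * 2 = 600
res_self$parameter  # df = (k - 1)^2 = 4
res_self$p.value    # effectively zero: independence is clearly rejected
```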
Hope this helps!!