How to find the correlation between variables?



I am new to data science. I am building a regression model. Could you please help me with the best way to find the correlation between two categorical variables, or between one categorical and one continuous variable, in SAS?




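You can use various statistical tests and visualization methods to examine the relationship between two categorical variables, or between a categorical and a continuous variable. For two categorical variables, the chi-square test and a stacked column chart are common choices. For one categorical and one continuous variable, you can use a t-test or z-test (two groups) or ANOVA (more than two groups).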
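As a rough sketch of two of these tests in SAS, the following assumes a dataset named test with hypothetical variables cat_var1 and cat_var2 (categorical) and cont_var (continuous). PROC GLM is part of SAS/STAT, so this also assumes that product is licensed.

```sas
/* Dataset and variable names are placeholders; replace them with your own. */

/* Chi-square test of association between two categorical variables */
proc freq data=test;
  tables cat_var1*cat_var2 / chisq;
run;

/* One-way ANOVA: does the mean of cont_var differ across levels of cat_var1? */
proc glm data=test;
  class cat_var1;
  model cont_var = cat_var1;
run;
quit;
```

A small p-value in either output suggests the two variables are related, but it does not by itself say how strong the relationship is.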





If both variables are continuous, we can use the Pearson correlation coefficient. If one of the variables is categorical, the Pearson correlation coefficient is not meaningful; in that case we can use polyserial correlation. If both variables are categorical, we can compute polychoric correlation. Both measures assume that the categorical variables are ordinal, i.e. their categories have a natural order.

Polyserial Correlation:
When you have a continuous variable and a categorical variable, you should not compute the Pearson correlation between them. SAS will of course produce a value, but its interpretation is misleading, because Pearson correlation assumes that both variables are continuous. In this case you can use polyserial correlation instead.

/* test and the variable names below are placeholders */
proc corr data=test pearson polyserial;
  with categorical_var;
  var continuous_var1 continuous_var2;
run;

Note: The POLYSERIAL option is available in SAS 9.3 and later.

Polychoric Correlation:
If both variables are categorical (for example, ordinal, dummy, or indicator variables), i.e. neither variable is continuous, you can use polychoric correlation.

proc freq data=test;
  tables categorical_var1*categorical_var2 / plcorr;
run;

I hope it makes sense. :smile:



Q1) Is there a coefficient that helps to find a non-linear relationship between continuous variables?
Q2) How does one know how to transform the variables to get a better relationship with the output variable? Plotting is one way, but what if there are thousands of variables and it is not feasible to plot each of them?


Hi @melwin_jose,

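You can use the Pearson correlation coefficient to measure the relationship between continuous variables. It tells you how strongly the two variables are linearly related, so it can understate a relationship that is non-linear. For a monotonic but non-linear relationship, a rank-based coefficient such as Spearman's correlation is a common alternative.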
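Since the question is about SAS, here is a minimal sketch of requesting several coefficients at once with PROC CORR. Besides Pearson, it asks for Spearman's rank correlation (sensitive to monotonic relationships) and Hoeffding's D (sensitive to more general dependence). The dataset name test and the variables cont_var1 and cont_var2 are assumptions for illustration.

```sas
/* Placeholder dataset and variable names */
proc corr data=test pearson spearman hoeffding;
  var cont_var1 cont_var2;
run;
```

If the Spearman or Hoeffding measure is clearly stronger than the Pearson coefficient, that hints the relationship may not be linear.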

Instead of using thousands of variables, you can first compute the correlation of each variable with the output (target) variable and keep only the most strongly correlated ones. If all the variables are equally important, you can look at the skewness of each variable, which describes how asymmetric its distribution is. If the distribution of a variable is strongly skewed, consider transforming it.
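One way to do this screening without plotting is sketched below. It assumes a dataset test, a target variable named target, and candidate predictors named x1 through x1000; all of these names are hypothetical. The BEST= option asks PROC CORR to print only the strongest correlations for the WITH variable, ordered by absolute value.

```sas
/* Placeholder names: test, target, x1-x1000 */
proc corr data=test pearson spearman best=20 noprob nosimple;
  with target;      /* row variable: the output (target) variable */
  var x1-x1000;     /* column variables: candidate predictors */
run;
```

The cut-off of 20 is arbitrary and only for illustration; choose it based on how many predictors you are willing to carry forward.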

  • If skewness is less than −1 or greater than +1, the distribution is highly skewed.

  • If skewness is between −1 and −½ or between +½ and +1, the distribution is moderately skewed.

  • If skewness is between −½ and +½, the distribution is approximately symmetric.

So, based on the skewness value of each variable you can decide which variable to transform.
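In SAS, the skewness of many variables can be listed in one step with PROC MEANS, as in the sketch below. The dataset test and the variables x1 through x1000 are placeholder names. The log transform in the DATA step is just one common choice for a right-skewed, strictly positive variable, not the only option.

```sas
/* Skewness of every candidate variable (placeholder names) */
proc means data=test n mean skewness;
  var x1-x1000;
run;

/* Example transform for one variable flagged as highly right-skewed */
data test_transformed;
  set test;
  /* log is only defined for positive values; other rows stay missing */
  if x1 > 0 then log_x1 = log(x1);
run;
```

After transforming, you can rerun PROC MEANS on test_transformed to check whether the skewness of log_x1 has moved closer to zero.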