Which type of regression to use?

#1

Hi there,

My main goal is to estimate and understand the causal effect of a few (>10) independent variables on 5 dependent variables. Most of my independent variables are categorical (with more than 2 categories), and the dependent variables are binary/binomial.

Which type of regression would be best suited to estimating these causal effects?

Regards
Akshay

#2

Hey,

To answer this question you would need more information, such as the statistical distribution of each feature and the features' correlations with each other and with the target.

#3

Thanks for taking the time to respond.

I have the statistical distributions. How do I validate them so that I can go on to choose a model for causal inference (CI)? For now, I want to find the individual CI for each (categorical) variable. Only if I can do this efficiently would I proceed to multi-variable causal inference.

Regards
Akshay

#4

Hi,

If I get your question right, you are trying to map categorical variables to a binary output.

1. Logistic regression would be fine. For that matter, any classification algorithm would be fine. You need to convert your categorical variables into flags.
2. Flags: Let us say your categorical variable A has 3 categories (a1, a2, and a3) and categorical variable B has 2 categories (b1 and b2). Make 2 flags for A (say, for a1 and a2) and 1 flag for B (say, for b1).
3. You will not get the combined effect of all the bins of a variable. You will get the impact of each individual flag, measured relative to the category that was left without a flag.
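
The flag construction in step 2 can be sketched in plain Python as follows. This is only an illustration: the helper name make_flags, the column naming scheme, and the three sample rows are assumptions, not something from the thread.

```python
def make_flags(rows, categories):
    """Turn categorical values into 0/1 flags.

    The last level of each variable gets no flag and so acts as the
    reference category (a3 for A, b2 for B in the sample below).
    """
    flagged = []
    for row in rows:
        flags = {}
        for var, levels in categories.items():
            # One flag per level except the last one.
            for level in levels[:-1]:
                flags[f"{var}_{level}"] = int(row[var] == level)
        flagged.append(flags)
    return flagged


# Hypothetical sample data using the variable names from the post.
categories = {"A": ["a1", "a2", "a3"], "B": ["b1", "b2"]}
rows = [
    {"A": "a1", "B": "b2"},
    {"A": "a3", "B": "b1"},
    {"A": "a2", "B": "b1"},
]

flags = make_flags(rows, categories)
for f in flags:
    print(f)
```

A row with A = a3 has both A flags set to 0, which is why a coefficient fitted on A_a1 or A_a2 reads as a contrast against a3 rather than as an effect of A as a whole.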

#5

Hi there,
Thanks for taking the time to respond. You got my question right.

I have created the flags. Although we get the effect of each flag relative to the others, ultimately I (or anyone) would need a holistic view of the effect of the categorical variable as a whole. If this is feasible, then there is more to do: measuring the effect of two different categorical variables combined.

I am researching this and would be glad if you could shed some light on it.

Regards
Akshay

#6

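What you are looking for is called information value (IV).
IV of a bin = ln(goods / bads) * (goods - bads), and the IV of a variable is the sum over its bins.

Let us say you have a variable A with 3 bins: a1, a2, and a3.
a1 has 2 goods and 3 bads, so IV = 0.41.
a2 has 12 goods and 11 bads, so IV = 0.09.
a3 has 1 good and 2 bads, so IV = 0.69.
The IV of this variable is 0.41 + 0.09 + 0.69 = 1.19.

Raw counts are used here to keep the arithmetic short. The usual definition uses each bin's share of all goods and its share of all bads in place of the counts.

The higher the IV, the better the variable separates goods from bads.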
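
The arithmetic in this post can be checked with a short sketch. It computes the per-bin values in two ways: on raw counts, as in the example above, and on each bin's share of all goods and all bads, which is the more usual definition of information value. The function names are illustrative assumptions, and the code assumes no bin has zero goods or zero bads (the logarithm would fail otherwise).

```python
import math


def iv_from_counts(bins):
    """Per-bin value on raw counts: ln(g / b) * (g - b)."""
    return {name: math.log(g / b) * (g - b) for name, (g, b) in bins.items()}


def iv_from_shares(bins):
    """Per-bin value on shares: (pg - pb) * ln(pg / pb), where pg and pb are
    the bin's share of all goods and of all bads."""
    total_goods = sum(g for g, _ in bins.values())
    total_bads = sum(b for _, b in bins.values())
    result = {}
    for name, (g, b) in bins.items():
        pg = g / total_goods
        pb = b / total_bads
        result[name] = (pg - pb) * math.log(pg / pb)
    return result


# (goods, bads) per bin, taken from the example in the post.
bins = {"a1": (2, 3), "a2": (12, 11), "a3": (1, 2)}

by_counts = iv_from_counts(bins)
by_shares = iv_from_shares(bins)

print({k: round(v, 2) for k, v in by_counts.items()}, round(sum(by_counts.values()), 2))
print({k: round(v, 3) for k, v in by_shares.items()}, round(sum(by_shares.values()), 3))
```

The count-based total matches the 1.19 in the post. The share-based total is much smaller for the same data, so the two versions should not be compared against the same rule-of-thumb thresholds.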

#7

If I understand your problem correctly, it is a multi-class classification problem.
Here is what I suggest:

1. The target variables are binary. Reconstruct a single multi-class target variable from them (the reverse of get_dummies).
2. Convert the independent variables to dummies.
3. Apply logistic regression, naive Bayes, SVM, and random forest, and check the predictions.
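
Step 1 can be sketched in plain Python as below. The helper name from_flags and the target column names y1 to y3 are hypothetical. The sketch assumes the binary targets are mutually exclusive, that is, exactly one of them is 1 in every row; if several targets can be 1 at once, they do not collapse into a single class label, and fitting one model per target is the safer reading. In pandas, DataFrame.idxmax(axis=1) on the flag columns gives a similar result under the same assumption.

```python
def from_flags(rows, columns):
    """Collapse mutually exclusive 0/1 columns into one class label per row."""
    labels = []
    for row in rows:
        active = [col for col in columns if row[col] == 1]
        # The reversal is only well defined when exactly one flag is set.
        if len(active) != 1:
            raise ValueError(f"expected exactly one active flag, got {active}")
        labels.append(active[0])
    return labels


# Hypothetical binary target columns; the names are not from the thread.
columns = ["y1", "y2", "y3"]
rows = [
    {"y1": 0, "y2": 1, "y3": 0},
    {"y1": 1, "y2": 0, "y3": 0},
    {"y1": 0, "y2": 0, "y3": 1},
]

labels = from_flags(rows, columns)
print(labels)
```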

#8

Yes, I have done this. So far it has been the simplest way to learn something about the variables I have. To make sure I don't rely on it completely, I am looking at other ways to verify what I got from the weight of evidence (WOE) and information value (IV) calculations.

Thanks
Akshay

#9

Thanks, Srilatha.

May I know how to carry out step 1? I am currently working with a random forest to understand the importance of the variables, just to verify what I got with WOE and IV (see my reply above).

What you suggested in step 3 would definitely be helpful; I just have to go step by step.

#10

Hi Akshay,

Let me know if you still need any help with these.

#11

Thanks Aranya for revisiting the post.

Actually, I would like to revisit and continue this discussion after a while. My main goal when I posted this query was to understand causal inference rather than to predict future values (which is what pretty much everyone works on these days). Then I realized that the concept of causal inference, and the use of algorithms to quantify it, is somewhat dicey. Would some textbooks help strengthen the concepts, or would continuing this discussion lead me somewhere substantial?