IV Categorical Variable For Logistic Regression



Hi,

I have worked through logistic regression and covered most of the evaluation toolkit: the ROC curve for choosing a threshold, the confusion matrix, AIC, AUC, overall accuracy, sensitivity, specificity, precision and recall. Now I am trying to apply logistic regression to one of my real-life problems, but almost all the variables in the data set are categorical. I know I can use dummy variables. The problem is that dummy coding means encoding every category, and some variables have close to 40 categories while others have more than 100.

Compalint_Status         Whether it is "Open","Closed","Withdrawn"
Complaint_SUb_Status     "Progress","Resolved","Satisified"
Complain_Owner           ID of the person handling (assigned to) this complaint
Business_Unit            Business unit the Complaint_Owner belongs to
Region                   Region where the complaint was raised
Area                     Area where the complaint was raised
AGM                      Area General Manager for this area
Store                    Name of the store
Product                  Product the user is complaining about (30 different products)
Sub_Product              Sub-product the user is complaining about (120 different sub-products)
Complaint_Created        Date the complaint was lodged
Complaint_Closed         Date the complaint was closed
Source                   Source of the complaint, e.g. "Local State & Fed MP","Fault Management",
                         "State & Federal Govt","XXX Shops","Retail  Channel","Field Staff","NA","BillPay"
                         (there are almost 60 sources)
Complaint_Level          Escalation level of the complaint: "Level 0","Level 1","Level 2","Level 3","Level 4"
SR_Days                  Number of days the complaint has been open
Root_Cause1              Main reason for raising the complaint, e.g. "Product Features"
                         (there are almost 20 values of Root_Cause1)
Root_Cause2              Sub-reason for raising the complaint, e.g. "Data Speed/Connection Issues"
                         (there are almost 200 values of Root_Cause2)
Owned_Entity             Whether this SR belongs to this store: "Owned" if yes, otherwise "Not Owned"
26+ Days                 1 if SR_Days > 26, else 0 <--- this is the DV I want to predict, either at the time
                         the SR is lodged or after it has been in progress for some days

My question: with the data set above, is logistic regression appropriate for predicting whether a new complaint will take more than 26 days to resolve (SR_Days > 26)?

Or is it better to go with a decision tree? I am familiar with decision trees as well, but how do I evaluate a decision tree model?

Please correct me if I am wrong: does a decision tree use the same model evaluation techniques as logistic regression, e.g. confusion matrix, ROC curve for finding a threshold, AUC, overall accuracy, sensitivity and specificity?

Can I use logistic regression without encoding the categorical variables as dummy variables?

Thanks in advance



Hi @Blackberry

Logistic regression can work with many variables (thousands), just as a tree can. The point is not the number of dummies. First, for the date variables, do you really need day-level detail? If the dates are recorded by day, the year or the week could be enough.
Second, do all the levels of your variables have the same proportion? I guess not, so why not group the levels that have few observations into a single "rare" level? This will reduce the number of dummies. And if there are still too many, try regularisation (an L1 penalty, for example), which will reduce the number of variables used.
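
As a rough illustration of grouping rare levels, here is a minimal pandas sketch. The helper name `group_rare_levels`, the `min_count` threshold, the "Other" label and the toy values are all assumptions made for the example, not part of the original data set.

```python
import pandas as pd


def group_rare_levels(series, min_count=2, other_label="Other"):
    """Replace levels seen fewer than min_count times with one catch-all label."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # Keep the value where it is NOT rare, otherwise substitute other_label.
    return series.where(~series.isin(rare), other_label)


# Toy values standing in for a high-cardinality column such as Sub_Product.
sub_product = pd.Series(["A", "A", "A", "B", "B", "C", "D", "E"], name="Sub_Product")

grouped = group_rare_levels(sub_product, min_count=2)

# Compare how many dummy columns each version would produce.
dummies_before = pd.get_dummies(sub_product).shape[1]
dummies_after = pd.get_dummies(grouped).shape[1]
print(dummies_before, dummies_after)
```

On real data the threshold would more likely be a share of the rows (say, levels below 1% of observations) than a fixed count; that choice is a judgment call and worth checking against validation performance.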

Best regards



Hi Lesaffrea,

Thanks for your suggestion, I will adopt it next time. This time I have already moved on to a decision tree. So far the model I built with a decision tree gives me 72% accuracy, and I am trying to improve it further. Any suggestions in this regard would be appreciated.




You can try one-hot encoding with the LabelBinarizer class; sklearn handles these without trouble. And don't worry about the number of classes created: I've worked on a dataset with 650 different categories and it performed well!
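
For anyone who wants to try this, below is a minimal sketch of one-hot encoding followed by a decision tree, scored with a confusion matrix and AUC (the same metrics used for logistic regression apply to any classifier that outputs probabilities). Note that it swaps in `OneHotEncoder` rather than `LabelBinarizer`: scikit-learn documents `LabelBinarizer` for target labels, while `OneHotEncoder` is the class meant for feature columns. The data are synthetic and the link between `Complaint_Level` and the target is invented purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 400

# Synthetic stand-in for the complaint data (assumed values, not the real set).
df = pd.DataFrame({
    "Product": rng.choice(["P1", "P2", "P3"], size=n),
    "Complaint_Level": rng.choice(["Level 0", "Level 1", "Level 2"], size=n),
})

# Hypothetical target: higher escalation levels exceed 26 days more often.
p_late = df["Complaint_Level"].map({"Level 0": 0.1, "Level 1": 0.5, "Level 2": 0.9})
y = (rng.random(n) < p_late).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.25, random_state=0, stratify=y
)

# One-hot encode the categorical columns; unseen levels at predict time
# are encoded as all zeros instead of raising an error.
encoder = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), ["Product", "Complaint_Level"])]
)
model = Pipeline([
    ("encode", encoder),
    ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
])
model.fit(X_train, y_train)

# Predicted probability of the positive class (more than 26 days).
proba = model.predict_proba(X_test)[:, 1]
pred = (proba >= 0.5).astype(int)  # 0.5 is an arbitrary threshold here

cm = confusion_matrix(y_test, pred, labels=[0, 1])
auc = roc_auc_score(y_test, proba)
print(cm)
print(round(auc, 3))
```

Keeping the encoder inside the pipeline means it is fitted on the training split only, so the test-set scores are not contaminated by levels seen only in the test data.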