I have studied logistic regression and covered most of the usual topics: using the ROC curve to choose a threshold, the confusion matrix, AIC, AUC, overall accuracy, sensitivity, specificity, precision and recall. Now I am trying to apply logistic regression to one of my real-life problems, but almost all of the variables in the data set are categorical. I know I can use dummy variables. The problem is that this would mean encoding variables that have close to 40 categories and, in some cases, more than 100.
The data set has the following fields:

Complaint_NO
Compalint_Status: whether the complaint is "Open", "Closed" or "Withdrawn"
Complaint_SUb_Status: "Progress", "Resolved", "Satisified"
Complain_Owner: ID of the person handling (assigned to) this complaint
Business_Unit: the business unit the Complaint_Owner belongs to
Region: the region where the complaint was raised
Area: the area where the complaint was raised
AGM: the Area General Manager for this area
Store: name of the store
Product: the product the user is complaining about (30 different products)
Sub_Product: the sub-product the user is complaining about (120 different sub-products)
Complaint_Created: date the complaint was lodged
Complaint_Closed: date the complaint was closed
Source: the source of the complaint, e.g. "Local State & Fed MP", "Fault Management", "State & Federal Govt", "XXX Shops", "Retail Channel", "Field Staff", "NA", "BillPay" (there are almost 60 sources)
Complaint_Level: the escalation level of the complaint: "Level 0", "Level 1", "Level 2", "Level 3", "Level 4"
SR_Days: number of days the complaint has been open
Root_Cause1: the main reason for raising the complaint, e.g. "Product Features" (there are almost 20 values)
Root_Cause2: the sub-reason for raising the complaint, e.g. "Data Speed/Connection Issues" (there are almost 200 values)
Owned_Entity: whether this SR belongs to this store: "Owned" if yes, otherwise "Not Owned"
26+ Days: 1 if SR_Days > 26, else 0. This is the DV I want to predict, either at the time the SR is lodged or after it has been in progress for some days.
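To make the size of the dummy-variable problem concrete, here is a rough count in Python. It is only a sketch: the level counts are the approximate figures from the field descriptions above, fields whose number of categories is not stated (Region, Area, AGM, Store and so on) are left out, and the helper name over_26_days is just illustrative.

```python
# Approximate number of levels per categorical field, taken from the
# descriptions above ("almost 60", "almost 20", "almost 200" are rounded).
levels = {
    "Product": 30,
    "Sub_Product": 120,
    "Source": 60,
    "Complaint_Level": 5,
    "Root_Cause1": 20,
    "Root_Cause2": 200,
}

# Dummy coding with one reference level needs (levels - 1) columns per field.
dummy_columns = {name: k - 1 for name, k in levels.items()}
total_dummy_columns = sum(dummy_columns.values())


def over_26_days(sr_days):
    """Illustrative helper: the DV is 1 if SR_Days > 26, else 0."""
    return 1 if sr_days > 26 else 0


print(total_dummy_columns)
print([over_26_days(d) for d in (3, 26, 27, 40)])
```

With these rounded figures the six fields alone would need more than 400 dummy columns, which is why I am unsure about this approach.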
My question: given the data set above, is logistic regression appropriate for predicting whether a new complaint will take more than 26 days to resolve (based on SR_Days)?
Or would a decision tree be a better choice? I am familiar with decision trees as well, but how do I evaluate a decision tree model?
Please correct me if I am wrong: does a decision tree use the same model evaluation techniques as logistic regression, e.g. the confusion matrix, the ROC curve for choosing a threshold, AUC, overall accuracy, sensitivity and specificity?
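As far as I understand, these measures only need the true labels and a score per complaint, so in principle they could be computed for any classifier. Here is a minimal sketch of what I mean, assuming the model (logistic regression or a tree) returns a score between 0 and 1 for each complaint. The labels and scores are made up, and the function names are my own.

```python
def confusion_counts(y_true, scores, threshold=0.5):
    """Return (tp, fp, tn, fn) after cutting the scores at the threshold."""
    tp = fp = tn = fn = 0
    for y, s in zip(y_true, scores):
        pred = 1 if s >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def metrics(y_true, scores, threshold=0.5):
    """Threshold-based measures; assumes each denominator is non-zero."""
    tp, fp, tn, fn = confusion_counts(y_true, scores, threshold)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),  # same as recall
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
    }


def auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly,
    counting ties as half. Assumes both classes are present."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up example: 1 = took more than 26 days, 0 = did not.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.2, 0.7]

print(confusion_counts(y_true, scores))
print(metrics(y_true, scores))
print(auc(y_true, scores))
```

Nothing in this code depends on which model produced the scores, which is why I suspect the same evaluation applies to both.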
Can I use logistic regression without encoding the categorical variables as dummy variables?
Thanks in advance