How to handle categorical variables in Logistic Regression?



Dear all,

I'm new to the analytics field.

I have a problem with logistic regression: I have a few categorical predictor variables in my data. Should I create dummy variables for the categorical variables (i.e. job, month, education, etc.)?

After performing logistic regression on the dataset, I inferred that I need to drop a few variables (e.g. jobretired, contacttelephone, etc.) to get a better model. Please suggest whether I'm going in the right direction.


Hi Amar,

You can create dummy variables in the following way -

Suppose your data column job has 3 levels, i.e. blue-collar, entrepreneur and management. You can create dummy variables as -

Columns       Var1 Var2 Var3
blue-collar   1    0    0
entrepreneur  0    1    0
management    0    0    1

After getting these dummy variables, you can run your model and eliminate the variables which are less significant (p > 0.05).
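As a minimal sketch of the dummy-variable table above, here is how it could be done in Python with pandas (the thread itself may be working in R with `contrasts`; the data frame here is made up for illustration):

```python
import pandas as pd

# Hypothetical data with a 3-level 'job' column, mirroring the table above
df = pd.DataFrame({"job": ["blue-collar", "entrepreneur", "management",
                           "blue-collar"]})

# One 0/1 indicator (dummy) column per level of 'job'
dummies = pd.get_dummies(df["job"], prefix="job", dtype=int)
print(dummies)
```

Each row has exactly one 1, in the column matching its level, just as in the table above.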

Apart from that, you can create new variables which make more business sense (this process is called feature engineering). For example, keep orders and visits as regular variables, but Orders/Visits (conversion) is a new variable which you can create and use in your models to improve model performance.
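The Orders/Visits idea could be sketched like this in pandas (the `orders` and `visits` columns here are hypothetical, just to show the derived feature):

```python
import pandas as pd

# Hypothetical per-customer counts
df = pd.DataFrame({"orders": [2, 5, 0], "visits": [10, 20, 4]})

# Derived feature: conversion rate = orders per visit
df["conversion"] = df["orders"] / df["visits"]
print(df["conversion"].tolist())  # [0.2, 0.25, 0.0]
```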

Hope this helps.

Aayush Agrawal


Thank you @aayushmnit Aayush for the help. But I still have one more question: when I run this same model on test data, what I found after creating a confusion matrix is that the accuracy is 90%. So should I go for dummy variable creation to improve my accuracy?

And I'm getting an AUC of 0.90. My model performed excellently, but I'm worried this might be a case of overfitting. Please suggest.


Hi Amar,

I think when you say you checked performance on test data, you have already split your data into training and test datasets. If not, this is how you can be sure that your model is doing well -

  1. Before modelling, split your data into training and test datasets, with an 80-20 split respectively
  2. Train your model on the training dataset
  3. Make a confusion matrix and check model performance metrics on the training dataset; let's say accuracy is 90%
  4. Run the same model on the test dataset and make a confusion matrix again. If your accuracy drops by only 1-2%, you can be sure it's not an overfitting problem and your model is working fine. But if your model accuracy varies too much between the training and test datasets, then it's an overfitting problem
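The four steps above can be sketched with scikit-learn; the synthetic dataset here stands in for the thread's own data, which is not shown:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data in place of the real dataset
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# 1. 80-20 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# 2. Train on the training set
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# 3. Accuracy (and confusion matrix) on the training set
train_acc = accuracy_score(y_train, model.predict(X_train))

# 4. Accuracy on the held-out test set; a small train-test gap
#    suggests the model is not overfitting
test_acc = accuracy_score(y_test, model.predict(X_test))
print(confusion_matrix(y_test, model.predict(X_test)))
print(f"train={train_acc:.3f}  test={test_acc:.3f}  gap={train_acc - test_acc:.3f}")
```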

Hope this helps.

Aayush Agrawal


@aayushmnit: Thank you for the help. As you mentioned, I made a confusion matrix on the test set and found that the performance only dropped by 0.89%, so we can conclude the model is working well.

Thank you Aayush for helping me out.


@aayushmnit: Hi Aayush, can you please share some online materials on feature engineering?


Hi Amar,

Feature engineering means extracting the maximum information from your available features in order to improve your model's accuracy; this can be done by creating new features, imputing missing data, and so on. It's unlikely that you will find a book or comprehensive material on feature engineering, because it's mostly specific to a particular problem. But to appreciate the power of feature engineering, please refer to the link below -

This person has a series of blog posts showing how he participated in a Kaggle competition and used various algorithms and techniques to solve a particular problem.

Hope this helps.

Aayush Agrawal


Hi @Amar,

I'm working with the same dataset that @aayushmnit was working with.
I have a test dataset apart from the one I split from the training data. I have applied the contrasts function to handle the categorical variables in my training data.
Then I split the data into training and test in a 9:1 ratio. Now I need to test my model on the new test data. Do I need to apply the contrasts function on the new test set as well, and then predict the values?


Hi @jagdeesh135,

What you can do is -

  1. Merge your training and test datasets
  2. Apply all the feature engineering over the complete train-test dataset
  3. Split it back into training and test
  4. For validation, you can further split the training dataset in an 80:20 or 90:10 ratio

And to answer your question: yes, you have to apply the same manipulations you performed on your training dataset to the test dataset before prediction.
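A minimal pandas sketch of why merging before encoding matters (the frames are made up; the test set deliberately lacks the 'management' level, which would produce mismatched dummy columns if the two sets were encoded separately):

```python
import pandas as pd

# Hypothetical train/test frames; 'management' never appears in test
train = pd.DataFrame({"job": ["blue-collar", "entrepreneur", "management"]})
test = pd.DataFrame({"job": ["blue-collar", "entrepreneur"]})

# 1. Merge, remembering which rows came from where
combined = pd.concat([train, test], keys=["train", "test"])

# 2. Apply the encoding once over the combined data
encoded = pd.get_dummies(combined["job"], prefix="job", dtype=int)

# 3. Split back: both parts now share the exact same columns,
#    so the model trained on train_enc can predict on test_enc
train_enc = encoded.loc["train"]
test_enc = encoded.loc["test"]
print(list(test_enc.columns))
```

The test rows get a `job_management` column full of zeros, keeping the feature layout identical across training and prediction.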

Hope this helps.

Aayush Agrawal


Thank you @aayushmnit. Your suggestions gave me a clear picture :)