Question on Logistic Regression



When breaking down a continuous independent variable, e.g. Age, into buckets, is it advisable to include all the age buckets in the model? Or is it advisable to leave the reference age bucket out of the model?


What do you mean by reference age bucket?

Ideally, if you have age from 20 to 80 years as a continuous variable and, let's say, you make 6 bands of 10 years each (20-30, 30-40, etc.), then you should include all of them in the model.

I am not sure what you mean by leaving out the reference age bucket.


Hi arihant,

Create dummy variables for Age

GermanCredit$Age_c1 <- ifelse(GermanCredit$Age <=26,1,0)
GermanCredit$Age_c2 <- ifelse(GermanCredit$Age >26 & GermanCredit$Age <=33,1,0)

Please explain: in this scenario, on what basis are we choosing the cut-offs for creating the dummy variables for age?



Do you want to bin the age variable, or something else? Please elaborate so that I can help you.



I was going through logistic regression, and after data exploration the age variable was split as above. Now I want to understand: for age having many categories, how are we choosing the number of levels for creating dummy variables?



From what I understand, you have already binned the age into categories, and now you want to figure out how many dummy variables to create?
If yes, then you need to create (total number of categories - 1) dummy variables; the category left out is the base (reference) category.

Also, to understand which categories to keep and which to merge with the base category, check the significance of each dummy variable once you fit the logistic regression model. Categories that are not statistically significant, i.e. whose p-value is not near 0, can be binned together with the base reference category.

Let me know if you need further assistance.
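As a rough illustration of the "categories - 1" rule, here is a minimal sketch in R. The ages and band edges below are made up for the example (they are not taken from the GermanCredit file); it only shows that a factor with 4 bands yields 3 dummy columns once the reference level is dropped.

```r
# Hypothetical toy data; the ages and band edges are made up for illustration.
age <- c(22, 25, 31, 38, 44, 51, 29, 35)

# Bin age into 4 bands; right = TRUE gives intervals of the form (a, b].
age_band <- cut(age, breaks = c(18, 26, 33, 40, 60), right = TRUE)

# model.matrix() adds an intercept and, with the default treatment contrasts,
# drops the first level as the reference, so 4 bands give 3 dummy columns.
X <- model.matrix(~ age_band)

print(levels(age_band))  # the 4 bands
print(colnames(X))       # intercept plus 3 dummy columns
```

When a factor is passed to glm() directly, R builds these dummy columns itself, so creating them by hand is mainly useful when you want explicit control over the coding.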


Hi Devashish,

Thanks a ton for the response.

19-23
23-26
26-28
30-33
33-36
36-39
39-45
45-52

I have the above categories of the age variable.

To split them into dummy variables we need n - 1 levels,

but the code creates only two dummy variables:

GermanCredit$Age_c1 <- ifelse(GermanCredit$Age <=26,1,0)
GermanCredit$Age_c2 <- ifelse(GermanCredit$Age >26 & GermanCredit$Age <=33,1,0)

How are we deciding here to go with 2 dummy variables?
On what basis do we conclude that Age <= 26 is C1

and Age > 26 & GermanCredit$Age <= 33 is C2?

Please let me know.



Hi There,

It seems you are just copying code without looking at the overall data. Since I haven’t looked at the data either, I cannot be sure of the reasons, but my guess is as follows.

The code is basically converting the age variable into three categories. So visualize a table with three categories C1, C2 and C3, which correspond to ages up to 26 years, ages above 26 and up to 33 years, and ages above 33 years.

The above piece of code specifies the coding of the dummy variables (Age_c1, Age_c2). So the code translates to:
Age up to 26 is coded as 10
Age above 26 and up to 33 is coded as 01
Age above 33 is coded as 00 (the reference category)

Thus, as you can see above, three levels can be represented by 2 dummy variables.
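To check the coding directly, one can apply the same two ifelse() conditions to a handful of ages. The ages here are made up for illustration; only the two conditions come from the code quoted in this thread.

```r
# Made-up ages chosen to sit on and around the two cut-offs (26 and 33).
age <- c(24, 26, 30, 33, 40)

# Same conditions as the quoted code, applied to a plain vector.
Age_c1 <- ifelse(age <= 26, 1, 0)
Age_c2 <- ifelse(age > 26 & age <= 33, 1, 0)

# Each row shows the (Age_c1, Age_c2) pair for one age.
coding <- data.frame(age, Age_c1, Age_c2)
print(coding)
```

Ages above 33 get 0 in both columns, which is what makes that group the reference category.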

Regarding the question of why these 3 categories were selected from the 10+ age categories:
The answer lies in the data set, and you will need to examine it closely. Most likely you will see that these three categories are the ones that differ significantly with respect to whatever target variable you are modelling.
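One way to examine this, sketched here on simulated data rather than the actual GermanCredit file, is to compare the outcome rate across candidate age bands and then look at the p-values of the band dummies in a logistic regression. The variable names (bad, age_band), the band edges and the simulated default rates are all assumptions made for the example.

```r
set.seed(1)

# Simulated data: 500 applicants with ages 19 to 75 (not the real data set).
n <- 500
age <- sample(19:75, n, replace = TRUE)

# Made-up outcome: younger applicants are given a higher chance of default.
p_bad <- ifelse(age <= 26, 0.45, ifelse(age <= 33, 0.35, 0.25))
bad <- rbinom(n, 1, p_bad)

# Candidate bands, with the oldest group set as the reference level.
age_band <- cut(age, breaks = c(18, 26, 33, 75))
age_band <- relevel(age_band, ref = "(33,75]")

# Observed default rate per band.
print(tapply(bad, age_band, mean))

# Logistic regression on the band dummies; the last column of the coefficient
# table, Pr(>|z|), is the p-value of each band relative to the reference.
fit <- glm(bad ~ age_band, family = binomial)
print(summary(fit)$coefficients)
```

Bands whose p-values are large are candidates for merging with a neighbouring band or with the reference, and repeating this with different cut-offs is one plausible way someone could have arrived at values like 26 and 33.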

To understand the statistical significance of a variable, you should really take a course on statistics.

Hope that helps, if it does, then please click on the heart shape icon to like my answer :wink:


Hi Devashish,

Please take a look at the example below and you will understand the problem.
I am attaching the data file.
My question is: why are we taking 26 as the cut-off for creating the dummy variable?


GermanCredit.csv (130.7 KB)