How do factors work in R glm



Hi Gurus,

As I am new to this tool and to data analytics in general, I just want to understand how factors work in R (or in a more general sense).
For example, I have this set of data:

"indicator"	"countrycode"	"distance"
0	US	0.1
0	US	0.18
0	US	0.21
0	US	0.19
1	US	0.2
1	US	0.21
0	GB	0.24
0	GB	0.23
0	GB	0.21
0	GB	0.22
1	GB	0.2
1	FR	0.1

and want to fit a logistic regression model to predict the indicator.
myFullLRModel = glm(indicator ~ countrycode + distance,, family=binomial)

I got the result below, which excludes a coefficient for FR, as I expected.

glm(formula = indicator ~ countrycode + distance, family = binomial, 
    data =

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.0474  -0.7889  -0.6405   0.3285   1.9414  

              Estimate Std. Error z value Pr(>|z|)
(Intercept)      15.97    3956.18   0.004    0.997
countrycodeGB   -20.88    3956.18  -0.005    0.996
countrycodeUS   -19.63    3956.18  -0.005    0.996
distance         15.92      28.81   0.552    0.581

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 15.276  on 11  degrees of freedom
Residual deviance: 12.256  on  8  degrees of freedom
AIC: 20.256

Number of Fisher Scoring iterations: 16

Now my question is: when I swapped the country values (GB to FR and FR to GB), I expected the coefficient for GB to be excluded and a coefficient for country code “FR” to be shown. But the results show otherwise (see below).

"indicator"	"countrycode"	"distance"
0	US	0.1
0	US	0.18
0	US	0.21
0	US	0.19
1	US	0.2
1	US	0.21
0	FR	0.24
0	FR	0.23
0	FR	0.21
0	FR	0.22
1	FR	0.2
1	GB	0.1

              Estimate Std. Error z value Pr(>|z|)
(Intercept)     -4.903      6.490  -0.755    0.450
countrycodeGB   20.877   3956.182   0.005    0.996
countrycodeUS    1.247      1.689   0.738    0.460
distance        15.916     28.809   0.552    0.581

Am I doing something wrong? Or, if this is the expected result, could you explain why? This is just for my further understanding and reference. Thanks.


In R's glm function, a categorical variable with k categories is coded as k-1 dummy variables. In your case countrycode has 3 categories, so you get 2 dummy variables (US and GB). The third one is left out because, with an intercept in the model, a full set of k dummy variables would be perfectly collinear with it. The omitted category becomes the reference level that the others are compared against.
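
To make the k-1 coding concrete, here is a minimal sketch using a small made-up factor (the four values are hypothetical, not the data from the question). It uses base R's model.matrix to show the columns glm would build for the factor, then checks that adding a dummy for the omitted level brings no new information:

```r
# Hypothetical toy factor with the same three country codes as in the question
countrycode <- factor(c("US", "GB", "FR", "US"))

# Design matrix that a formula like y ~ countrycode produces
X <- model.matrix(~ countrycode)
print(colnames(X))   # "(Intercept)" "countrycodeGB" "countrycodeUS"

# Add a dummy column for the omitted level (FR) by hand
X_full <- cbind(X, countrycodeFR = as.numeric(countrycode == "FR"))

# Four columns, but the rank is still 3: the extra column is redundant
print(ncol(X_full))
print(qr(X_full)$rank)
```

The rank stays at 3 because the FR dummy equals the intercept column minus the GB and US dummies, which is exactly the collinearity that dropping one level avoids.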

Thanks, I hope it helps!


Thanks. Now I understand the k-1 variables. I just want to understand the logic for which level gets dropped. As explained in my original post, in the first scenario “FR” was dropped, which from my understanding is because of singularity (please correct me if I’m mistaken). In the second scenario “FR” was dropped as well, which I didn’t expect, as I believe “FR” is more significant than “GB” in this case.


Well, if you go through some material on how and why dummy variables are created, you will see that significance has nothing to do with it; the model automatically takes the effect of every level into account. As for why “FR” was dropped: glm uses the first level of the factor as the reference level, and by default factor levels are sorted alphabetically, so “FR” comes first in both scenarios. Try replacing FR with ZR and see which one gets dropped.
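
As a quick check of the alphabetical ordering, the sketch below builds small illustrative factors (hypothetical values, not the full data set) and inspects their levels. It also uses base R's relevel, which is one way to pick a different reference level yourself:

```r
# Illustrative factor with the three country codes from the question
cc <- factor(c("US", "GB", "FR"))
levels(cc)                       # "FR" "GB" "US": FR comes first, so it is the baseline

# Replacing FR with ZR makes GB the first level instead
cc_zr <- factor(c("US", "GB", "ZR"))
levels(cc_zr)                    # "GB" "US" "ZR"

# relevel() moves a chosen level to the front, making it the baseline
cc_gb <- relevel(cc, ref = "GB")
levels(cc_gb)                    # "GB" "FR" "US"
colnames(model.matrix(~ cc_gb))  # "(Intercept)" "cc_gbFR" "cc_gbUS"
```

Using a releveled factor in the glm formula should then report coefficients for FR and US relative to GB.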




When your predictor is categorical, like “countrycode”, R automatically takes one level as the base and displays results for the others relative to it. Your result shows p-values for US and GB, and both are greater than 0.05, which means there is no statistically significant difference between the beta values of GB and FR, or between the beta values of US and FR. So this data gives no evidence that the country codes differ in their effect on your indicator, although with only 12 observations that does not prove the effects are the same.
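
To illustrate what "relative to the base" means, here is a minimal sketch on made-up data (hypothetical values, chosen so that every country has several observations, unlike the single-row level in the question). With only the factor in the model, the intercept should match the log-odds of the base level and each other coefficient should match the difference in log-odds from that base:

```r
# Hypothetical data: 4 observations per country, with 1, 2 and 3 successes
indicator   <- c(1, 0, 0, 0,  1, 1, 0, 0,  1, 1, 1, 0)
countrycode <- factor(rep(c("FR", "GB", "US"), each = 4))

fit <- glm(indicator ~ countrycode, family = binomial)

# Observed proportion of 1s in each country
p <- sapply(split(indicator, countrycode), mean)

# Intercept: log-odds of the base level (FR)
# Other coefficients: difference in log-odds from FR
expected <- c(qlogis(p[["FR"]]),
              qlogis(p[["GB"]]) - qlogis(p[["FR"]]),
              qlogis(p[["US"]]) - qlogis(p[["FR"]]))

# Side-by-side comparison of the fitted and hand-computed values
print(cbind(glm = unname(coef(fit)), by_hand = expected))
```

The p-value on each country row therefore tests that country against the base level only, not against the other countries.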

I hope this clears up your doubts.