Why is logistic regression required? Why not linear regression?

#1

Hi Guys,

I’ve been reading about logistic regression and the first obvious question is why not linear regression.

On further research, I found that one of the key reasons is that linear regression is unbounded and we need probabilities which should range between 0 and 1. But this gives rise to 2 more questions:

1. Why not cap the values at 0 and 1?
2. Why the logistic function and nothing else?
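
To make the unboundedness concrete, here is a minimal sketch that fits an ordinary least-squares line to a 0/1 outcome and then caps the fitted values. The toy data and the `fit_line` helper are assumptions made up for illustration; they are not from any particular dataset or library.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical toy data: x is some score, y is the 0/1 class label.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 0, 1, 1, 1, 1]

intercept, slope = fit_line(xs, ys)
fitted = [intercept + slope * x for x in xs]

# Capping forces the values back into [0, 1].
capped = [min(1.0, max(0.0, value)) for value in fitted]

print("fitted at smallest/largest x:", fitted[0], fitted[-1])
print("capped at smallest/largest x:", capped[0], capped[-1])
```

On this toy data the fitted line drops below 0 at the smallest x and rises above 1 at the largest x. Capping brings those values back into range, but it assigns exactly 0 or 1 to the extreme points, and the capped curve is no longer the least-squares fit.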

Hoping for a great discussion ahead.

Regards,
Aarshay

#2

Hi @Aarshay,

Linear regression is used to predict numeric values, whereas logistic regression is generally used for classification, e.g. default/non-default, fraud/no fraud, etc.
There are structural differences in how the two operate.
Linear regression also rests on some assumptions; when they do not hold, logistic regression can be used for the same problem.
Linear regression assigns a value to the response variable, whereas logistic regression assigns a probability.
http://statisticalhorizons.com/linear-vs-logistic

Hope this helps!!

#3

Hi @shuvayan:

Thanks for the response.

I think you misunderstood my question. I understand the use case.

But my question is: why not use linear regression for classification problems as well? We have to convert all inputs to numeric for logistic regression anyway. So why not just run a linear regression and then interpret the output the same way as in logistic regression?

What’s the real essence behind using a logistic function? Why only the logistic and nothing else?

I’ll go through the link as well.

Thanks,
Aarshay

#4

One example of why LinReg cannot always be substituted for LogReg is provided in the video, starting at around 2:45.

#5

Hi @Aarshay

To my understanding, these 3 assumptions of linear regression kinda lead to the necessity of logistic regression for predicting a categorical outcome!

1. The predictor (independent variable) can be categorical (yes/no etc.) or continuous.
2. The outcome (dependent variable) MUST be continuous.
3. They must have a linear relationship: X predicts the outcome Y, either negatively or positively.

If you plot x and y when Y is categorical, say male or female, would it show a linear relationship? To quote Andy Field: “Logistic Regression is based on this principle: it expresses the multiple linear regression equation in logarithmic terms (called the logit) and thus overcomes the problem of violating the assumption of Linearity”.

#6

Hi guys,

Thanks for the interesting responses.

@anon - the video helped.

@swatchat25 - violation of the linearity assumption makes sense. Coming from your point #2, I think another important issue is that linear regression assumes the outcome to be on a ratio scale, so how the categories are coded will have an impact. The results would change if I code the categories as “0 and 1” or “-1 and 1”. I think this won’t be the case in logistic regression, as it is based on maximum likelihood estimation: what matters is the probability of the event, not how it is numerically represented.

Thanks guys. It’s making more sense now. I’ll be happy to discuss further if you have any more points.

Merry Christmas and a Happy New Year!

Cheers,
Aarshay

#7

Hey @Aarshay ,

If you have 1000 observations and you are dividing them into two categories, say Male (0) and Female (1), all the observations relating to men will be categorized as 0 and all observations relating to women will be categorized as 1. There is no escaping the categories: even if you change the coding to -1 and 1, you are still “categorizing”, so to speak. If your dependent variable is a category (yes/no, male/female etc.), you have to try to get around the violation. Andy Field’s book is a great read… he is funny and makes everything simple. It’s a big book, try it if you want to.

Best,
Swathi

#8

Hi @swatchat25,

I’ll have a look at Andy Field’s book for sure.

I understand your point about coding values numerically. I was just trying to say that the particular numeric coding (i.e. 0,1 or -1,1) will affect the results in the case of linear regression but not logistic regression. So the linear regression output will change with a different coding, but the logistic regression output will stay the same. So, this is another advantage of using logistic regression.
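
A quick way to check how the coding affects a linear fit is to run the same least-squares regression on both codings. This is only a sketch on hypothetical toy data, with a made-up `fit_line` helper, so treat it as an illustration rather than a general proof.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical toy data, coded two ways.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys_01 = [0, 0, 1, 0, 1, 0, 1, 1]        # classes coded as 0 / 1
ys_pm = [2 * y - 1 for y in ys_01]      # same classes coded as -1 / 1

a_01, b_01 = fit_line(xs, ys_01)
a_pm, b_pm = fit_line(xs, ys_pm)

# Classify at the midpoint between the two codes: 0.5 for 0/1, 0 for -1/1.
labels_01 = [a_01 + b_01 * x > 0.5 for x in xs]
labels_pm = [a_pm + b_pm * x > 0.0 for x in xs]

print("0/1 coding:  intercept, slope =", a_01, b_01)
print("-1/1 coding: intercept, slope =", a_pm, b_pm)
print("same class assignments:", labels_01 == labels_pm)
```

In this sketch the coefficients do change with the coding, but only by the same rescaling that was applied to the labels (y' = 2y - 1), so the class assignments at the midpoint threshold come out the same. What changes is the scale on which the linear output has to be interpreted.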

Cheers,
Aarshay

#9

Hi @Aarshay,

Just to chime in here with a quote from Hosmer and Lemeshow’s Applied Logistic Regression, "when the outcome variable is dichotomous (i.e. male or female):

1. The conditional mean of the regression equation must be formulated to be bounded between 0 and 1 (the ratio scale you mentioned).
2. The binomial, not the normal, distribution describes the distribution of the errors.
3. The principles that guide an analysis using linear regression will also guide us in logistic regression."

The way that logistic regression satisfies the first condition is pretty cool (…or at least I think so). It takes the linear regression model, Y = I + B*X, where the parameters can range between negative and positive infinity, and exponentiates it. The parameters can still take any value, but the result is now always positive (because, as you know, when you exponentiate something, the result is always positive); this quantity is the odds of the event. It then converts the odds into a fraction, odds / (1 + odds), which is constrained to lie between 0 and 1 (and so works great as a probability estimate).
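
The exponentiate-then-divide step described above can be sketched in a few lines. The intercept and slope values are hypothetical, chosen only for illustration, and the function names are made up for this sketch.

```python
import math


def odds(intercept, slope, x):
    """Exponentiate the linear predictor I + B*X; the result is always positive."""
    return math.exp(intercept + slope * x)


def probability(intercept, slope, x):
    """Turn the odds into a fraction that lies strictly between 0 and 1."""
    o = odds(intercept, slope, x)
    return o / (1.0 + o)


def logit(p):
    """Inverse step: log-odds of a probability, back on the linear scale."""
    return math.log(p / (1.0 - p))


intercept, slope = -3.0, 0.8  # hypothetical parameter values

for x in [-10, 0, 3.75, 10]:
    p = probability(intercept, slope, x)
    print(x, round(p, 4), round(logit(p), 4))
```

Taking the logit of the resulting probability recovers the linear predictor, which is why the model is linear on the log-odds scale even though it is not linear in the probability itself.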

The second condition makes sense with a dichotomous outcome since the error can only take two values (ya either got it right, or ya didn’t). So the binomial distribution fits the task better for dichotomous classification.
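
The two-valued error can be made concrete: for a 0/1 outcome with P(y = 1) = p, the residual y - p is either 1 - p or -p. The sketch below (function names are made up for illustration) computes the mean and variance of that residual; the variance works out to p(1 - p), so it changes with p rather than staying constant as the normal-error assumption of linear regression requires.

```python
def residual_distribution(p):
    """Possible residuals y - p for a 0/1 outcome, paired with their probabilities."""
    return [(1.0 - p, p), (-p, 1.0 - p)]


def mean_and_variance(p):
    """Mean and variance of the residual for a given success probability p."""
    pairs = residual_distribution(p)
    mean = sum(residual * weight for residual, weight in pairs)
    variance = sum((residual - mean) ** 2 * weight for residual, weight in pairs)
    return mean, variance


for p in [0.1, 0.5, 0.9]:
    mean, variance = mean_and_variance(p)
    print(p, round(mean, 6), round(variance, 6))
```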

For the third condition, both linear and logistic regression can be seen as choosing parameters by maximum likelihood: for linear regression with normally distributed errors this reduces to least squares, whereas logistic regression maximizes a binomial likelihood function directly.
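
As a rough illustration of the likelihood being maximized, the sketch below writes out the log-likelihood for 0/1 labels and climbs it with plain gradient ascent. The toy data, learning rate, and step count are assumptions chosen for the example; real software typically uses a faster method such as iteratively reweighted least squares.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(b0, b1, xs, ys):
    """Log-likelihood of 0/1 labels under p = sigmoid(b0 + b1 * x)."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(b0 + b1 * x)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


def fit_logistic(xs, ys, learning_rate=0.05, steps=5000):
    """Gradient ascent on the average log-likelihood (illustrative only)."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # y - p is the gradient contribution of each observation.
        errors = [y - sigmoid(b0 + b1 * x) for x, y in zip(xs, ys)]
        b0 += learning_rate * sum(errors) / n
        b1 += learning_rate * sum(e * x for e, x in zip(errors, xs)) / n
    return b0, b1


# Hypothetical toy data with overlapping classes.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

b0, b1 = fit_logistic(xs, ys)
start_ll = log_likelihood(0.0, 0.0, xs, ys)
fitted_ll = log_likelihood(b0, b1, xs, ys)
print("log-likelihood at start:", start_ll)
print("log-likelihood after fitting:", fitted_ll)
```

The classes in the toy data overlap on purpose: with perfectly separable data the likelihood keeps increasing as the slope grows, so the maximum likelihood estimate does not settle at a finite value.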

I hope this helps and that you check out Applied Logistic Regression by Hosmer and Lemeshow. I really got a lot out of the book.

Thanks!
Nathan

#10

Prof Lemeshow has (had?) a couple of courses on Coursera on applied regression analysis. The first time 'round I felt underprepared to tackle it. Now, with a bit of basic stats under my belt, I wish I could take it again, however no starting date in sight (yet).

#11

Hi @nblack,

Thanks a lot for your response. It makes much more sense now.

I will definitely refer to the logistic regression resource by Hosmer and Lemeshow.

@anon - I just checked and the course archives are still there. Unless you’re keen on getting a certificate, you can go through the material anytime.

Cheers,
Aarshay

#12

@Aarshay, thanks. I don’t see the archive, and that’s perhaps because I was not enrolled in the previous run. In any case, one reason I like MOOCs is that, more than any certification, the deadlines help me stay on course (no pun intended).

#13

Hi @anon,

Probably that’s the reason.

I generally prefer the archived ones so that I can finish a course in a week and focus on the parts I need. But it depends on the situation.

Cheers,
Aarshay

#14

Just got word that new sessions for both courses are coming up.

Feb 22 - Applied Regression Analysis - https://www.coursera.org/course/appliedregression
Apr 4 - Applied Logistic Regression - https://www.coursera.org/course/logisticregression

I just hope that they use the older, user-friendly Coursera platform.

#15

Thanks @anon!

I’ll check them out for sure…