Methods for computing coefficients in linear regression




Generally the sum of squared errors is minimized to compute the regression coefficients. Why can't the absolute error, or say the sum of error^4, be minimized instead? In what ways does the least squares method beat the other options?



Regarding the use of bigger exponents, the answer is simple:

This is what happens when an outlier (actually, an influential point) is present in the data and SSE is used to compute the coefficients: the fitted line gets dragged toward it. Now imagine what would happen if we increased the emphasis on outliers even further. The regression line might actually end up nearly orthogonal to the real trend, in other words, as wrong as possible. So using bigger exponents is a big no-no.
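A quick sketch of this effect (my own toy example, not from any particular source): fit a line by minimizing the sum of |residual|^p for p = 2 and p = 4 on data with one influential point, and compare how far each fitted slope drifts from the true slope of 2.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 20)
y = 2.0 * x + rng.normal(0, 0.5, size=x.size)  # true slope is 2.0
y[-1] += 40.0  # one influential point at the right edge

def fit(power):
    """Minimize sum |y - (a*x + b)|**power over (a, b)."""
    loss = lambda p: np.sum(np.abs(y - (p[0] * x + p[1])) ** power)
    # start from the ordinary least-squares solution; Nelder-Mead is
    # insensitive to the huge scale of the quartic loss
    return minimize(loss, x0=np.polyfit(x, y, 1), method="Nelder-Mead").x

a2, _ = fit(2)
a4, _ = fit(4)
# the quartic fit is dragged further from the true slope by the outlier
print(abs(a2 - 2.0), abs(a4 - 2.0))
```

Both losses are pulled by the outlier, but the fourth power amplifies its residual so much that the p = 4 slope drifts noticeably further from the truth.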

Absolute error is actually used, just not as often as squared error. The key difference is that it gives the same weight to every point in the data, while squared error puts more emphasis on points that are distant from the regression line. Which one you should use depends on the kind of data and what you want to achieve with it.

The only advantage of squared error over absolute error that I know of is that squared error is differentiable everywhere, while absolute error is not differentiable at 0 (the function is continuous, but its derivative jumps from -1 to +1 there). So optimization techniques for squared error are less complex than the ones for absolute error, which translates to being faster.
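One concrete payoff of that differentiability: setting the gradient of the squared error to zero yields the closed-form normal equations, so the coefficients drop out of a single linear solve, whereas an L1 fit has no closed form and must be found iteratively. A minimal sketch (my own example data):

```python
import numpy as np

# squared error is smooth, so zeroing its gradient gives the normal
# equations  (X^T X) beta = X^T y  -- a single linear solve
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(50), rng.normal(size=50)])  # intercept + one feature
beta_true = np.array([1.0, 2.0])
y = X @ beta_true + rng.normal(0, 0.1, size=50)

beta = np.linalg.solve(X.T @ X, X.T @ y)

# absolute error has no such closed form: d|r|/dr jumps at r = 0, so L1
# fits are computed iteratively (e.g. via linear programming)
print(beta)
```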


Thank you! Your answer amused me :slight_smile: . Can you suggest any material/links where I can look at the math behind the higher powers and absolute error?


Elements of Statistical Learning is a good place to start. It introduces the L1 norm in Chapter 2, but most of its coverage of absolute error is in the context of regularization (the lasso penalty).