Regularization with linear regression



Hi everyone,
I am studying regularization with linear regression. I came across a statement that L1 is more sparse than L2. What exactly does it mean?


Sparsity means that we're using fewer variables in our prediction.

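In the most crisp form: L1 regularization adds a penalty proportional to |x| for each coefficient x, and this penalty is minimized together with the prediction error. The optimization procedure will tend to set to exactly 0 those coefficients that contribute little to reducing the error.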

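In L2, a penalty proportional to x^2 is added instead. For a small coefficient (|x| < 1), squaring makes the penalty even smaller than |x|, and its slope 2x fades away as x approaches 0, whereas the slope of |x| stays at 1 all the way down. So with L2 there is almost no reward for pushing an already small coefficient the rest of the way, and minimizing a function with L2 regularization will tend not to set the coefficient to exactly 0.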

Therefore, with L1 regularization the number of variables used would be smaller, since some of the coefficients would be made zero during optimization. Hence, L1 is more sparse than L2.
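To put numbers on the |x| versus x^2 comparison, here is a minimal sketch (the function names are just illustrative, not from any library). It prints both penalties and their slopes for a few coefficient values. Note that x^2 is only the smaller penalty when |x| < 1, and that the L2 slope shrinks with x while the L1 slope does not.

```python
# Compare the L1 penalty |x| with the L2 penalty x^2 as a coefficient shrinks.
def l1_penalty(x):
    return abs(x)

def l2_penalty(x):
    return x * x

def l1_slope(x):
    # Derivative of |x| for x != 0: always +1 or -1, however small x is.
    return 1.0 if x > 0 else -1.0

def l2_slope(x):
    # Derivative of x^2: shrinks toward 0 together with x.
    return 2.0 * x

for x in [2.0, 1.0, 0.5, 0.1, 0.01]:
    print(x, l1_penalty(x), l2_penalty(x), l1_slope(x), l2_slope(x))
```

The constant L1 slope is what keeps pushing a small coefficient down until it reaches zero; the L2 push fades out before it gets there.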


Thanks for the answer!
I couldn't follow the point that on squaring, the regularization part becomes even smaller.
Can you elaborate a bit on how that happens?


Sparsity means that only a few of the components in a vector or matrix are non-zero. When you use L1 regularization, it tends to drive many of the coefficients in your model to exactly zero. So there will be a few non-zero coefficients and the rest will be zero. In this way, the L1 norm also helps in feature selection (selecting the features with non-zero coefficients).

When you use L2 regularization, it will tend to make the coefficient values smaller (it distributes weight across all features) but not exactly zero. This means that you will still have a lot of non-zero values in your coefficient vector. That is why L1 is more sparse than L2.
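A small sketch of the difference, under a simplifying assumption: each coefficient is treated on its own (as happens with orthonormal features), and z is its unregularized least-squares value. In that case minimizing 0.5*(w - z)^2 + lam*|w| gives the soft-thresholding rule, and minimizing 0.5*(w - z)^2 + 0.5*lam*w^2 gives plain rescaling. The function names and the sample values are made up for illustration.

```python
def soft_threshold(z, lam):
    # L1 solution: move z toward 0 by lam, and clamp to exactly 0 if |z| <= lam.
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def ridge_shrink(z, lam):
    # L2 solution: scale z down by a constant factor; never exactly 0 unless z is 0.
    return z / (1.0 + lam)

unregularized = [3.0, -1.5, 0.4, -0.2, 0.05]  # hypothetical coefficient values
lam = 0.5

lasso_coefs = [soft_threshold(z, lam) for z in unregularized]
ridge_coefs = [ridge_shrink(z, lam) for z in unregularized]

print("L1:", lasso_coefs)  # the three small coefficients become exactly 0.0
print("L2:", ridge_coefs)  # every coefficient is smaller, none is 0
```

With real, correlated features the solutions are not this simple, but the same pattern (exact zeros for L1, uniform shrinkage for L2) is what you typically see.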

Why this happens has to do with the shape of the L1 and L2 constraint regions. The L2 region is a sphere (a circle in two dimensions). As a result, the contours of the error function usually first touch this L2 ball at a point where all coefficients are non-zero. The L1 region, on the other hand, is diamond shaped, with its corners lying on the axes, so the contours are much more likely to touch it first at a corner, where one or more of the coefficients is exactly zero.
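The geometric picture can be tried out numerically. The sketch below makes one simplifying assumption: the error contours are circles around the unconstrained solution, so the constrained solution is just the closest point of the ball. Real least-squares contours are ellipses. The function names and the starting point [2.0, 0.5] are made up for illustration; the L1 projection uses the standard sort-and-threshold method.

```python
import math

def project_onto_l2_ball(v, radius=1.0):
    # Closest point of the circle/sphere: rescale the whole vector.
    norm = math.sqrt(sum(x * x for x in v))
    if norm <= radius:
        return list(v)
    return [x * radius / norm for x in v]

def project_onto_l1_ball(v, radius=1.0):
    # Closest point of the diamond: soft-threshold every entry by theta,
    # where theta is chosen so the absolute values sum to the radius.
    if sum(abs(x) for x in v) <= radius:
        return list(v)
    magnitudes = sorted((abs(x) for x in v), reverse=True)
    running_sum = 0.0
    theta = 0.0
    for k, m in enumerate(magnitudes, start=1):
        running_sum += m
        candidate = (running_sum - radius) / k
        if m > candidate:
            theta = candidate
    return [math.copysign(max(abs(x) - theta, 0.0), x) for x in v]

unconstrained = [2.0, 0.5]  # hypothetical unregularized coefficients
l2_point = project_onto_l2_ball(unconstrained)
l1_point = project_onto_l1_ball(unconstrained)

print("L2 ball:", l2_point)  # both coordinates stay non-zero
print("L1 ball:", l1_point)  # lands on the corner [1.0, 0.0]
```

The L2 result keeps the direction of the original vector, so the small coordinate survives; the L1 result snaps to a corner of the diamond and the small coordinate is dropped.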

Please refer to the images shown. I have taken them from here:
You can refer to this blog for learning more. Hope it helps. Good Luck!


Hello, thanks for the clarification.
I have a follow-up question on this. On what basis does Lasso shrink the coefficients? I mean, which features are selected for shrinkage to zero?


Hello @abhishield1986
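Consider an example where you have 10 features, 4 of which are highly correlated. Lasso will tend to keep one of those features and shrink the coefficients of the other three to zero, because the others add little once one of them is already in the model. Which one it keeps depends on the data and can be fairly arbitrary.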
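Here is a tiny hand-made illustration of that behaviour, written as a sketch rather than a reference implementation. It uses plain coordinate descent on the objective (1/(2n))*||y - Xw||^2 plus either lam*||w||_1 (lasso) or (lam/2)*||w||^2 (ridge). The data, the function names and lam = 0.1 are all made up: columns 0 and 1 are almost identical, column 2 is unrelated to them.

```python
def soft_threshold(z, lam):
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def coordinate_descent(X, y, lam, penalty, n_iter=500):
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of feature j with the residual that ignores feature j.
            rho = 0.0
            for i in range(n):
                pred_without_j = sum(X[i][k] * w[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
            rho /= n
            scale = sum(X[i][j] ** 2 for i in range(n)) / n
            if penalty == "l1":
                w[j] = soft_threshold(rho, lam) / scale
            else:  # "l2"
                w[j] = rho / (scale + lam)
    return w

# Toy data: feature 1 is feature 0 plus a small wiggle (correlation about 0.995).
X = [[1.0, 1.1, 1.0],
     [1.0, 0.9, -1.0],
     [-1.0, -0.9, -1.0],
     [-1.0, -1.1, 1.0]]
y = [2.8, 1.2, -3.2, -0.8]

lasso_w = coordinate_descent(X, y, lam=0.1, penalty="l1")
ridge_w = coordinate_descent(X, y, lam=0.1, penalty="l2")

print("lasso:", lasso_w)  # roughly [1.9, 0.0, 0.9]: one of the correlated pair is dropped
print("ridge:", ridge_w)  # roughly [1.09, 0.81, 0.91]: the pair shares the weight
```

In this toy data, feature 0 fits y slightly better than feature 1, so lasso keeps feature 0. With a different sample the choice could flip, which is why the selection among correlated features is often described as arbitrary.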
Hope that this clears your doubt.