This article pulls together three types of regularization:

  • Lasso
  • Ridge
  • Elastic Net

Overfitting to the training data is a major problem: an overfitted model has high variance, so its predictions on test data are poor.

  • To overcome this, we add some bias in order to reduce variance. The bias comes in the form of an additional penalty term in the loss function, and the term differs for each of the cases listed above.
  • Equivalently, the penalty can be viewed as a constraint on the model parameters, which enters the minimization of the loss function through a Lagrange multiplier.
  • The penalty discourages learning an overly complex or flexible model, which lowers the risk of overfitting. The least squares method cannot tell more useful predictor variables from less useful ones and includes all of them when building a model. Shrinkage methods help deal with that.

We start with the usual sum of squared residuals (RSS) as the loss function, which is

$$\mathrm{RSS}(\beta) = \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2$$

This is minimized over β to obtain the estimated coefficients. If β (the coefficient vector) is estimated so that the model passes very close to every observed value, we say the model is overfitted, because it will not necessarily make good predictions on future, unseen data. We therefore add a constraint to this function; the constraints for the different types of regularization are given below.
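To make the loss concrete, here is a minimal sketch in plain Python that evaluates the sum of squared residuals for a given set of coefficients. The function names and the toy data are made up for illustration; they are not from any library.

```python
def predict(row, intercept, coefs):
    """Linear prediction: intercept + sum_j coefs[j] * row[j]."""
    return intercept + sum(b * x for b, x in zip(coefs, row))


def rss(X, y, intercept, coefs):
    """Sum of squared residuals over all observations."""
    return sum((yi - predict(row, intercept, coefs)) ** 2 for row, yi in zip(X, y))


# Toy data (made up): y is exactly 1 + 2*x
X = [[0.0], [1.0], [2.0], [3.0]]
y = [1.0, 3.0, 5.0, 7.0]

perfect = rss(X, y, 1.0, [2.0])  # coefficients that reproduce y exactly
worse = rss(X, y, 0.0, [2.0])    # wrong intercept: every residual equals 1

print(perfect, worse)
```

A loss of zero on the training data, as in the first call, is exactly the situation that should make us suspicious when the data are noisy.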

RIDGE REGRESSION (L2 Regularization)

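Here the penalty, or additional term in the loss function, is λ times the sum of squared βi. With RSS denoting the sum of squared residuals, ridge regression minimizes

$$\mathrm{RSS}(\beta) + \lambda \sum_{j=1}^{p} \beta_j^2$$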

The penalty coefficient λ controls the amount of shrinkage: the larger λ is, the smaller the sum of squared βi has to be, so increasing λ flattens the slope and pushes the parameters towards 0 (but never exactly to 0). We can use k-fold cross-validation to pick the λ that gives the lowest prediction error on held-out data.
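The effect of λ is easy to see with a single predictor. The sketch below assumes centered data and no intercept, in which case the ridge slope has the closed form sum(x*y) / (sum(x²) + λ). The helper names, the toy data, the grid of λ values and the simple "every k-th point" fold assignment are all my own choices for illustration.

```python
def ridge_slope(x, y, lam):
    """Ridge estimate for one centered predictor and no intercept:
    argmin_b sum((y_i - b*x_i)^2) + lam * b^2 = sum(x*y) / (sum(x^2) + lam)."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + lam)


def kfold_mse(x, y, lam, k):
    """Mean squared prediction error of ridge_slope under k-fold cross-validation."""
    n = len(x)
    total = 0.0
    for fold in range(k):
        test_idx = [i for i in range(n) if i % k == fold]
        train_idx = [i for i in range(n) if i % k != fold]
        b = ridge_slope([x[i] for i in train_idx], [y[i] for i in train_idx], lam)
        total += sum((y[i] - b * x[i]) ** 2 for i in test_idx)
    return total / n


# Toy centered data (made up): roughly y = 2x plus noise
x = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
y = [-4.1, -2.8, -2.2, -0.9, 1.1, 1.8, 3.2, 3.9]

lambdas = [0.0, 0.1, 1.0, 10.0, 100.0]
slopes = [ridge_slope(x, y, lam) for lam in lambdas]  # shrinks toward 0, never reaches it
cv_errors = {lam: kfold_mse(x, y, lam, k=4) for lam in lambdas}
best_lambda = min(cv_errors, key=cv_errors.get)       # lambda with lowest held-out error

print(slopes)
print(best_lambda)
```

In practice the folds would normally be shuffled and the centering redone inside each training fold; this version keeps things short.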

The ridge penalty can also be used with discrete independent variables and with logistic regression, where the quantity minimized becomes the negative log-likelihood + λ × (sum of squared coefficients).
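As a rough sketch of the logistic case, the function below evaluates the negative log-likelihood of a logistic model plus a ridge penalty on the coefficients. I am assuming the intercept is left unpenalized, which is a common convention but not something stated above; the function name and data are illustrative only.

```python
import math


def penalized_nll(X, y, intercept, coefs, lam):
    """Negative log-likelihood of a logistic model plus lam * sum(coef^2).
    The intercept is not penalized (an assumption of this sketch)."""
    nll = 0.0
    for row, yi in zip(X, y):
        z = intercept + sum(b * x for b, x in zip(coefs, row))
        p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of class 1
        nll -= yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return nll + lam * sum(b * b for b in coefs)


# Toy data (made up): one predictor, binary response
X = [[-1.0], [-0.5], [0.5], [1.0]]
y = [0, 0, 1, 1]

baseline = penalized_nll(X, y, 0.0, [0.0], lam=5.0)     # zero slope: penalty adds nothing
unpenalized = penalized_nll(X, y, 0.0, [1.0], lam=0.0)
penalized = penalized_nll(X, y, 0.0, [1.0], lam=2.0)    # same fit, plus 2 * 1^2

print(baseline, unpenalized, penalized)
```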

One interesting use of ridge regression is when we do not have enough data samples to fit all the parameters of the equation (fewer observations than parameters): least squares then has no unique solution, but ridge regression can still estimate the parameters.
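Here is a tiny illustration of that, assuming two predictors, no intercept and a single observation. The ridge estimate solves (X'X + λI)β = X'y; with λ = 0 this 2×2 system is singular, while any λ > 0 makes it solvable. The function name is hypothetical and only handles exactly two predictors.

```python
def ridge_two_predictors(X, y, lam):
    """Solve (X'X + lam*I) beta = X'y for exactly two predictors (no intercept)."""
    a = sum(r[0] * r[0] for r in X) + lam
    b = sum(r[0] * r[1] for r in X)
    d = sum(r[1] * r[1] for r in X) + lam
    g0 = sum(r[0] * yi for r, yi in zip(X, y))
    g1 = sum(r[1] * yi for r, yi in zip(X, y))
    det = a * d - b * b
    if det == 0:
        raise ValueError("normal equations are singular; no unique solution")
    # Cramer's rule for the 2x2 system
    return ((d * g0 - b * g1) / det, (a * g1 - b * g0) / det)


X = [[1.0, 2.0]]  # one observation, two predictors
y = [3.0]

try:
    ridge_two_predictors(X, y, lam=0.0)  # plain least squares
    ols_unique = True
except ValueError:
    ols_unique = False                   # infinitely many exact fits

beta = ridge_two_predictors(X, y, lam=1.0)
print(ols_unique, beta)
```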

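This technique is used to deal with multicollinearity in the data: it does not remove the correlation between predictors, but it stabilizes the coefficient estimates. With two coefficients, the ridge constraint region is a circle when plotted.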

Before applying ridge regression, we should ensure that all independent variables are on the same scale (preferably centered at the mean with variance 1) and that the dependent variable is centered. This is not an issue in ordinary linear regression without a penalty, because the βi never appear on their own, only in the products xiβi, so rescaling xi is simply absorbed by βi.
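The scale issue can be checked numerically. In this sketch the same predictor is expressed in metres and in centimetres: the unpenalized fit is identical in both units, the ridge fit is not, and standardizing first removes the difference. All names and numbers are made up for illustration, and only one predictor is used to keep the arithmetic transparent.

```python
from statistics import mean, pstdev


def standardize(column):
    """Center a column at 0 and scale it to unit variance (population std)."""
    mu, sigma = mean(column), pstdev(column)
    return [(v - mu) / sigma for v in column]


def center(column):
    mu = mean(column)
    return [v - mu for v in column]


def ridge_fitted(x, y, lam):
    """Fitted values b*x for a one-predictor ridge fit on centered x and y."""
    xc, yc = center(x), center(y)
    b = sum(a * c for a, c in zip(xc, yc)) / (sum(a * a for a in xc) + lam)
    return [b * a for a in xc]


height_m = [1.5, 1.6, 1.7, 1.8, 1.9]
height_cm = [100.0 * h for h in height_m]   # same information, different unit
weight_kg = [50.0, 58.0, 63.0, 72.0, 80.0]

ols_m = ridge_fitted(height_m, weight_kg, lam=0.0)
ols_cm = ridge_fitted(height_cm, weight_kg, lam=0.0)      # same fit as ols_m
ridge_m = ridge_fitted(height_m, weight_kg, lam=1.0)
ridge_cm = ridge_fitted(height_cm, weight_kg, lam=1.0)    # differs from ridge_m
ridge_std_m = ridge_fitted(standardize(height_m), weight_kg, lam=1.0)
ridge_std_cm = ridge_fitted(standardize(height_cm), weight_kg, lam=1.0)  # same again

print(ridge_m[-1], ridge_cm[-1])
print(ridge_std_m[-1], ridge_std_cm[-1])
```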


LASSO REGRESSION (L1 Regularization)

LASSO stands for Least Absolute Shrinkage and Selection Operator. In this case the penalty term is the sum of |βi| rather than the sum of βi².

This difference gives lasso the ability to shrink parameter values to exactly 0. Since it can exclude some variables by shrinking their coefficients to zero, lasso models can be more biased than ridge models and can reduce the variance on test data a little more. Lasso is therefore a feature selection method as well.

The role of λ remains the same: the larger the λ, the more coefficients are forced to zero. When only a few predictors really matter, this shrinkage can give high prediction accuracy.
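With a single centered predictor, the lasso solution can be written with the soft-thresholding operator, which makes the "exact zero" behaviour visible. The sketch below assumes the objective sum((y − b·x)²) + λ|b|, for which the slope is S(sum(x·y), λ/2) / sum(x²); the ridge slope is computed alongside for comparison. Function names and data are illustrative.

```python
def soft_threshold(z, t):
    """Shrink z toward 0 by t; return exactly 0.0 when |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_slope(x, y, lam):
    """argmin_b sum((y_i - b*x_i)^2) + lam*|b| for one centered predictor."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return soft_threshold(sxy, lam / 2.0) / sxx


def ridge_slope(x, y, lam):
    """argmin_b sum((y_i - b*x_i)^2) + lam*b^2 for one centered predictor."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + lam)


# Toy data (made up): y = 0.5 * x, so sum(x*y) = 5 and sum(x^2) = 10
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [-1.0, -0.5, 0.0, 0.5, 1.0]

lambdas = [0.0, 2.0, 10.0, 20.0]
lasso_path = [lasso_slope(x, y, lam) for lam in lambdas]  # hits exactly 0.0
ridge_path = [ridge_slope(x, y, lam) for lam in lambdas]  # shrinks but stays positive

print(lasso_path)
print(ridge_path)
```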

In the plot, the lasso constraint region forms a diamond. Unlike the ridge circle, this shape has corners, and they lie on the axes. When the larger red ellipse touches the diamond at its corner on the y axis, β1 is forced to 0.
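For two coefficients, the two constraint regions can be written out as follows. The budget t is my own notation here: it plays the role of λ in the constrained form of the problem, with a larger λ corresponding to a smaller t, and RSS is the sum of squared residuals.

```latex
\hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \; \mathrm{RSS}(\beta)
  \quad \text{subject to} \quad \beta_1^2 + \beta_2^2 \le t
  \qquad \text{(a disc, no corners)}

\hat{\beta}^{\text{lasso}} = \arg\min_{\beta} \; \mathrm{RSS}(\beta)
  \quad \text{subject to} \quad |\beta_1| + |\beta_2| \le t
  \qquad \text{(a diamond with corners at } (\pm t, 0) \text{ and } (0, \pm t) \text{)}
```

Every corner of the diamond has one coordinate equal to zero, which is why a solution that lands on a corner drops a variable.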

Like ridge, lasso also requires standardizing the predictors.

Ridge VS Lasso

  • Often neither one is overall better.
  • Lasso can set some coefficients to zero, thus performing variable selection, while ridge regression cannot.
  • Both methods can be used with correlated predictors, but they handle the multicollinearity issue differently:
      • In ridge regression, the coefficients of correlated predictors are similar;
      • In lasso, one of the correlated predictors has a larger coefficient, while the rest are (nearly) zeroed.
  • Lasso tends to do well if there are a small number of significant parameters and the others are close to zero (i.e. when only a few predictors actually influence the response).
  • Ridge works well if there are many large parameters of about the same value (i.e. when most predictors impact the response).
  • However, in practice, we don't know the true parameter values, so the previous two points are somewhat theoretical. Just run cross-validation to select the more suited model for a specific case.
  • Or... combine the two!
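The different treatment of correlated predictors in the list above can be reproduced with an extreme toy case: two identical predictors. This is a sketch under my own assumptions (no intercept, a 2×2 closed form for ridge, a basic cyclic coordinate descent for lasso). Note that with exactly duplicated predictors the lasso solution is not unique; coordinate descent simply keeps whichever predictor it updates first.

```python
def soft_threshold(z, t):
    """Shrink z toward 0 by t; return exactly 0.0 when |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def ridge_two_predictors(X, y, lam):
    """Solve (X'X + lam*I) beta = X'y for exactly two predictors."""
    a = sum(r[0] * r[0] for r in X) + lam
    b = sum(r[0] * r[1] for r in X)
    d = sum(r[1] * r[1] for r in X) + lam
    g0 = sum(r[0] * yi for r, yi in zip(X, y))
    g1 = sum(r[1] * yi for r, yi in zip(X, y))
    det = a * d - b * b
    return [(d * g0 - b * g1) / det, (a * g1 - b * g0) / det]


def lasso_cd(X, y, lam, sweeps=50):
    """Cyclic coordinate descent for sum of squared residuals + lam * sum|b_j|."""
    p = len(X[0])
    beta = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            rho, sxx = 0.0, 0.0
            for row, yi in zip(X, y):
                # residual with every predictor's contribution removed except j's
                partial = yi - sum(beta[k] * row[k] for k in range(p) if k != j)
                rho += row[j] * partial
                sxx += row[j] ** 2
            beta[j] = soft_threshold(rho, lam / 2.0) / sxx
    return beta


values = [-2.0, -1.0, 0.0, 1.0, 2.0]
X = [[v, v] for v in values]     # two identical (perfectly correlated) predictors
y = [2.0 * v for v in values]    # response equals x1 + x2

ridge_beta = ridge_two_predictors(X, y, lam=1.0)  # weight split equally
lasso_beta = lasso_cd(X, y, lam=1.0)              # weight on one predictor only

print(ridge_beta)
print(lasso_beta)
```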


Elastic Net is a middle way between Ridge and Lasso

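I have found two ways of doing this. The first considers a convex combination of the ridge and lasso penalties, given by

$$\lambda \Big[ \alpha \sum_{j=1}^{p} |\beta_j| + (1 - \alpha) \sum_{j=1}^{p} \beta_j^2 \Big], \qquad 0 \le \alpha \le 1$$

α being the mixing parameter (conventions differ on which of the two penalties is weighted by α).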
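A small sketch of such a mixed penalty follows. It assumes the convention in which α weights the lasso (L1) term and 1 − α weights the ridge (L2) term; other sources swap the roles or add a factor of 1/2 to the squared term, so treat the exact form, and the function name, as assumptions.

```python
def elastic_net_penalty(coefs, lam, alpha):
    """lam * (alpha * sum|b_j| + (1 - alpha) * sum b_j^2), with 0 <= alpha <= 1."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie between 0 and 1")
    l1 = sum(abs(b) for b in coefs)
    l2 = sum(b * b for b in coefs)
    return lam * (alpha * l1 + (1.0 - alpha) * l2)


coefs = [3.0, -4.0, 0.0]  # sum|b| = 7, sum b^2 = 25

lasso_like = elastic_net_penalty(coefs, lam=2.0, alpha=1.0)  # pure L1 penalty
ridge_like = elastic_net_penalty(coefs, lam=2.0, alpha=0.0)  # pure L2 penalty
mixed = elastic_net_penalty(coefs, lam=2.0, alpha=0.5)       # halfway between

print(lasso_like, ridge_like, mixed)
```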

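The other way, which makes the loss function strictly convex and hence gives it a unique minimum, is to add the sum of squared coefficients (the squared L2 norm) to the lasso objective. With RSS denoting the sum of squared residuals, this gives

$$\mathrm{RSS}(\beta) + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=1}^{p} \beta_j^2$$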

Meanwhile, a naive approach first finds the ridge coefficients and then applies lasso-type shrinkage to them. This kind of estimation incurs a double amount of shrinkage, which leads to increased bias and poor predictions. To correct for this, the coefficients are rescaled by multiplying them by (1+λ2), where λ2 is the ridge penalty coefficient.
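The double shrinkage and its correction can be seen in the simplest possible setting: one centered predictor scaled so that sum(x²) = 1. Under that assumption, minimizing sum((y − b·x)²) + λ1|b| + λ2·b² gives the naive slope S(sum(x·y), λ1/2) / (1 + λ2), and multiplying by (1 + λ2) undoes the ridge part of the shrinkage. The function names and numbers are made up for illustration.

```python
def soft_threshold(z, t):
    """Shrink z toward 0 by t; return exactly 0.0 when |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def naive_elastic_net_slope(x, y, lam1, lam2):
    """argmin_b sum((y_i - b*x_i)^2) + lam1*|b| + lam2*b^2 for one centered predictor."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return soft_threshold(sxy, lam1 / 2.0) / (sxx + lam2)


# Toy data (made up): x is centered with sum(x^2) = 1, and sum(x*y) = 4
x = [-0.5, -0.5, 0.5, 0.5]
y = [-2.0, -2.0, 2.0, 2.0]
lam1, lam2 = 2.0, 1.0

ols = naive_elastic_net_slope(x, y, 0.0, 0.0)          # no shrinkage at all
lasso_only = naive_elastic_net_slope(x, y, lam1, 0.0)  # lasso shrinkage only
naive = naive_elastic_net_slope(x, y, lam1, lam2)      # shrunk twice
corrected = (1.0 + lam2) * naive                       # rescaled estimate

print(ols, lasso_only, naive, corrected)
```

In this one-predictor case the rescaled estimate coincides with the lasso-only estimate, so only the lasso part of the shrinkage remains.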

That was all about regularization. I suggest going through my resources to learn more.

Please like and follow me on Instagram @ codatalicious