“It is better to be approximately right than precisely wrong”
― Warren Buffett
Overfitting is an important issue that belongs on every Data Scientist’s and Machine Learning Engineer’s modeling checklist. Whether you are using a statistical, econometric, or Machine Learning model, and no matter how simple that model is, you should always make sure it is not overfitting. Otherwise, you run the risk of having a model that looks good on paper but performs very poorly in reality. In this blog post, I will cover the following topics:
– Model Error Rate
– What is Overfitting
– Irreducible Error
– Model Bias
– Model Variance
– Bias-Variance Trade-Off
– What is Regularization?
– Ridge Regression and L2 norm
– Pros and Cons of Ridge Regression
– Lasso Regression and L1 norm
– Pros and Cons of Lasso Regression
If you have no prior Statistical knowledge or you want to refresh your knowledge in the essential statistical concepts before jumping to the formulas in this article and other Statistical and ML concepts, you can check this article: Fundamentals of statistics for Data Scientists and Data Analysts
Note that this article is the extended version of my previous article introducing the Bias-variance Trade-Off: Bias-Variance Trade-off in Machine Learning
Model Error Rate
In order to evaluate the performance of a model, we need to look at the amount of error it makes. For simplicity, let’s assume we have a simple regression model that uses a single independent variable X to model the numeric dependent variable Y. That is, we fit our model on our training observations {(x_1,y_1),(x_2,y_2),…,(x_n,y_n)} and obtain the estimate f^ (f_hat).
We can then compute f^(x_1),f^(x_2),…,f^(x_n). If these are approximately equal to y_1,y_2,…,y_n, then the training error rate (e.g. MSE) will be small. However, we are not really interested in whether f^(x_k) ≈ y_k; what we really want to know is whether f^(x_0) is approximately equal to y_0, where (x_0, y_0) is an unseen test data point that was not used during the training of the model. We want to choose a method that gives the lowest test error rate, as opposed to the lowest training error rate. Mathematically, the model error rate of this example method can be expressed as follows:
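Assuming the error rate meant here is the mean squared error (MSE) mentioned above, the training version averages over the n training points, while the test version is the expected squared error at an unseen point (x_0, y_0). In LaTeX notation, this can be sketched as:

```latex
\mathrm{MSE}_{\text{train}} = \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - \hat{f}(x_i)\bigr)^2,
\qquad
\mathrm{MSE}_{\text{test}} = \mathbb{E}\Bigl[\bigl(y_0 - \hat{f}(x_0)\bigr)^2\Bigr]
```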
The fundamental problem with using training error rate to evaluate the model performance is that there is no guarantee that the method with the lowest training error rate will also have the lowest test error rate. Roughly speaking, the problem is that many ML or statistical methods specifically estimate model coefficients or parameters to minimize the training error rate. For these methods, the training error rate can be quite small, but the test error rate is often much larger.
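To make the gap between training and test error concrete, here is a minimal sketch using only NumPy. The "true" function, the noise level, the sample sizes, and the polynomial degrees are all assumptions chosen for illustration; they do not come from a real data set.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    # Hypothetical "true" relationship, chosen only for illustration
    return np.sin(2 * np.pi * x)

def make_data(n):
    # y = f(x) + noise, with an assumed noise standard deviation of 0.3
    x = rng.uniform(0, 1, n)
    y = true_f(x) + rng.normal(0, 0.3, n)
    return x, y

def mse(y, y_hat):
    return float(np.mean((y - y_hat) ** 2))

x_train, y_train = make_data(30)     # small training set
x_test, y_test = make_data(1000)     # unseen test set

results = {}
for degree in (1, 3, 10):
    # Fit a polynomial of the given degree by least squares on the training data
    coefs = np.polyfit(x_train, y_train, degree)
    train_mse = mse(y_train, np.polyval(coefs, x_train))
    test_mse = mse(y_test, np.polyval(coefs, x_test))
    results[degree] = (train_mse, test_mse)
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```

The training MSE can only go down as the polynomial degree grows, because each higher-degree fit contains the lower-degree one as a special case. The test MSE carries no such guarantee, which is exactly why it is the quantity to compare.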
What is Overfitting?
The term overfitting describes a model that performs well on the data it was trained on but poorly on unseen data. When the Machine Learning model has a low error rate on the training data (e.g. low training MSE) but a higher error rate when applied to the test data (e.g. high test MSE), we call this overfitting. In the opposite situation, where the ML model fails to follow the data closely and to accurately capture the relationships between a dataset’s features and the target variable, we call it underfitting.
Overfitting happens when the Machine Learning model follows the training data too closely and fits the noise in the data. As a result, once the data changes, for example when the test data is used, the model fails to capture the true relationship between the features and the target.
To understand the problem of overfitting, you need to be familiar with the Bias-Variance Trade-Off and know what the Irreducible Error, Bias, and Variance of a Machine Learning model are. Additionally, you need to know the composition of the model error rate. Finally, you need to know how these terms relate to model flexibility and model performance.
To solve the overfitting problem, you have two options:
- choose another model that is less flexible (less flexible models tend to have higher bias but lower variance)
- adjust the model to make it less flexible (Regularization)
Irreducible Error
The accuracy of yˆ as a prediction for y depends on two quantities, which we can call the reducible error and the irreducible error. In general, fˆ will not be a perfect estimate for f, and this inaccuracy will introduce some error. This error is reducible since we can potentially improve the accuracy of fˆ by using the most appropriate Machine Learning model to estimate f. However, even if it were possible to find a model that estimated f perfectly, so that the estimated response took the form yˆ = f(x), our prediction would still have some amount of error in it. This happens because y is also a function of the error term ε, which, by definition, cannot be predicted using the predictor x.
So, the variability associated with the error term ε also affects the accuracy of the predictions. This is known as the irreducible error because, no matter how well we estimate f, we cannot reduce the error introduced by ε. Hence, the irreducible error in the model is the variance of the error term ε and can be expressed as follows:
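Assuming the usual setup y = f(x) + ε with a zero-mean error term ε that is independent of x, and treating fˆ and x as fixed, the split into reducible and irreducible error can be written as:

```latex
\mathbb{E}\bigl[(y - \hat{y})^2\bigr]
  = \underbrace{\bigl(f(x) - \hat{f}(x)\bigr)^2}_{\text{reducible}}
  + \underbrace{\operatorname{Var}(\varepsilon)}_{\text{irreducible}}
```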
Unlike reducible error, irreducible error arises from randomness or natural variability in the system, so we cannot avoid or reduce it by choosing a better model.
Bias of Machine Learning Model
The inability of the model to capture the true relationship in the data is called bias. Hence, ML models that are able to detect the true relationship in the data have low bias. Usually, complex or more flexible models tend to have a lower bias than simpler models. Mathematically, the bias of the model can be expressed as follows:
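In standard notation, where the expectation is taken over different training sets used to fit fˆ (an assumption about the setup, since the formula itself is not shown here), the bias at a point x_0 is:

```latex
\operatorname{Bias}\bigl(\hat{f}(x_0)\bigr) = \mathbb{E}\bigl[\hat{f}(x_0)\bigr] - f(x_0)
```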
Variance of Machine Learning Model
The variance of the model is the degree of inconsistency in its performance when it is applied to different data sets. More precisely, it measures how much the estimate f^ would change if it were fitted on a different training set. When a model trained on the training data performs very differently on the test data, the model has a large variance. Complex or more flexible models tend to have higher variance than simpler models.
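Using the same convention as for the bias, with the expectation taken over training sets, the variance of the model at a point x_0 can be written as:

```latex
\operatorname{Var}\bigl(\hat{f}(x_0)\bigr)
  = \mathbb{E}\Bigl[\bigl(\hat{f}(x_0) - \mathbb{E}[\hat{f}(x_0)]\bigr)^2\Bigr]
```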
Bias-Variance Trade-Off
It can be mathematically proven that the expected test error rate of a Machine Learning model, for a given value x_0, can be described in terms of the Variance of the model, the Bias of the model, and the irreducible error of the model. More specifically, the expected test error of a supervised Machine Learning model is equal to the sum of the Variance of the model, its squared Bias, and the irreducible error.
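Written out for the squared-error case, and assuming y_0 = f(x_0) + ε with a zero-mean error term ε, the decomposition reads:

```latex
\mathbb{E}\Bigl[\bigl(y_0 - \hat{f}(x_0)\bigr)^2\Bigr]
  = \operatorname{Var}\bigl(\hat{f}(x_0)\bigr)
  + \Bigl[\operatorname{Bias}\bigl(\hat{f}(x_0)\bigr)\Bigr]^2
  + \operatorname{Var}(\varepsilon)
```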
Hence, to minimize the expected test error rate, we need to select a Machine Learning method that simultaneously achieves low variance and low bias. However, the Variance and the Bias of a model tend to move in opposite directions: reducing one usually increases the other.
Complex models or more flexible models tend to have a lower bias but at the same time, these models tend to have higher variance than simpler models.
Let’s get the earlier graph back, again:
As a general rule, as the flexibility of the methods increases, the variance will increase and the bias will decrease. The relative rate of change of these two quantities determines whether the test error rate will increase or decrease.
As we increase the flexibility of a class of methods, the bias tends to initially decrease faster than the variance increases. Consequently, the expected test error rate declines. However, at some point, increasing flexibility has little impact on the bias but starts to significantly increase the variance. So, it’s all about finding that balance, the best-fit point, where the Test Error Rate is about to change its direction and move upwards.
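The behaviour described above can be checked with a small simulation. This is only a sketch under assumed settings: a hypothetical true function, an assumed noise level, polynomial fits as the "class of methods", and the polynomial degree as the measure of flexibility. The model is refitted on many fresh training sets, and the squared bias and variance of the prediction at one point x0 are estimated from those refits.

```python
import numpy as np

rng = np.random.default_rng(42)
sigma = 0.3    # assumed noise standard deviation
x0 = 0.25      # point at which the prediction is examined

def true_f(x):
    # Hypothetical "true" relationship, chosen only for illustration
    return np.sin(2 * np.pi * x)

def fit_and_predict(degree, n=30):
    # Draw a fresh training set, fit a polynomial, and predict at x0
    x = rng.uniform(0, 1, n)
    y = true_f(x) + rng.normal(0, sigma, n)
    return np.polyval(np.polyfit(x, y, degree), x0)

summary = {}
for degree in (1, 3, 9):
    preds = np.array([fit_and_predict(degree) for _ in range(500)])
    bias_sq = float((preds.mean() - true_f(x0)) ** 2)   # squared bias at x0
    variance = float(preds.var())                       # variance at x0
    summary[degree] = (bias_sq, variance)
    print(f"degree={degree}  bias^2={bias_sq:.4f}  variance={variance:.4f}  "
          f"bias^2 + variance + sigma^2={bias_sq + variance + sigma**2:.4f}")
```

In this setup the straight line has the largest squared bias, while the degree-9 polynomial has a larger variance than the straight line. The last printed column adds the irreducible error to give an estimate of the expected test error at x0.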
Based on the Bias and Variance relationship a Machine Learning model can have 4 possible scenarios:
- High Bias and High Variance (The Worst-Case Scenario)
- Low Bias and Low Variance (The Best-Case Scenario)
- Low Bias and High Variance (Overfitting)
- High Bias and Low Variance (Underfitting)
What is Regularization?
Regularization, or Shrinkage, is a popular way to address the overfitting problem. The idea behind regularization is to introduce a little bias into the Machine Learning model while significantly decreasing its variance. It is called Shrinkage because the method shrinks some of the estimated coefficients towards zero, penalizing them for increasing the variance of the model. The two most popular regularization techniques are Ridge Regression, based on the L2 norm, and Lasso Regression, based on the L1 norm.
Ridge Regression
Let’s look at a multiple linear regression example with p independent variables, or predictors, that are used to model the dependent variable y. You might recall that the most popular technique for estimating the parameters of a linear regression, provided its assumptions are satisfied, is Ordinary Least Squares (OLS), which finds the optimal coefficients by minimizing the residual sum of squares (RSS) of the model (you can read more about this here). That is:
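In standard notation (assumed here: n observations, an intercept β_0, and x_ij denoting the value of predictor j for observation i), the OLS objective can be written as:

```latex
\hat{\beta}^{\mathrm{OLS}}
  = \arg\min_{\beta}\ \mathrm{RSS}
  = \arg\min_{\beta}\ \sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Bigr)^2
```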
where β represents the coefficient estimates for the different variables or predictors (X).
Ridge Regression is pretty similar to OLS, except that the coefficients are estimated by minimizing a slightly different cost or loss function. Namely, the Ridge Regression coefficient estimates βˆR are the values that minimize the following loss function:
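Using the same assumed notation as for OLS (n observations, intercept β_0, x_ij the value of predictor j for observation i), and leaving the intercept unpenalized as is conventional, the ridge loss can be written as:

```latex
\sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Bigr)^2
  + \lambda\sum_{j=1}^{p}\beta_j^2
  \;=\; \mathrm{RSS} + \lambda\sum_{j=1}^{p}\beta_j^2
```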
where λ (lambda, which is always non-negative, λ ≥ 0) is the tuning parameter or the penalty parameter. As can be seen from this formula, Ridge uses the L2 penalty, that is, the squared L2 norm of the coefficients. In this way, Ridge Regression penalizes large coefficients and shrinks them towards zero, reducing the overall model variance, but the coefficients never become exactly zero. This means that all p predictors remain in the model.
L2 Norm (Euclidean Distance)
The L2 norm is a term from Linear Algebra that stands for the Euclidean norm, which can be represented as follows:
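For a coefficient vector β with p entries, the Euclidean norm is given below. Note that the ridge penalty is usually written with the squared norm, so the square root drops out:

```latex
\lVert \beta \rVert_2 = \sqrt{\sum_{j=1}^{p}\beta_j^2},
\qquad
\lVert \beta \rVert_2^2 = \sum_{j=1}^{p}\beta_j^2
```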
Tuning parameter λ
The tuning parameter λ serves to control the relative impact of the penalty on the regression coefficient estimates. When λ = 0, the penalty term has no effect, and ridge regression produces the ordinary least squares estimates. However, as λ → ∞ (gets very large), the impact of the shrinkage penalty grows, and the ridge regression coefficient estimates approach 0.
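The effect of λ can be seen in a short NumPy sketch. It relies on the well-known closed-form ridge solution (XᵀX + λI)⁻¹Xᵀy on centered data, so the intercept is not penalized. The synthetic data, the "true" coefficients, and the helper name ridge_coefs are hypothetical and exist only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 5
X = rng.normal(size=(n, p))
beta_true = np.array([3.0, -2.0, 0.0, 0.0, 1.0])   # hypothetical coefficients
y = X @ beta_true + rng.normal(scale=1.0, size=n)

# Center X and y so that the intercept drops out and is not penalized
Xc = X - X.mean(axis=0)
yc = y - y.mean()

def ridge_coefs(Xc, yc, lam):
    # Closed-form ridge solution: (X'X + lam * I)^(-1) X'y
    k = Xc.shape[1]
    return np.linalg.solve(Xc.T @ Xc + lam * np.eye(k), Xc.T @ yc)

# Ordinary least squares on the same centered data, for comparison
ols = np.linalg.lstsq(Xc, yc, rcond=None)[0]

norms = []
for lam in (0.0, 1.0, 10.0, 100.0, 1e6):
    b = ridge_coefs(Xc, yc, lam)
    norms.append(float(np.linalg.norm(b)))
    print(f"lambda={lam:>9.1f}  L2 norm={norms[-1]:.4f}  coefs={np.round(b, 3)}")
```

With λ = 0 the result coincides with the OLS estimates, and the L2 norm of the coefficient vector shrinks as λ grows. The coefficients get very close to zero for a huge λ, but they are not set exactly to zero.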
Why does Ridge Regression Work?
Ridge regression’s advantage over ordinary least squares comes from the bias-variance trade-off introduced earlier. As λ, the penalty parameter, increases, the flexibility of the ridge regression fit decreases, leading to decreased variance but increased bias.
Pros
- solves overfitting
- easy to understand
Cons
- low model interpretability if p is large
Ridge Regression penalizes the coefficients through the tuning parameter λ, shrinking them towards zero, but they never become exactly zero.
Lasso Regression
One of the biggest disadvantages of Ridge Regression is that it includes all p predictors in the final model. A large λ shrinks the coefficients towards zero, but they never become exactly zero, which becomes a problem when your model has a large number of features and therefore low interpretability.
Lasso Regression overcomes this disadvantage of Ridge Regression. Namely, the Lasso Regression coefficient estimates βˆλL are the values that minimize:
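In the same assumed notation as for ridge (n observations, intercept β_0, x_ij the value of predictor j for observation i), the lasso loss replaces the squared coefficients in the penalty with their absolute values:

```latex
\sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Bigr)^2
  + \lambda\sum_{j=1}^{p}\lvert\beta_j\rvert
  \;=\; \mathrm{RSS} + \lambda\sum_{j=1}^{p}\lvert\beta_j\rvert
```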
As with Ridge Regression, the Lasso shrinks the coefficient estimates towards zero. However, the Lasso uses the L1 penalty, or L1 norm, which has the effect of forcing some of the coefficient estimates to be exactly equal to zero when the tuning parameter λ is sufficiently large. Hence, like many feature selection techniques, Lasso Regression performs variable selection in addition to addressing the overfitting problem.
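To see coefficients being set exactly to zero, here is a minimal sketch of the Lasso fitted by cyclic coordinate descent with soft-thresholding, one common way to solve this problem (not necessarily what any particular library does internally). The synthetic data and the function names soft_threshold and lasso_cd are hypothetical, and the objective is assumed to be RSS + λ·Σ|β_j| on centered data.

```python
import numpy as np

def soft_threshold(z, t):
    # Shrink z towards zero by t; values with |z| <= t become exactly zero
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Minimize RSS + lam * sum(|beta_j|) by cyclic coordinate descent."""
    p = X.shape[1]
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual: what is left after removing all other predictors
            r_j = y - X @ beta + X[:, j] * beta[j]
            # Exact one-dimensional minimizer for coordinate j
            beta[j] = soft_threshold(X[:, j] @ r_j, lam / 2) / col_sq[j]
    return beta

rng = np.random.default_rng(1)
n, p = 50, 5
X = rng.normal(size=(n, p))
beta_true = np.array([3.0, -2.0, 0.0, 0.0, 1.0])   # hypothetical coefficients
y = X @ beta_true + rng.normal(scale=1.0, size=n)
Xc, yc = X - X.mean(axis=0), y - y.mean()           # center: no intercept needed

ols = np.linalg.lstsq(Xc, yc, rcond=None)[0]
lam_max = 2 * np.max(np.abs(Xc.T @ yc))             # smallest lam that zeroes everything
beta_mid = lasso_cd(Xc, yc, 40.0)
beta_max = lasso_cd(Xc, yc, lam_max)

print("OLS:            ", np.round(ols, 3))
print("Lasso, lam=40:  ", np.round(beta_mid, 3))
print("Lasso, lam_max: ", np.round(beta_max, 3))
```

The soft-thresholding step is what produces exact zeros: whenever the correlation between a predictor and the partial residual is smaller in magnitude than λ/2, that coefficient is set to zero instead of merely being made small.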
L1 Norm (Manhattan Distance)
The L1 norm is a term from Linear Algebra that stands for the Manhattan norm, which can be represented as follows:
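For a coefficient vector β with p entries, the L1 norm is simply the sum of the absolute values of its entries:

```latex
\lVert \beta \rVert_1 = \sum_{j=1}^{p}\lvert\beta_j\rvert
```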
Why does Lasso Regression Work?
Like Ridge Regression, Lasso Regression’s advantage over ordinary least squares comes from the bias-variance trade-off introduced earlier. As λ increases, the flexibility of the lasso regression fit decreases, leading to decreased variance but increased bias. Additionally, Lasso also performs feature selection.
Pros
- solves overfitting
- easy to understand
- improves model interpretability
Cons
- decreases the variance of the model less compared to Ridge Regression
Lasso Regression shrinks the coefficient estimates towards zero and even forces some of these coefficients to be exactly equal to zero when the tuning parameter λ is sufficiently large. So, like many feature selection techniques, Lasso Regression performs variable selection in addition to addressing the overfitting problem.
The comparison between Ridge Regression and Lasso Regression becomes clear when the two earlier graphs are placed next to each other.