Learning Curves in Machine Learning
Evaluation metrics for classification and regression problems are good tools for understanding model performance. Learning curves, on the other hand, give us more insight into the training process of the model, and we can use these insights to understand how the model can be improved. Bias is the error introduced by overly simplistic assumptions in the model: a high-bias model is too simple to capture the underlying pattern and underfits the training data, as the straight-line fit to quadratic data below illustrates.
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(0)
# Generate 30 points with a quadratic relationship plus noise
X = 2 - 3 * np.random.normal(0, 1, 30)
y = X - 2 * (X ** 2) + np.random.normal(-10, 10, 30)
# A straight line as the "model": too simple for quadratic data
pnt_1 = [-5, 10]
pnt_2 = [0, -185]
plt.title("High Bias (Underfitting)")
plt.scatter(X, y, s=10, label='training data points')
plt.plot(pnt_1, pnt_2, color='r', label='model fit')
plt.legend(loc='best')
plt.grid()
plt.show()
Variance is the amount of variability in the predictions made by the model. It can also be defined as the amount by which the predictions change as we change the training dataset. A model with high variance captures too many of the details in the training dataset, resulting in a complex model that fails to generalize to test data. A complex model that overfits the training data yields a very low error rate on the training set but a high error rate on the test set.
# Connect every training point directly: a "model" that memorizes the data
Xs, ys = zip(*sorted(zip(X, y)))
plt.title("High Variance (Overfitting)")
plt.scatter(Xs, ys, s=10, label='training data points')
plt.plot(Xs, ys, color='r', label='model fit')
plt.legend(loc='best')
plt.grid()
plt.show()
# Fit a degree-2 polynomial: matches the true quadratic structure of the data
plt.scatter(X, y, label="training data points")
p = np.polyfit(X, y, 2)
xfit = np.linspace(min(X), max(X), 1000)
yfit = np.polyval(p, xfit)
plt.plot(xfit, yfit, color='r', label='model fit')
plt.grid()
plt.title("Low Bias and Low Variance - Ideal Fit")
plt.legend(loc='best')
plt.show()
If we want to keep the error of our model low, we should try to reduce both bias and variance. However, it is usually not possible to have both low bias and low variance at the same time. Therefore, a trade-off is necessary, and this is where learning curves come in handy.
Recall from the MSE section:
$$ Err(x) = Var(x) + (Bias(x))^2 $$
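For context, this is the bias-variance decomposition of the expected squared error. Writing $\hat{f}(x)$ for the model's prediction and including the irreducible noise term $\sigma_\epsilon^2$ (which the shorter formula above leaves out), the full decomposition reads:
$$ Err(x) = \mathbb{E}\big[(y - \hat{f}(x))^2\big] = \big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2 + \mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big] + \sigma_\epsilon^2 = (Bias(x))^2 + Var(x) + \sigma_\epsilon^2 $$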
Having talked about bias and variance and the effect of model complexity on them, we can continue with learning curves. A learning curve tells us how the error changes as the training set size increases.
Imagine that we first have a single data point in our training set. If we fit a model to that one point, we will achieve an error of 0, because it is very easy to fit the model perfectly to a single point. However, the same model will perform very badly on a validation set of normal size, because it has not learned enough to actually capture the behaviour in the data (of course, there is no behaviour to capture with one data point). As we increase the number of data points in the training set, it becomes more difficult to fit the model perfectly to all of them, so the model has to generalize, minimizing the error across all points. This introduces an error on the training set which increases as the training set size increases. At the same time, the model's error on the validation set starts to decrease, because the model is able to capture more of the data's behaviour as the training size grows.
This is exactly what we use learning curves for: to observe how the error on the training and validation sets changes as the training set size increases.
We can use the diabetes dataset from sklearn to create some learning curves, together with the learning_curve() function from the scikit-learn library. As the model, let's start with linear regression.
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve
# Load the diabetes regression dataset
X, y = datasets.load_diabetes(return_X_y=True)
X.shape  # (442, 10)
We can choose a range of training set sizes going up to 350, which is roughly 80% of the 442 samples we have (following the 80/20 rule for the train/test split).
train_sizes = [1, 25, 50, 100, 200, 350]
# 5-fold cross-validated learning curve, scored with (negative) RMSE
train_sizes, train_scores, val_scores = learning_curve(
    estimator=LinearRegression(),
    X=X,
    y=y,
    train_sizes=train_sizes,
    cv=5,
    scoring='neg_root_mean_squared_error')
import pandas as pd
print('Training scores:\n', pd.DataFrame(train_scores, columns=[i + 1 for i in range(5)], index=train_sizes))
print('-' * 100)
print('Validation scores:\n', pd.DataFrame(val_scores, columns=[i + 1 for i in range(5)], index=train_sizes))
We see 5 scores for each training size because of the 5-fold cross-validation. We can average across the folds to obtain a single metric for each training size. Also, because we used negative RMSE as the scoring function, we flip the signs back to positive by multiplying by -1.
# Average over the 5 folds and flip the sign back to positive RMSE
mean_train_scores = -train_scores.mean(axis=1)
mean_val_scores = -val_scores.mean(axis=1)
print('Mean training scores\n\n', pd.Series(mean_train_scores, index=train_sizes))
print('\n', '-' * 20)  # separator
print('\nMean validation scores\n\n', pd.Series(mean_val_scores, index=train_sizes))
import matplotlib.pyplot as plt
plt.plot(train_sizes, mean_train_scores, label='Training error')
plt.plot(train_sizes, mean_val_scores, label='Validation error')
plt.ylabel('RMSE')
plt.xlabel('Training set size')
plt.title('Learning curves')
plt.legend()
plt.show()
On the graph we can observe the behaviour we just described (except for the point where the training size is 1: that point happens to yield a low validation RMSE, which is purely coincidental, since a model fit on one point appears to perform better than at 25). The training error at training size 1 is equal to 0, as we expected.
We can see that the linear regression model does not perform perfectly: both the training and validation errors converge to an RMSE of around 55. From a training size of about 200 onwards, neither the training nor the validation RMSE changes much. Therefore, if we want better model performance, adding more training points is not going to help us here. Instead of collecting more training data, we should try a more complex ML model/algorithm. Adding more relevant features to the data can also help, since it increases model complexity as well.
Well, how can we tell whether we have high bias or high variance?
A sign of high bias in the model is a high validation error. Our validation error converges to around 55, which is quite high compared to the mean of the target values (around 150). So we can say that we have a bias problem.
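To put that number in perspective, one quick sanity check (a minimal sketch, not part of the original analysis) is to compare it against a trivial baseline that always predicts the mean of the target:
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import cross_val_score
# RMSE of a model that always predicts the training mean, under the same 5-fold CV
baseline_rmse = -cross_val_score(DummyRegressor(strategy='mean'), X, y, cv=5,
                                 scoring='neg_root_mean_squared_error').mean()
print(y.mean(), baseline_rmse)
If our validation RMSE were close to this baseline, the model would barely be better than always guessing the mean.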
But do we have high bias or low bias in the model? Low bias means that the model fits the training data very well (possibly overfitting), resulting in a low training error. High bias means that the model is too simple to capture the important behaviour in the training data (it does not fit the training data well), which means a high training error. Since our training curve also converges to the same value as the validation curve, which we considered high, we can say that we have a high bias problem.
To see if we have a variance problem we can check two things:
- Observing the gap between the validation and training curves
- Watching the training error as the training set size increases
High variance means that the model fits the training data too well, which results in a large validation error because the model fails to generalize. However, fitting the training data too well also yields a low training error. This means we expect a large gap between the validation and training curves as they flatten. For low variance, the opposite holds: a smaller gap means lower variance.
Moreover, a high training error alone is also a good way to detect low variance. Low variance here means our algorithm is too simple and underfits the training data, which results in a high training error. Hence, a high training error alone indicates a low-variance problem in the model.
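As a rough way to put numbers on these two signals, we can look at the gap and the training error at the largest training size, using the mean scores computed above (a small illustrative sketch):
# Gap between validation and training error at the largest training size
gap = mean_val_scores[-1] - mean_train_scores[-1]
final_train_error = mean_train_scores[-1]
print(f'Gap: {gap:.2f}, final training RMSE: {final_train_error:.2f}')
# A small gap together with a high training RMSE points to low variance and high bias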
In our case, we observe a small gap together with a large training error, which indicates low variance in the model.
Finally we can confirm that our model has:
- High bias and low variance, underfitting the training data.
- Adding more training data points is not likely to help since both validation and training curves have converged/flattened.
At this point, the next step would be to select a more complex algorithm. Adding more features can also be an option; however, since we cannot add additional features to this dataset, we would have to generate new features using feature engineering techniques, which is beyond our scope here. Moreover, if the model had regularization, decreasing it would also help: we use regularization to prevent overfitting, so decreasing it lets the model fit the training data better, decreasing bias and increasing variance. That is also not really in scope here.
At this stage, a more complex model should help us decrease the bias (at the cost of increasing the variance). Let's try the RandomForestRegressor model from the sklearn library:
from sklearn.ensemble import RandomForestRegressor
# Same learning-curve setup as before, now with a random forest
train_sizes, train_scores, val_scores = learning_curve(
    estimator=RandomForestRegressor(),
    X=X,
    y=y,
    train_sizes=train_sizes,
    cv=5,
    scoring='neg_root_mean_squared_error')
mean_train_scores = -train_scores.mean(axis=1)
mean_val_scores = -val_scores.mean(axis=1)
plt.plot(train_sizes, mean_train_scores, label='Training error')
plt.plot(train_sizes, mean_val_scores, label='Validation error')
plt.ylabel('RMSE')
plt.xlabel('Training set size')
plt.title('Learning curves')
plt.legend()
plt.show()
Our training error has converged to a much lower value than with the linear regression model, so we have managed to decrease the bias. However, there is now a much larger gap between the training and validation curves, indicating higher variance. Basically, our model fits the training data quite well but fails to generalize and perform well on the validation data, indicating an overfitting problem. I think we are thinking of the same thing here?! Bingo! We need to decrease the model complexity.
There are a few ways to decrease model complexity with a random forest model, and one of them is to limit the maximum depth of each decision tree. We can do that by specifying the max_depth parameter when we define the RandomForestRegressor model.
# Limit tree depth to reduce model complexity (and variance)
train_sizes, train_scores, val_scores = learning_curve(
    estimator=RandomForestRegressor(max_depth=5),
    X=X,
    y=y,
    train_sizes=train_sizes,
    cv=5,
    scoring='neg_root_mean_squared_error')
mean_train_scores = -train_scores.mean(axis=1)
mean_val_scores = -val_scores.mean(axis=1)
plt.plot(train_sizes, mean_train_scores, label='Training error')
plt.plot(train_sizes, mean_val_scores, label='Validation error')
plt.ylabel('RMSE')
plt.xlabel('Training set size')
plt.title('Learning curves')
plt.legend()
plt.show()
We have managed to shrink the gap, which means we were able to reduce model complexity and hence variance. However, the gap was reduced mainly by increasing the training error, with only a slight improvement in validation error. We will not spend further time trying to improve model performance in this case, since we have already covered the most important concepts about learning curves. However, as a last note, some steps to improve this model further would be:
- Adding more features
- Feature engineering or selection
- Hyperparameter optimization to obtain the best hyperparameters for this model (a brief sketch follows below)
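As an illustration of that last point, here is a minimal sketch of a grid search over a couple of random forest hyperparameters; the particular grid values are arbitrary and only meant to show the mechanics:
from sklearn.model_selection import GridSearchCV
# Hypothetical, small grid purely for illustration
param_grid = {'max_depth': [3, 5, 10], 'n_estimators': [100, 300]}
grid = GridSearchCV(RandomForestRegressor(random_state=0),
                    param_grid, cv=5,
                    scoring='neg_root_mean_squared_error')
grid.fit(X, y)
print(grid.best_params_, -grid.best_score_)  # best hyperparameters and their mean CV RMSE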