In short, we do not want our models to learn the training data by heart and then struggle with new, unseen samples: we want them to generalize well.
We introduced the bias and variance concepts in Part 1; the bias-variance dilemma, model capacity, the train/test split practice and learning-curve analysis in Part 2; cross-validation and regularization techniques in Part 3; and a Python implementation of Ridge regression in Part 4.
We now apply the same workflow used in the model-selection mini-series for a linear binary classification problem (first and second posts) to a non-linear case, where the model needs to separate two circular clouds of points with a circle.
We inspect the model accuracy variability, apply learning and validation curves with Scikit-learn, and select the best model capacity with cross-validation under different regularization penalties.
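The data come from a genCircles helper defined in the earlier posts, which is not reproduced here. As a point of reference, a minimal numpy-only sketch of what such a generator could look like (the radii, noise level and class balance below are assumptions, not the author's actual implementation) is:

```python
import numpy as np

def genCircles(n=150, noise=0.15, seed=None):
    # Hypothetical stand-in for the genCircles helper of the earlier posts:
    # an inner cloud (class 0) around the origin and an outer ring (class 1)
    # around radius 1, separable by a circle.
    rng = np.random.default_rng(seed)
    n0, n1 = n // 2, n - n // 2
    # inner cloud: isotropic Gaussian blob at the origin
    X0 = rng.normal(scale=0.4, size=(n0, 2))
    # outer ring: uniform angles, noisy radius around 1
    ang = rng.uniform(0, 2 * np.pi, n1)
    rad = 1 + rng.normal(scale=noise, size=n1)
    X1 = np.column_stack((rad * np.cos(ang), rad * np.sin(ang)))
    XX = np.vstack((X0, X1))
    YY = np.concatenate((np.zeros(n0), np.ones(n1)))
    return XX, YY
```

Any generator with the same signature and a circular decision boundary between the two classes would work for the rest of the post.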
2. Training and testing accuracy variability
As explained in the previous post, a key point in the machine-learning field is how the accuracy of a model changes across different datasets drawn from the same distribution.
We assess the accuracy variability by repeatedly applying the full pipeline (sampling from the dataset distribution, training the model, and predicting the outcome for the training and test sets). We repeat this process for five polynomial degrees ranging from 1 to 5.
import numpy as np
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LogisticRegression

Nsim = 150
accTrainss, accTestss = [], []
degrees = np.arange(1, 6)
for dgr in degrees:
    accTrains, accTests = [], []
    for kk in range(Nsim):
        XX, YY = genCircles()
        Xtrain, Xtest, Ytrain, Ytest = train_test_split(XX, YY, test_size=0.2, random_state=42)
        scl = StandardScaler()
        pf = PolynomialFeatures(dgr, include_bias=False)
        Xtrain = scl.fit_transform(pf.fit_transform(Xtrain))
        Xtest = scl.transform(pf.transform(Xtest))  # transform only: no re-fitting on test data
        # fit the new training data with logistic regression (practically no regularization)
        lgr = LogisticRegression(C=1e5)
        lgr.fit(Xtrain, Ytrain)
        YpredTR = lgr.predict(Xtrain)
        YpredTS = lgr.predict(Xtest)
        accTrains.append(metrics.accuracy_score(Ytrain, YpredTR))
        accTests.append(metrics.accuracy_score(Ytest, YpredTS))
    accTrainss.append(accTrains)
    accTestss.append(accTests)
accuraciesTrain = np.array(accTrainss)
accuraciesTest = np.array(accTestss)
We visualize the accuracy variability of each model complexity (degree) with a boxplot.
This chart is very useful to inspect the actual statistical behaviour of the model. The median training accuracy increases drastically from the first to the second degree and keeps slightly increasing up to degree 5. On the test set, however, the second-degree case, which gives the highest values between the 25th and 75th percentiles (see the box limits for degree 2), would be the best option.
plt.figure(figsize=(10, 5))
plt.subplot(121)
plt.boxplot(accuraciesTrain.T)
plt.xlabel("Degree")
plt.ylabel("Training accuracy")
plt.ylim([0, 1.025])
plt.subplot(122)
plt.boxplot(accuraciesTest.T)
plt.xlabel("Degree")
plt.ylabel("Test accuracy")
plt.ylim([0, 1.025])
plt.show()
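The visual impression given by the boxes can be backed with numbers: the median and the 25th/75th percentiles per degree come directly from numpy. A small sketch on a toy accuracy array (the real accuraciesTest from above would be used instead; the values below are made up):

```python
import numpy as np

# toy stand-in for accuraciesTest: 3 degrees x 5 simulations
acc = np.array([[0.60, 0.62, 0.58, 0.61, 0.59],
                [0.95, 0.97, 0.96, 0.94, 0.98],
                [0.93, 0.96, 0.95, 0.92, 0.97]])
medians = np.median(acc, axis=1)                 # one median per degree
q25, q75 = np.percentile(acc, [25, 75], axis=1)  # the box limits of the boxplot
iqr = q75 - q25                                  # interquartile range per degree
```

Comparing medians together with the interquartile range mirrors exactly what the boxplot shows.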
3. Learning curves
We start with first-degree features and the logistic model without regularization.
We compute the learning curve for training-set sizes ranging from 10% to 100% of the whole training set; this set of sizes is fed as a 1D array to the learning_curve function via the train_sizes argument.
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import learning_curve

XX, YY = genCircles(250)
model = make_pipeline(PolynomialFeatures(degree=1), StandardScaler(), LogisticRegression(C=1e5))
trainSizes, trainScores, valScores = learning_curve(
    model, XX, YY, train_sizes=np.logspace(-1, 0, 20))
The figure shows the accuracy trend for the training and validation sets. Neither accuracy improves as the training set grows: the first-degree model simply lacks the capacity to separate the two clouds, no matter what training-set size is used.
plt.figure()
plt.plot(trainSizes, trainScores.mean(axis=1), label='training')
plt.plot(trainSizes, valScores.mean(axis=1), label='cross-validation')
plt.ylim([0.4, 1])
plt.legend();
We extend the same methodology to a second-degree polynomial. In this case the figure clearly shows how a larger training set helps to narrow down the training/validation accuracy gap, at the expense of a roughly 15% drop in training accuracy.
model = make_pipeline(PolynomialFeatures(degree=2), StandardScaler(), LogisticRegression(C=1e5))
trainSizes, trainScores, valScores = learning_curve(
    model, XX, YY, train_sizes=np.logspace(-1, 0, 20))
plt.figure()
plt.plot(trainSizes, trainScores.mean(axis=1), label='training')
plt.plot(trainSizes, valScores.mean(axis=1), label='cross-validation')
plt.ylim([0.4, 1])
plt.legend();
4. Validation curves
Let’s use the validation curves to select the most suitable polynomial degree, treated here as the sole hyper-parameter.
To this end, we use the validation_curve function, which runs the piped model for a set of different values of a given parameter, specified via the param_name and param_range arguments.
The plot suggests that the second-degree polynomial gives a huge boost to the model accuracy on both the training and validation sets, while more complex models do not show any further improvement.
from sklearn.model_selection import validation_curve

degrees = np.arange(1, 6)
model = make_pipeline(PolynomialFeatures(), StandardScaler(), LogisticRegression(C=1e5))
# vary the "degree" parameter of the "polynomialfeatures" pipeline step
trainScores, valScores = validation_curve(
    model, XX, YY, param_name='polynomialfeatures__degree', param_range=degrees)
# plot the mean train and validation scores across folds
plt.figure()
plt.plot(degrees, trainScores.mean(axis=1), label='training')
plt.plot(degrees, valScores.mean(axis=1), label='cross-validation')
plt.legend(loc='best');
5. Regularization with cross-validation
Rather than finding the best model complexity (degree) with the validation curves, we could fix a very high-dimensional model (degree=6) and search for the optimal regularization parameter of the logistic regression.
For every value in alphas, ranging from 1e-5 to 1e6, we create the piped model with the inverse regularization factor C=alpha and estimate the accuracy with a 3-fold (cv=3) cross-validation strategy.
The plot suggests that the best alpha value lies in the intermediate range: a too high regularization factor (low alpha) reduces the accuracy more severely (about 5%) than a too low one (about 2%).
from sklearn.model_selection import cross_val_score

alphas = np.logspace(-5, 6, 30)
scores = []
for alpha in alphas:
    model = make_pipeline(PolynomialFeatures(degree=6), StandardScaler(),
                          LogisticRegression(C=alpha))
    scores.append(cross_val_score(model, XX, YY, cv=3).mean())
plt.figure()
plt.plot(np.log10(alphas), scores)
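As an aside, scikit-learn also ships LogisticRegressionCV, which performs this cross-validated search over C internally. A minimal sketch on synthetic data (two toy Gaussian blobs standing in for the circle data, with an assumed grid of 12 C values):

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

# two well-separated toy blobs instead of the circle data
rng = np.random.default_rng(0)
X = np.vstack((rng.normal(-2, 0.5, (50, 2)),
               rng.normal(2, 0.5, (50, 2))))
y = np.concatenate((np.zeros(50), np.ones(50)))

# 3-fold CV over a grid of inverse regularization strengths
clf = LogisticRegressionCV(Cs=np.logspace(-5, 6, 12), cv=3).fit(X, y)
bestC = clf.C_  # best C found by cross-validation
```

The explicit loop in the post makes the mechanics visible, which is the point of the exercise; LogisticRegressionCV is the convenient shortcut once the idea is clear.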
6. Regularization with nested cross-validation
We now extend this approach to a nested cross-validation analysis that combines the regularization factor with the penalty term of the loss function: lasso as l1 and ridge as l2.
We also change the logistic regression solver to liblinear, which can handle the l1 penalty as well.
The plot suggests that the best alpha value lies again around 1e0=1 for the lasso penalty and around 1e1=10 for the ridge penalty.
A very high regularization factor hurts the overall accuracy much more when combined with the lasso penalty, for which it drops down to 50%.
penalties = ['l1', 'l2']
alphas = np.logspace(-5, 5, 30)
scores = []
for penalty in penalties:
    scores_ = []
    for alpha in alphas:
        model = make_pipeline(PolynomialFeatures(degree=6), StandardScaler(),
                              LogisticRegression(C=alpha, penalty=penalty, solver='liblinear'))
        scores_.append(cross_val_score(model, XX, YY, cv=3).mean())
    scores.append(scores_)
scores = np.array(scores).T
plt.figure()
plt.plot(np.log10(alphas), scores)
plt.legend(['Lasso', 'Ridge']);
7. Visualize the best model behaviour
We select the best hyper-parameters from the previous step with respect to the accuracy values stored in scores.
Since scores is a 2D array, we need the row and column indexes of its maximum value.
Numpy provides the unravel_index function to transform the index of a given element of the flattened array into a (row, col) pair.
idxs = np.unravel_index(scores.argmax(), scores.shape)
alphaOpt = alphas[idxs[0]]
penOpt = penalties[idxs[1]]
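As a sanity check, here is how unravel_index behaves on a toy 2x2 array:

```python
import numpy as np

a = np.array([[0.2, 0.9],
              [0.7, 0.4]])
# argmax returns the flat index (here 1, since a.ravel() is [0.2, 0.9, 0.7, 0.4]);
# unravel_index maps it back to the (row, col) coordinates
idx = np.unravel_index(a.argmax(), a.shape)  # -> (0, 1)
```

With idx in hand, a[idx] recovers the maximum value itself, exactly as alphas and penalties are indexed above.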
We can once again create a model instance with the two best hyper-parameters and compute the model prediction over a 2D grid.
mdlOpt = make_pipeline(PolynomialFeatures(degree=6), StandardScaler(),
                       LogisticRegression(C=alphaOpt, penalty=penOpt, solver='liblinear'))
mdlOpt.fit(XX, YY)

Npnt = 50  # number of points of the mesh
mrg = .5   # margin around the data
x1, x2 = XX[:, 0], XX[:, 1]
x1min, x1max = x1.min() - mrg, x1.max() + mrg
x2min, x2max = x2.min() - mrg, x2.max() + mrg
x1grd, x2grd = np.meshgrid(np.linspace(x1min, x1max, Npnt),
                           np.linspace(x2min, x2max, Npnt))
XXgrd = np.vstack((x1grd.ravel(), x2grd.ravel())).T
ygrd = mdlOpt.predict(XXgrd)
ygrd = ygrd.reshape(x1grd.shape)
The final model discriminates the two clouds with a quasi-circular decision boundary (red line). The remaining misclassifications are simply due to the noise, which is the only component the model cannot get rid of. A proper regularization process helps to select the best model capacity and complexity with respect to the bias and variance of the drawn data.
plt.figure(figsize=(10, 5))
# contour of the predicted class over the grid
plt.contourf(x1grd, x2grd, ygrd, cmap=plt.cm.Paired, alpha=0.4)
plt.title("Decision surface of the logistic-regression classifier")
plt.axis('tight')
# dataset
plt.scatter(XX[:, 0], XX[:, 1], c=YY, cmap='viridis')
plt.xlabel("X1")
plt.ylabel("X2")
# decision boundary
cs = plt.contour(x1grd, x2grd, ygrd, levels=[0.5], colors=('r',),
                 linestyles=('-',), linewidths=(3,));