1. Introduction

After two introductory post series (on linear and logistic regression), we dive into a crucial topic that every machine-learning practitioner should at least be aware of: model selection.

Basically, we do not want our models to learn the training data by heart and then struggle with new, unseen samples. We want them to generalize well.

We introduced the bias and variance concepts in Part1; the bias-variance dilemma, model capacity, the training/testing split practice and learning-curve analysis in Part2; cross-validation and regularization techniques in Part3; and a Python implementation of Ridge regression in Part4.

Having shown the effects of model complexity on a linear binary classification problem in the previous post, we now go a bit deeper: we inspect the model accuracy variability, apply learning and validation curves with Scikit-learn, and select the best model capacity with cross-validation under different regularization penalties.

2. Training and testing accuracy variability

One key point in machine learning is how the accuracy of a model might change across different datasets. If we sample new data from the same (noisy) distribution, is our model accuracy going to change? If so, to what extent? This question of robustness is related to what the research institute OpenAI describes as attacking ML models with adversarial examples in this interesting post.

In this toy case, we sample our dataset Nsim times to relate the accuracy variability to the noise that changes from one draw to the next. For each sample, we create data with our own function genData, split it into training and test sets with train_test_split, and scale the training set with StandardScaler.fit_transform and the test set with .transform, so that both use the same scaling parameters. We fit a logistic regression with (essentially) no regularization and collect the model predictions for both the training and test inputs, Xtrain and Xtest. We use the accuracy as metric and store the results in two arrays, one per set. We repeat this process for four polynomial degrees, ranging from 1 to 4.

import numpy as np
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
# genData and polyFeatNoInteraction are the custom helpers introduced in the previous posts

Nsim = 150
accTrainss, accTestss = [], []
degrees = np.arange(1, 5)
for dgr in degrees:
    accTrains, accTests = [], []
    for kk in range(Nsim):
        XX, YY = genData()
        Xtrain, Xtest, Ytrain, Ytest = train_test_split(XX, YY, test_size=0.2, random_state=42)
        scl = StandardScaler()
        Xtrain = scl.fit_transform(polyFeatNoInteraction(Xtrain, dgr))
        Xtest = scl.transform(polyFeatNoInteraction(Xtest, dgr))
        # fit the new training data with logistic regression (large C: essentially no regularization)
        lgr = LogisticRegression(C=1e5)
        lgr.fit(Xtrain, Ytrain)
        YpredTR = lgr.predict(Xtrain)
        YpredTS = lgr.predict(Xtest)
        accTrains.append(metrics.accuracy_score(Ytrain, YpredTR))
        accTests.append(metrics.accuracy_score(Ytest, YpredTS))
    accTrainss.append(accTrains)
    accTestss.append(accTests)

accuraciesTrain = np.array(accTrainss)
accuraciesTest = np.array(accTestss)
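
The snippet above relies on two custom helpers from the previous posts, genData and polyFeatNoInteraction. For readers landing here directly, here is a minimal sketch of what they could look like; the sample count, noise level and class shape below are illustrative assumptions, not the original definitions:

import numpy as np

def genData(Npts=250, noise=0.5, seed=None):
    # two noisy 2D classes: a linearly separable core plus Gaussian noise on the labels' boundary
    rng = np.random.RandomState(seed)
    XX = rng.uniform(-2, 2, size=(Npts, 2))
    YY = (XX[:, 0] + XX[:, 1] + noise * rng.randn(Npts) > 0).astype(int)
    return XX, YY

def polyFeatNoInteraction(X, degree):
    # powers of each feature up to `degree`, without interaction terms
    return np.hstack([X ** d for d in range(1, degree + 1)])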

We visualize the accuracy variability of each model complexity (degree) with a boxplot, a method for graphically depicting groups of numerical data through their quartiles that also indicates variability outside the upper and lower quartiles. To read this chart:

1. The horizontal orange line marks the median.
2. The bottom and top sides of the box give the 25th and 75th percentiles, respectively.
3. The box height gives the interquartile range (IQR).
4. The two horizontal ticks at the ends of the whiskers mark the lowest and highest values that still lie within $1.5\cdot IQR$ of the 25th and 75th percentiles, respectively.
5. The dots, if any, mark the outliers that fall beyond the whiskers.

You can read this post for further details.
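
To make the whisker definition concrete, the quartiles and whisker limits can be checked numerically; a quick illustration on the degree-1 test accuracies (not part of the original analysis):

acc = accuraciesTest[0]            # test accuracies of the degree-1 model
q1, med, q3 = np.percentile(acc, [25, 50, 75])
iqr = q3 - q1
lowWhisker = acc[acc >= q1 - 1.5 * iqr].min()   # lowest value within 1.5*IQR below Q1
highWhisker = acc[acc <= q3 + 1.5 * iqr].max()  # highest value within 1.5*IQR above Q3
print(med, iqr, lowWhisker, highWhisker)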

This chart is very useful to go beyond a comparison of average accuracies and inspect the actual statistical behaviour of the model. In this case, we can see that the median accuracy barely changes, either across degree values or from the training to the test set. The test-accuracy boxes, however, are much more stretched, with some outliers spanning a wide range (92% to 100%).

plt.figure(figsize=(10, 5))
# training-set accuracies (left) and test-set accuracies (right), one box per polynomial degree
plt.subplot(121)
plt.boxplot(accuraciesTrain.T)
plt.xlabel("degree")
plt.ylabel("accuracy")
plt.ylim([0.9, 1.025])
plt.subplot(122)
plt.boxplot(accuraciesTest.T)
plt.xlabel("degree")
plt.ylabel("accuracy")
plt.ylim([0.9, 1.025])

plt.show()

[Figure: boxplots of the training (left) and test (right) accuracies for polynomial degrees 1 to 4]

3. Learning curves

We apply the learning-curve analysis to this case. To make the code much more compact, we exploit two handy Scikit-learn features: make_pipeline and learning_curve.

The former packs a sequence of processing steps into a single model instance; in our case, we stack the polynomial transformation, the scaling and the logistic model. We then use the latter to train and test the model on a given dataset for increasing training-set sizes. We specify the sizes as a 1D array through the train_sizes parameter. We use the relative definition here, so the train_sizes values range from 10% to 100% of the whole training set. One of the outputs, trainSizes, stores the actual set sizes.

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import learning_curve

XX, YY = genData()
x1, x2 = XX[:, 0], XX[:, 1]
# pipeline: polynomial expansion (default degree=2), scaling, (essentially) unregularized logistic regression
model = make_pipeline(PolynomialFeatures(), StandardScaler(), LogisticRegression(C=1e5))
# shuffle the samples before the internal cross-validation splits of learning_curve
XY = np.c_[XX, YY.reshape(-1, 1)]
np.random.shuffle(XY)
XX, YY = XY[:, :2], XY[:, 2]
trainSizes, trainScores, valScores = learning_curve(model, XX, YY, train_sizes=np.logspace(-1, 0, 20))
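
The scores come back with one row per requested training-set size and one column per cross-validation fold (five by default in recent Scikit-learn versions), which is why the plots below average over axis=1. A quick shape check, if desired:

# one row per requested training-set size, one column per CV fold
print(trainSizes.shape, trainScores.shape, valScores.shape)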

The figure shows the accuracy trend for the training and validation sets. The validation accuracy keeps increasing as more data are used, but the overall accuracy span remains fairly narrow.

plt.figure()
plt.plot(trainSizes, trainScores.mean(axis=1), label='training')
plt.plot(trainSizes, valScores.mean(axis=1), label='cross-validation')
plt.legend();

[Figure: learning curves (training and cross-validation accuracy vs. training-set size) for the default degree-2 model]

We extend this methodology to a third-degree polynomial function.

model = make_pipeline(PolynomialFeatures(degree=3), StandardScaler(), LogisticRegression(C=1e5))
trainSizes, trainScores, valScores = learning_curve(model, XX, YY, train_sizes=np.logspace(-1, 0, 20))

In this case, the figure clearly highlights how a larger training set narrows the training/validation accuracy gap, at the expense of a 3-4% drop in training accuracy.

plt.figure()
plt.plot(trainSizes, trainScores.mean(axis=1), label='training')
plt.plot(trainSizes, valScores.mean(axis=1), label='cross-validation')
plt.legend();

[Figure: learning curves for the degree-3 polynomial model]

4. Validation curves

If the learning curves help us find a good training-set size, the validation curves are useful to select the most suitable hyper-parameter (or set of hyper-parameters). To this end, we use the validation_curve method, which runs the piped model for a set of different parameter values, degrees.

from sklearn.model_selection import validation_curve

degrees = np.arange(1, 6)
model = make_pipeline(PolynomialFeatures(), StandardScaler(), LogisticRegression(C=1e5))
# vary the "degree" of the pipeline step "polynomialfeatures"
trainScores, valScores = validation_curve(model, XX, YY, param_name='polynomialfeatures__degree', param_range=degrees)
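
The param_name string follows Scikit-learn's step-name__parameter convention, where make_pipeline names each step after its lowercased class. If in doubt, the available names can be listed directly (a quick check, not part of the original post):

# step names assigned by make_pipeline: 'polynomialfeatures', 'standardscaler', 'logisticregression'
print(model.named_steps.keys())
# all tunable parameter names, e.g. 'polynomialfeatures__degree', 'logisticregression__C'
print(sorted(model.get_params().keys()))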

The plot suggests that a first- or second-degree polynomial function is a good compromise to reduce the error gap.

# Plot the mean train score and validation score across folds
plt.figure()
plt.plot(degrees, trainScores.mean(axis=1), label='training')
plt.plot(degrees, valScores.mean(axis=1), label='cross-validation')
plt.legend(loc='best');

[Figure: validation curves (training and cross-validation accuracy vs. polynomial degree)]
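
Rather than reading the best degree off the chart, it can also be picked programmatically; a small convenience snippet on top of the arrays above:

# degree with the highest mean cross-validation accuracy
bestDegree = degrees[valScores.mean(axis=1).argmax()]
print(bestDegree)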

5. Regularization with cross-validation

Rather than finding the best model complexity (degree) with the validation curves, we could fix a very high-dimensional model (degree=6) and search for the optimal regularization parameter of the logistic regression, C. For every value in alphas, we create the piped model with the inverse regularization factor C=alpha and estimate the accuracy with a 3-fold (cv=3) cross-validation strategy.

from sklearn.model_selection import cross_val_score

alphas = np.logspace(-3, 5, 20)
scores = []
for alpha in alphas:
    model = make_pipeline(PolynomialFeatures(degree=6), StandardScaler(), LogisticRegression(C=alpha))
    scores.append(cross_val_score(model, XX, YY, cv=3).mean())
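
As a side note, the same sweep could be written more compactly with GridSearchCV, which runs the loop and the cross-validation in one go; a sketch under the same settings, not the code used for the plot below:

from sklearn.model_selection import GridSearchCV

pipe = make_pipeline(PolynomialFeatures(degree=6), StandardScaler(),
                     LogisticRegression(C=1.0))
grid = GridSearchCV(pipe, param_grid={'logisticregression__C': alphas}, cv=3)
grid.fit(XX, YY)
print(grid.best_params_, grid.best_score_)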

The plot suggests that the best alpha value lies around $10^{-1} = 0.1$. A too-high regularization factor (low alpha) hurts the accuracy much more severely than a too-low one, yet a good tradeoff still yields about 1% higher validation accuracy and a more robust model.

plt.figure()
plt.plot(np.log10(alphas), scores)

[Figure: mean 3-fold cross-validation accuracy vs. log10(alpha) for the degree-6 model]

6. Regularization with nested cross-validation

We now extend this approach by nesting a second loop over the penalty type, combining the regularization factor with the penalty definition: l1, as in the lasso, and l2, as in ridge. We change the logistic regression solver to liblinear so that it can handle the l1 penalty as well.

penalties = ['l1', 'l2']
alphas = np.logspace(-3, 5, 20)
scores = []
for penalty in penalties:
    scores_ = []
    for alpha in alphas:
        model = make_pipeline(PolynomialFeatures(degree=6), StandardScaler(),
                              LogisticRegression(C=alpha, penalty=penalty, solver='liblinear'))
        scores_.append(cross_val_score(model, XX, YY, cv=3).mean())
    scores.append(scores_)
scores = np.array(scores)
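
Once the scores matrix is filled, the best (penalty, alpha) combination can be read off directly; a small convenience check on top of the loop above:

# scores has shape (len(penalties), len(alphas))
best = np.unravel_index(scores.argmax(), scores.shape)
print(penalties[best[0]], alphas[best[1]], scores[best])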

The plot suggests that the best alpha value again lies around $10^{-1} = 0.1$ for both penalty definitions. As before, a too-high regularization factor (low alpha) hurts the accuracy much more severely than a too-low one, while a good tradeoff yields about 1% higher validation accuracy and a more robust model.

plt.figure()
plt.plot(np.log10(alphas), scores.T)
plt.legend(['Lasso', 'Ridge'])
plt.ylim([0.7, 1]);

[Figure: mean 3-fold cross-validation accuracy vs. log10(alpha) for the l1 (lasso) and l2 (ridge) penalties]