1. Introduction

After two introductory series of posts (on linear and logistic regression), we dive into a crucial topic that every machine-learning practitioner should be at least aware of: model selection.

Basically, we do not want our models to learn the data by heart and then struggle with new, unseen samples. We want them to generalize well.

We introduced the concepts of bias and variance in Part 1; the bias-variance dilemma, model capacity, the training/testing split practice and learning-curve analysis in Part 2. We moved to cross-validation and regularization techniques in Part 3, and implemented ridge regression in Python in Part 4.

We now apply the same workflow of the model-selection mini-series for a linear binary classification problem (first and second posts) to a non-linear case, where the model needs to separate two circular clouds with a circle.

Let’s inspect the variability of the model accuracy, apply learning and validation curves with Scikit-learn, and select the best model capacity via cross-validation with different regularization loss definitions.

2. Training and testing accuracy variability

As we have explained in this post, one key point in the machine-learning field is how the accuracy of a model may change across different datasets.

We assess the accuracy variability by applying the pipeline (sampling from the dataset distribution, training the model, and predicting the outcome for the training and test sets) Nsim times. We repeat this process for five polynomial degrees, ranging from 1 to 5.
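The genCircles helper that draws the two circular clouds was introduced earlier in the series and is not repeated here. A plausible minimal sketch, assuming it wraps Scikit-learn's make_circles generator (the signature and parameter names below are hypothetical):

```python
import numpy as np
from sklearn.datasets import make_circles

def genCircles(Npnt=100, noise=0.1, seed=None):
    """Sample two concentric circular clouds with Gaussian noise.

    Hypothetical stand-in for the helper defined in the earlier posts.
    """
    XX, YY = make_circles(n_samples=Npnt, noise=noise, factor=0.5,
                          random_state=seed)
    return XX, YY
```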

import numpy as np
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (train_test_split, learning_curve,
                                     validation_curve, cross_val_score)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

Nsim = 150
accTrainss, accTestss = [], []
degrees = np.arange(1, 6)
for dgr in degrees:
    accTrains, accTests = [], []
    for kk in range(Nsim):
        XX, YY = genCircles()  # dataset generator from the previous posts
        Xtrain, Xtest, Ytrain, Ytest = train_test_split(XX, YY, test_size=0.2, random_state=42)
        scl = StandardScaler()
        pf = PolynomialFeatures(dgr, include_bias=False)
        Xtrain = scl.fit_transform(pf.fit_transform(Xtrain))
        Xtest = scl.transform(pf.transform(Xtest))  # transform only: never refit on test data
        # fit new training data with logistic regression
        lgr = LogisticRegression(C=1e5)
        lgr.fit(Xtrain, Ytrain)
        YpredTR = lgr.predict(Xtrain)
        YpredTS = lgr.predict(Xtest)
        accTrains.append(metrics.accuracy_score(Ytrain, YpredTR))
        accTests.append(metrics.accuracy_score(Ytest, YpredTS))
    accTrainss.append(accTrains)
    accTestss.append(accTests)

accuraciesTrain = np.array(accTrainss)
accuraciesTest = np.array(accTestss)

We visualize the accuracy variability of each model complexity (degree) with a boxplot.

This chart is very useful to inspect the actual statistical behaviour of the model. In this case, the median training accuracy jumps sharply from the first to the second degree and keeps increasing slightly up to degree 5. On the test side, however, the second-degree case, whose box (25th to 75th percentiles) reaches the highest values, would be the best option.

plt.figure(figsize=(10, 5))
plt.subplot(121)
plt.boxplot(accuraciesTrain.T)
plt.title("Training")
plt.xlabel("Polynomial degree")
plt.ylabel("Accuracy")
plt.ylim([0, 1.025])
plt.subplot(122)
plt.boxplot(accuraciesTest.T)
plt.title("Test")
plt.xlabel("Polynomial degree")
plt.ylabel("Accuracy")
plt.ylim([0, 1.025])

plt.show()

*Figure: boxplots of training (left) and test (right) accuracy per polynomial degree.*

3. Learning curves

We apply the learning-curve analysis to this case, feeding a piped model to the learning_curve function.

We use first-degree inputs and the logistic model without regularization. We build the learning curve for training-set sizes ranging from 10% to 100% of the whole training set, fed to learning_curve as a 1D array via the train_sizes parameter.

XX, YY = genCircles(250)
model = make_pipeline(PolynomialFeatures(degree=1), StandardScaler(), LogisticRegression(C=1e5))
trainSizes, trainScores, valScores = learning_curve(model, XX, YY, train_sizes=np.logspace(-1, 0, 20))

The figure shows the accuracy trend for the training and validation sets. Both accuracies plateau and do not improve, no matter what training-set size is used: the model capacity, not the amount of data, is the limiting factor.

plt.figure()
plt.plot(trainSizes, trainScores.mean(axis=1), label='training')
plt.plot(trainSizes, valScores.mean(axis=1), label='cross-validation')
plt.ylim([0.4, 1])
plt.legend();

*Figure: learning curves (training vs cross-validation accuracy) for the degree-1 model.*

We extend this methodology to a second-degree polynomial. The figure clearly highlights how, in this case, a larger training set narrows the training/validation accuracy gap, at the expense of a roughly 15% drop in training accuracy.

model = make_pipeline(PolynomialFeatures(degree=2), StandardScaler(), LogisticRegression(C=1e5))
trainSizes, trainScores, valScores = learning_curve(model, XX, YY, train_sizes=np.logspace(-1, 0, 20))

plt.figure()
plt.plot(trainSizes, trainScores.mean(axis=1), label='training')
plt.plot(trainSizes, valScores.mean(axis=1), label='cross-validation')
plt.ylim([0.4, 1])
plt.legend();

*Figure: learning curves (training vs cross-validation accuracy) for the degree-2 model.*

4. Validation curves

Let’s use validation curves to select the most suitable polynomial degree, here the sole hyper-parameter. To this end, we use the validation_curve function, which runs the piped model for a set of different values of the parameter, degrees.

The plot suggests that the second-degree polynomial gives a huge boost to the model accuracy on both training and validation sets, while more complex models do not show any further improvement.

degrees = np.arange(1, 6)
model = make_pipeline(PolynomialFeatures(), StandardScaler(), LogisticRegression(C=1e5))
# Vary the "degrees" on the pipeline step "polynomialfeatures"
trainScores, valScores = validation_curve(model, XX, YY, param_name='polynomialfeatures__degree', param_range=degrees)
# Plot the mean train score and validation score across folds
plt.figure()
plt.plot(degrees, trainScores.mean(axis=1), label='training')
plt.plot(degrees, valScores.mean(axis=1), label='cross-validation')
plt.legend(loc='best');

*Figure: validation curves (training vs cross-validation accuracy) over the polynomial degree.*

5. Regularization with cross-validation

Rather than finding the best polynomial degree with validation curves, we can fix a high-capacity model (degree=6) and search for the optimal regularization parameter of the logistic regression, C. For every value in alphas, ranging from 1e-5 to 1e6, we create the piped model with the inverse regularization factor C=alpha and estimate the accuracy with a 3-fold (cv=3) cross-validation strategy.

The plot suggests that the best alpha value lies between 1e-1=0.1 and 1e0=1. Too strong a regularization (low alpha) reduces the accuracy more severely (about 5%) than too weak a regularization (about 2%).

alphas = np.logspace(-5, 6, 30)
scores = []
for alpha in alphas:
    model = make_pipeline(PolynomialFeatures(degree=6), StandardScaler(), LogisticRegression(C=alpha))
    scores.append(cross_val_score(model, XX, YY, cv=3).mean())

plt.figure()
plt.plot(np.log10(alphas), scores)

*Figure: cross-validated accuracy vs log10(alpha).*

6. Regularization with nested cross-validation

We now extend this approach to a nested cross-validation analysis that combines the regularization factor with the penalty term of the loss function: lasso (l1) and ridge (l2). We switch the logistic-regression solver to liblinear so that it can handle the l1 penalty as well.

The plot suggests that the best alpha value lies again between 1e-1=0.1 and 1e0=1 for the lasso penalty, and between 1e-1=0.1 and 1e1=10 for the ridge penalty. A very strong regularization hurts the overall accuracy much more when combined with the lasso penalty, where it drops down to 50%.

penalties = ['l1', 'l2']
alphas = np.logspace(-5, 5, 30)
scores = []
for penalty in penalties:
    scores_ = []
    for alpha in alphas:
        model = make_pipeline(PolynomialFeatures(degree=6), StandardScaler(),
                              LogisticRegression(C=alpha, penalty=penalty, solver='liblinear'))
        scores_.append(cross_val_score(model, XX, YY, cv=3).mean())
    scores.append(scores_)
scores = np.array(scores).T

plt.figure()
plt.plot(np.log10(alphas), scores)
plt.legend(['Lasso', 'Ridge']);

*Figure: cross-validated accuracy vs log10(alpha) for the lasso (l1) and ridge (l2) penalties.*
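As an aside, the same grid over penalties and alphas could also be explored with Scikit-learn's GridSearchCV, which cross-validates every combination and picks the best one for us. A minimal sketch, using make_circles as a stand-in for the genCircles helper from the earlier posts:

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# stand-in for genCircles: two concentric circular clouds
XX, YY = make_circles(n_samples=250, noise=0.1, factor=0.5, random_state=0)

model = make_pipeline(PolynomialFeatures(degree=6), StandardScaler(),
                      LogisticRegression(solver='liblinear'))
# step names in a make_pipeline are the lowercased class names
param_grid = {'logisticregression__C': np.logspace(-5, 5, 10),
              'logisticregression__penalty': ['l1', 'l2']}
search = GridSearchCV(model, param_grid, cv=3)
search.fit(XX, YY)
print(search.best_params_, search.best_score_)
```

The best_params_ attribute then plays the role of the manual argmax selection over the scores array.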

7. Visualize the best model behaviour

We select the best hyper-parameters from the previous step with respect to the accuracy values stored in scores. Since it is a 2D array, we need the row and column indexes of the maximum value. Numpy provides the unravel_index function to transform the index of a given element of the flattened array scores into a (row, col) index.

idxs = np.unravel_index(scores.argmax(), scores.shape)
alphaOpt = alphas[idxs[0]]
penOpt = penalties[idxs[1]]
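As a quick sanity check of the indexing, here is unravel_index on a toy 2-by-2 array:

```python
import numpy as np

a = np.array([[0.1, 0.9],
              [0.4, 0.2]])
# the flattened array is [0.1, 0.9, 0.4, 0.2], so argmax() returns 1
idx = np.unravel_index(a.argmax(), a.shape)
print(idx)  # (0, 1): first row, second column
```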

We can once again create a model instance with the two best hyper-parameters and compute the model prediction over a 2D grid.

mdlOpt = make_pipeline(PolynomialFeatures(degree=6), StandardScaler(),
                       LogisticRegression(C=alphaOpt, penalty=penOpt, solver='liblinear'))
mdlOpt.fit(XX, YY);

Npnt = 50  # number of points of the mesh
mrg = .5
x1, x2 = XX[:, 0], XX[:, 1]
x1min, x1max = x1.min() - mrg, x1.max() + mrg
x2min, x2max = x2.min() - mrg, x2.max() + mrg
x1grd, x2grd = np.meshgrid(np.linspace(x1min, x1max, Npnt), np.linspace(x2min, x2max, Npnt))

XXgrd = np.vstack((x1grd.ravel(), x2grd.ravel())).T
ygrd = mdlOpt.predict(XXgrd)
ygrd = ygrd.reshape(x1grd.shape)

The final model is able to discriminate the two clouds with a quasi-circular decision boundary (red line). The residual model loss is simply due to the noise, which is the only term the model cannot get rid of. A proper regularization process helps to select the best model capacity and complexity with respect to the bias and variance of the drawn data.

plt.figure(figsize=(10, 5))
# contour
plt.contourf(x1grd, x2grd, ygrd, cmap=plt.cm.Paired, alpha=0.4)
plt.title("Decision surface of binary classification with logistic regression")
plt.axis('tight')
# dataset
plt.scatter(XX[:,0], XX[:,1], c=YY, cmap='viridis')
plt.xlabel("X1")
plt.ylabel("X2")
# decision boundary
cs = plt.contour(x1grd, x2grd, ygrd, levels = [0.5], colors=('r',), linestyles=('-',),linewidths=(3,));

*Figure: dataset, decision surface and quasi-circular decision boundary of the best model.*