1. Introduction

After two introductory post series (on linear and logistic regression), we dive into a crucial topic that every machine-learning practitioner should be at least aware of: model selection.

Basically, we do not want our models to learn the training data by heart and then struggle with new, unseen samples. We want them to be great at generalizing.

We introduced the bias and variance concepts in Part1; the bias-variance dilemma, model capacity, the training/testing split practice and learning-curve analysis in Part2; cross-validation and regularization techniques in Part3; and a Python implementation of the Ridge regression in Part4.

Here we apply the same process to a linear binary classification problem. In this post we tackle the basics, comparing a simple model with a more complex one; we go a bit deeper in the next post.

2. Dataset

We start by generating a synthetic dataset for a binary classification problem. You can have a look at this post for other examples and ways to generate some artificial datasets to play with.

Here we create two multivariate normal distributions in a 2D domain with the multivariate_normal function.

The covariance matrix cov is fixed and shared by both clouds. It squeezes and rotates each point cloud into an elliptical shape. The two distributions get different centres, mean, so that the clouds overlap slightly but can still be separated by a line. We concatenate the x- and y-coordinates of the two clouds to obtain a single array per dimension, and horizontally stack them with np.c_[] to get the final 2D input array XX.

The model output is the class of the cloud, either 0 for purple data points or 1 for yellow points.

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)

def genData(Npnt=1000):
    # first cloud, class 0: elongated Gaussian centred in (2, 1)
    dt1 = np.random.multivariate_normal(mean=np.array([2, 1]), cov=np.array([[1, 1], [1, 3]]), size=int(Npnt/2))
    xx1, yy1 = dt1[:, 0], dt1[:, 1]
    zz1 = np.zeros_like(xx1)
    # second cloud, class 1: same covariance, centred in (-1, 1)
    dt2 = np.random.multivariate_normal(mean=np.array([-1, 1]), cov=np.array([[1, 1], [1, 3]]), size=int(Npnt/2))
    xx2, yy2 = dt2[:, 0], dt2[:, 1]
    zz2 = np.ones_like(xx2)
    # concatenate the two clouds and stack the coordinates column-wise
    xx = np.concatenate([xx1, xx2])
    yy = np.concatenate([yy1, yy2])
    zz = np.concatenate([zz1, zz2])
    return np.c_[xx, yy], zz

XX, YY = genData()
x1, x2 = XX[:,0], XX[:,1]

plt.figure()
plt.scatter(x1, x2, c=YY, alpha=0.5)
plt.axis('equal')
plt.tight_layout();

Figure: scatter plot of the two generated clouds (purple = class 0, yellow = class 1).

3. Linear classification problem

3.1 Low-degree model

We start classifying our data with a low-degree model:

$$ y = \sigma\big( X\cdot W \big), \quad X = [x_1, x_2, \mathbf{1}] $$

where the dimension of the weight matrix $W$ is (3, 1) and $X$ (X in the code) is the horizontal concatenation of the two coordinates and the bias column of ones $\mathbf{1}$. We train a logistic regression with effectively no regularization by setting C, the inverse of the regularization strength, to a very high value.

from sklearn.linear_model import LogisticRegression

# plain logistic regression on the two raw features;
# a very large C effectively disables the regularization
lgr = LogisticRegression(C=1e5)
lgr.fit(XX, YY)
Ypred = lgr.predict(XX)
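Since this model is linear in its features, the decision boundary can also be read directly off the fitted parameters: $\sigma(z) = 0.5$ exactly when $z = w_1 x_1 + w_2 x_2 + b = 0$, which is a straight line in the $(x_1, x_2)$ plane. A minimal sketch using the coef_ and intercept_ attributes of the fitted estimator (the variable names below are introduced here only for this check):

# the fitted weights and bias: coef_ has shape (1, 2), intercept_ has shape (1,)
w1, w2 = lgr.coef_[0]
b = lgr.intercept_[0]
# sigma(z) = 0.5  <=>  z = w1*x1 + w2*x2 + b = 0, solved for x2
x1line = np.linspace(x1.min(), x1.max(), 100)
x2line = -(w1 * x1line + b) / w2

Below we instead trace the boundary numerically on a grid, an approach that carries over unchanged to the higher-degree model of the next section.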

We want to explore the trained model's behaviour across the whole 2D domain. We use the meshgrid function to create a 2D grid from two 1D arrays, ravel the two 2D coordinate arrays x1grd and x2grd, vertically stack them into a (2, Npnt*Npnt) array and transpose it to the (_, 2) structure that the predict function expects, since the model has learnt two weights for the two features. The final step is to reshape the model prediction to the 2D grid shape so it can be visualized as a contour plot.

Be aware that the bias is handled internally by the model, so there is no need to append a column of 1s to the XX array.

Npnt = 50  # number of points per axis of the mesh
mrg = .5
x1min, x1max = x1.min() - mrg, x1.max() + mrg
x2min, x2max = x2.min() - mrg, x2.max() + mrg
x1grd, x2grd = np.meshgrid(np.linspace(x1min, x1max, Npnt), np.linspace(x2min, x2max, Npnt))
ygrd = lgr.predict(np.vstack((x1grd.ravel(), x2grd.ravel())).T)
ygrd = ygrd.reshape(x1grd.shape)
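As a side note, the same (Npnt*Npnt, 2) grid input can be built in one step with the np.c_[] helper already used in genData; this is only an equivalent formulation and the vstack version above is the one used in the rest of the post:

XXgrd_alt = np.c_[x1grd.ravel(), x2grd.ravel()]  # same shape and content as np.vstack((...)).T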

The figure shows the dataset points from the two classes with yellow and purple colours, the contour plot of the model outcome for the 2D grid and the decision boundary as a red line.

The boundary itself can be drawn with the contour function, by setting the levels argument to the value we want to highlight. Since ygrd holds the predicted classes (0 or 1), the 0.5 level falls exactly between the two predicted regions, i.e. where the logistic regression shifts from one class to the other as the predicted probability crosses 0.5.

plt.figure(figsize=(10, 5))
# contour
plt.contourf(x1grd, x2grd, ygrd, cmap=plt.cm.Paired, alpha=0.4)
plt.title("Decision surface of Logistic Regression")
plt.axis('tight')
# dataset
plt.scatter(x1, x2, c=YY, cmap='viridis')
plt.xlabel("X1")
plt.ylabel("X2")
# decision boundary
cs = plt.contour(x1grd, x2grd, ygrd, levels=[0.5], colors='r', linestyles='-', linewidths=3);
#plt.axis('equal')
plt.tight_layout()

Figure: decision surface of the logistic regression with linear features; the red line marks the decision boundary.

3.2 High-degree model

We repeat the same classification procedure with a third-degree model:

$$ y = \sigma\big( X\cdot W \big) $$

$$ X = [x_1^3, x_2^3, x_1^2, x_2^2, x_1, x_2, \mathbf{1}] $$

where the dimension of the weight matrix $W$ is (7, 1) and $X$ (X in the code) is the horizontal concatenation of the six polynomial terms and the bias column of ones $\mathbf{1}$.

We still train a logistic regression with no regularization.

The PolynomialFeatures class lets us produce either the full set of polynomial terms (including interactions) or the interaction features only (products of at most degree distinct input features, see the interaction_only description), but not the pure powers of each feature without the cross terms, so we build our own function that keeps only the polynomial terms with no interaction.

def polyFeatNoInteraction(XX, dgr):
    # stack the plain powers of each feature: [XX, XX**2, ..., XX**dgr]
    # (a list is needed here: np.hstack does not accept a generator)
    return np.hstack([XX**(ii+1) for ii in range(dgr)])
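As a quick sanity check on a hypothetical toy input (introduced here just for illustration), the function stacks the plain powers of each column, so a 2-feature input becomes [x1, x2, x1^2, x2^2, x1^3, x2^3]; the column order differs from the formula above, which is irrelevant for the fit. PolynomialFeatures with the same degree would also produce the cross terms, as its feature names show (get_feature_names_out is available in recent scikit-learn versions):

XXtoy = np.array([[2., 3.]])
print(polyFeatNoInteraction(XXtoy, 3))
# [[ 2.  3.  4.  9.  8. 27.]]  i.e. x1, x2, x1^2, x2^2, x1^3, x2^3

from sklearn.preprocessing import PolynomialFeatures
pf3 = PolynomialFeatures(3, include_bias=False).fit(XXtoy)
print(pf3.get_feature_names_out())  # also lists 'x0 x1', 'x0^2 x1', ... interaction terms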

We transform the linear input, scale and feed it into the logistic model.

from sklearn.preprocessing import StandardScaler

dgr = 3
scl = StandardScaler()
# degree-3 features without interaction terms, standardized before the fit
XXnl = polyFeatNoInteraction(XX, dgr)
XXnl = scl.fit_transform(XXnl)
# logistic regression on the polynomial features, regularization again effectively disabled
lgr = LogisticRegression(C=1e5)
lgr.fit(XXnl, YY)
Ypred = lgr.predict(XXnl)

We reuse the earlier-defined 2D meshgrid to build the grid input array. Before feeding it to the predict function, we need to apply the same polynomial transformation and scaling to the grid data as well.

XXgrd = np.vstack((x1grd.ravel(), x2grd.ravel())).T
ygrd = lgr.predict(scl.transform(polyFeatNoInteraction(XXgrd, dgr)))
ygrd = ygrd.reshape(x1grd.shape)

The figure shows the same comparison as before. The only difference is the decision boundary between the two classes: it is much more sensitive to the data points at the extremes of the domain, while it cannot do any better in the overlapping region.

plt.figure(figsize=(10, 5))
# contour
plt.contourf(x1grd, x2grd, ygrd, cmap=plt.cm.Paired, alpha=0.4)
plt.title("Decision surface of Multinomial classification with Logistic Regression")
plt.axis('tight')
# dataset
plt.scatter(x1, x2, c=YY, cmap='viridis')
plt.xlabel("X1")
plt.ylabel("X2")
# decision boundary
cs = plt.contour(x1grd, x2grd, ygrd, levels=[0.5], colors='r', linestyles='-', linewidths=3);

Figure: decision surface of the logistic regression with degree-3 polynomial features; the red line marks the decision boundary.
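To put a rough number on this qualitative comparison, here is a minimal sketch of a hold-out evaluation in the spirit of the training/testing split recalled in the introduction (the names lgr1, lgr3 and scl3 are introduced here only for this check):

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

XXtr, XXte, YYtr, YYte = train_test_split(XX, YY, test_size=0.3, random_state=0)

# low-degree model on the raw coordinates
lgr1 = LogisticRegression(C=1e5).fit(XXtr, YYtr)
print(accuracy_score(YYte, lgr1.predict(XXte)))

# high-degree model on the scaled degree-3 polynomial features
scl3 = StandardScaler().fit(polyFeatNoInteraction(XXtr, dgr))
lgr3 = LogisticRegression(C=1e5).fit(scl3.transform(polyFeatNoInteraction(XXtr, dgr)), YYtr)
print(accuracy_score(YYte, lgr3.predict(scl3.transform(polyFeatNoInteraction(XXte, dgr)))))

On such a mildly overlapping dataset the two test accuracies are expected to end up close to each other: the wigglier boundary of the high-degree model buys little on unseen data, which is the kind of model-selection question the next post digs into.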