In short, we do not want our models to learn the training data by heart and then struggle with new, unseen samples. We want them to be great at generalizing.
We introduced the concepts of bias and variance in Part 1; the bias-variance dilemma, model capacity, the train/test split practice and learning-curve analysis in Part 2; cross-validation and regularization techniques in Part 3; and a Python implementation of Ridge regression in Part 4.
We now apply the same process to a linear binary classification problem. In this post we tackle the basics, comparing a simple model with a more complex one; we go a bit deeper in the next post.
We start by generating a synthetic dataset for a binary classification problem. You can have a look at this post for other examples and ways to generate some artificial datasets to play with.
Here we create two multivariate normal distributions in a 2D domain with the `np.random.multivariate_normal` function.
The covariance matrix `cov` is fixed and shared by the two distributions: it squeezes and rotates each cloud of data points into an elliptical shape.
We use two different centres for the distributions, `(2, 1)` and `(-1, 1)`.
In this way, the two clouds have a narrow overlap but they can still be separated with a line.
We combine the x- and y-coordinates of the two clouds with `concatenate` to get a single array for each dimension.
We horizontally stack the two arrays with `np.c_` to get the final 2D input array `XX`.
The model output is the class of the cloud, either 0 for the purple data points or 1 for the yellow ones.
```python
import numpy as np
import matplotlib.pyplot as plt

def genData(Npnt=1000):
    dt1 = np.random.multivariate_normal(mean=np.array([2, 1]),
                                        cov=np.array([[1, 1], [1, 3]]),
                                        size=int(Npnt/2))
    xx1, yy1 = dt1[:, 0], dt1[:, 1]
    zz1 = np.zeros_like(xx1)
    dt2 = np.random.multivariate_normal(mean=np.array([-1, 1]),
                                        cov=np.array([[1, 1], [1, 3]]),
                                        size=int(Npnt/2))
    xx2, yy2 = dt2[:, 0], dt2[:, 1]
    zz2 = np.ones_like(xx2)
    xx = np.concatenate([xx1, xx2])
    yy = np.concatenate([yy1, yy2])
    zz = np.concatenate([zz1, zz2])
    return np.c_[xx, yy], zz

XX, YY = genData()
x1, x2 = XX[:, 0], XX[:, 1]

plt.figure()
plt.scatter(x1, x2, c=YY, alpha=0.5)
plt.axis('equal')
plt.tight_layout();
```
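To make the "squeeze and rotate" effect of the covariance matrix concrete, here is a small standalone sketch (my own addition, not part of the original post): the eigenvectors of `cov` give the directions of the ellipse axes and the eigenvalues the variances along them.

```python
import numpy as np

# the shared covariance matrix used for both clouds
cov = np.array([[1.0, 1.0], [1.0, 3.0]])

# eigenvalues = variances along the ellipse axes (2 - sqrt(2), 2 + sqrt(2)),
# eigenvectors = the rotated axis directions
evals, evecs = np.linalg.eigh(cov)
print(np.round(evals, 3))  # one squeezed axis, one stretched axis
```

The off-diagonal 1s are what tilt the ellipse: with a diagonal covariance the axes would stay aligned with the coordinate axes.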
3. Linear classification problem
3.1 Low-degree model
We start classifying our data with a low-degree model:
$$ y = \sigma\big( X\cdot W \big) \quad X = [x_1, x_2, \mathbf{1}] $$
where the weight matrix $W$ has dimension `(3, 1)` and $X$ (`XX` in the code) is the horizontal concatenation of the two coordinates and a bias column of ones.
We train a logistic regression with no regularization by setting a very high value for the `C` factor, which is the inverse of the regularization strength.
```python
from sklearn.linear_model import LogisticRegression

lgr = LogisticRegression(C=1e5)  # very high C: we want to ignore regularization
lgr.fit(XX, YY)
Ypred = lgr.predict(XX)
```
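As a sanity check on the $\sigma(X\cdot W)$ formulation, the sketch below (my own, on toy data rather than the post's dataset) verifies that the model's probability for class 1 is exactly the sigmoid of its linear score, with the bias stored separately in `intercept_`.

```python
import numpy as np
from scipy.special import expit  # the logistic sigmoid
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
Xtoy = rng.normal(size=(100, 2))
ytoy = (Xtoy[:, 0] + Xtoy[:, 1] > 0).astype(int)

lgr_toy = LogisticRegression(C=1e5).fit(Xtoy, ytoy)

# sigma(X . W + b): coef_ holds the two weights, intercept_ the bias
scores = Xtoy @ lgr_toy.coef_.ravel() + lgr_toy.intercept_
assert np.allclose(expit(scores), lgr_toy.predict_proba(Xtoy)[:, 1])
```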
We want to explore the trained model's behaviour across the whole 2D domain.
We use the `meshgrid` function to create a 2D grid from two 1D arrays, ravel the two 2D coordinate arrays `x1grd` and `x2grd`, vertically stack them into a `(2, Npnt*Npnt)` array and transpose it to the structure the `predict` function expects, `(Npnt*Npnt, 2)`, since the model has learnt two weights for the two features.
The final step is to reshape the model prediction to the 2D grid shape to visualize it as a contour plot.
Be aware that the bias is handled internally by the model, so there is no need to feed a column of ones together with the two features.
```python
Npnt = 50  # number of points of the mesh
mrg = .5
x1min, x1max = x1.min() - mrg, x1.max() + mrg
x2min, x2max = x2.min() - mrg, x2.max() + mrg
x1grd, x2grd = np.meshgrid(np.linspace(x1min, x1max, Npnt),
                           np.linspace(x2min, x2max, Npnt))
ygrd = lgr.predict(np.vstack((x1grd.ravel(), x2grd.ravel())).T)
ygrd = ygrd.reshape(x1grd.shape)
```
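The reshaping steps are easy to get wrong, so here is a tiny standalone sketch of the shapes involved (toy sizes, not the post's grid):

```python
import numpy as np

# a 4 x 3 toy grid: meshgrid returns two (3, 4) arrays
g1, g2 = np.meshgrid(np.linspace(0, 1, 4), np.linspace(0, 1, 3))
print(g1.shape)        # (3, 4)

# ravel + vstack + transpose: one (x1, x2) pair per grid node
toy_grid = np.vstack((g1.ravel(), g2.ravel())).T
print(toy_grid.shape)  # (12, 2)
```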
The figure shows the dataset points from the two classes with yellow and purple colours, the contour plot of the model outcome for the 2D grid and the decision boundary as a red line.
The boundary can be easily drawn with the `contour` plot itself, by setting the `levels` argument to the output level we want to highlight.
A logistic regression shifts from one class to the other when the predicted probability exceeds 0.5.
```python
plt.figure(figsize=(10, 5))
# contour of the model outcome over the grid
plt.contourf(x1grd, x2grd, ygrd, cmap=plt.cm.Paired, alpha=0.4)
plt.title("Decision surface of Logistic Regression")
plt.axis('tight')
# dataset
plt.scatter(x1, x2, c=YY, cmap='viridis')
plt.xlabel("X1")
plt.ylabel("X2")
# decision boundary
cs = plt.contour(x1grd, x2grd, ygrd, levels=[0.5],
                 colors=('r',), linestyles=('-',), linewidths=(3,))
plt.tight_layout()
```
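The 0.5 threshold can be checked directly. In this sketch (my own, on toy data rather than the post's dataset), the class returned by `predict` agrees everywhere with thresholding `predict_proba` at 0.5:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
Xtoy = rng.normal(size=(200, 2))
ytoy = (Xtoy[:, 0] > 0).astype(int)

clf = LogisticRegression(C=1e5).fit(Xtoy, ytoy)

# the predicted class flips exactly where P(y=1) crosses 0.5
probs = clf.predict_proba(Xtoy)[:, 1]
assert np.array_equal(clf.predict(Xtoy), (probs > 0.5).astype(int))
```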
3.2 High-degree model
We repeat the same classification procedure with a third-degree model:
$$ y = \sigma\big( X\cdot W \big) $$
$$ X = [x_1^3, x_2^3, x_1^2, x_2^2, x_1, x_2, \mathbf{1}] $$
where the weight matrix $W$ has dimension `(7, 1)` and $X$ (`XXnl` in the code) is the horizontal concatenation of the six polynomial terms of the coordinates and a bias column of ones.
We still train a logistic regression with no regularization.
Since the `PolynomialFeatures` class produces either the full set of polynomial terms or the interaction features only (products of at most `degree` distinct input features, as described for the `interaction_only` parameter), we build our own function that keeps only the pure powers, with no interaction terms.
```python
def polyFeatNoInteraction(XX, dgr):
    # stack the pure powers x**1 .. x**dgr of each feature, no cross terms
    # (a list, not a generator: np.hstack no longer accepts generators)
    return np.hstack([XX**(ii + 1) for ii in range(dgr)])
```
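To see which columns the function produces, here is a quick check on example values of my own: for degree 3 and two features it returns the six pure powers, grouped by power.

```python
import numpy as np

def polyFeatNoInteraction(XX, dgr):
    # stack the pure powers x**1 .. x**dgr, no cross terms
    return np.hstack([XX**(ii + 1) for ii in range(dgr)])

print(polyFeatNoInteraction(np.array([[2.0, 3.0]]), 3))
# columns: x1, x2, x1^2, x2^2, x1^3, x2^3 -> 2, 3, 4, 9, 8, 27
```

Note the column order differs from the equation above, but the feature set is identical; the learnt weights simply adapt to the ordering.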
We transform the linear input, scale it and feed it into the logistic model.
```python
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

dgr = 3
scl = StandardScaler()
XXnl = polyFeatNoInteraction(XX, dgr)
XXnl = scl.fit_transform(XXnl)

lgr = LogisticRegression(C=1e5)  # we want to ignore regularization
lgr.fit(XXnl, YY)
Ypred = lgr.predict(XXnl)
```
We reuse the earlier-defined 2D meshgrid and preprocess it to get the input array.
Before feeding it to the `predict` function, we need to apply the same polynomial transformation and scaling to the grid data as well.
```python
XXgrd = np.vstack((x1grd.ravel(), x2grd.ravel())).T
ygrd = lgr.predict(scl.transform(polyFeatNoInteraction(XXgrd, dgr)))
ygrd = ygrd.reshape(x1grd.shape)
```
The figure shows the same comparison as before; the only difference is the decision boundary between the two classes. It is much more sensitive to the data points at the extreme sides of the domain, while it cannot do any better in the overlapping region.
```python
plt.figure(figsize=(10, 5))
# contour of the model outcome over the grid
plt.contourf(x1grd, x2grd, ygrd, cmap=plt.cm.Paired, alpha=0.4)
plt.title("Decision surface of third-degree Logistic Regression")
plt.axis('tight')
# dataset
plt.scatter(x1, x2, c=YY, cmap='viridis')
plt.xlabel("X1")
plt.ylabel("X2")
# decision boundary
cs = plt.contour(x1grd, x2grd, ygrd, levels=[0.5],
                 colors=('r',), linestyles=('-',), linewidths=(3,))
```