1. Introduction

In the previous post we introduced the linear-regression problem and the structure used to solve it in a "machine-learning" fashion. In this post we apply that theory to a simple but practical case: identifying a linear behavior from a cloud of synthetically generated data.

We implement the logic from scratch in Python with its powerful numerical library, NumPy.


Figure 1 - Machine learning general structure

2. Data generation

First we generate some synthetic data (Npnts = 500 points). We assume we know both the slope (m = 5) and the intercept (q = -3) of the line we want to identify, and we add Gaussian noise to make the data source a bit closer to real-world scenarios. Note that the noise term used below, 0.5*(np.random.randn(Npnts)-1), is not zero-mean: its mean is -0.5, so the identified intercept will settle near -3.5 rather than -3.

The chart shows the generated data cloud that we feed to the learning algorithm to identify the line specifications.

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
Npnts = 500 # number of points
xx = np.linspace(0,3,Npnts)
noise = 0.5*(np.random.randn(Npnts)-1)
mm, qq = 5, -3
yy = mm*xx + qq + noise

plt.figure(figsize=(10, 5))
plt.scatter(xx, yy, alpha=0.75)


The dataset is generated as two 2D arrays, one for inputs and one for outputs. The input array, XX, is the horizontal concatenation of a column of 1s (one per element of the initial 1D array xx) and the array itself: we first stack the two 1D arrays vertically and then transpose the result so that the examples (500) run over the rows and the features (2) over the columns. The output 2D array, YY, is a single column filled with the y values. Here are the shapes of the two arrays.

XX = np.vstack((np.ones_like(xx), xx)).T
YY = yy.reshape(-1,1)
[XX.shape, YY.shape]
[(500, 2), (500, 1)]

3. Loss function

Let us recall the loss function, $J$, where $x^j$ and $y^j$ are the input and output of the $j$-th example, $m$ is the number of examples, and the index $k$ runs over the $n+1$ model parameters (here $n=1$, so there are 2 parameters including the intercept):

$$ J = \frac{1}{2}\sum_{j=1}^{m} \left(y^j- \sum_{k=0}^{n} \theta_k \cdot x^j_k \right)^2 $$

$$ J = \frac{1}{2}\sum_{j=1}^{m} (y^j-\theta^T \cdot x^j)^2 $$

The model parameters (or weights), $\theta$, are combined with the $j$-th example input, $x^j$, to produce the predicted output, which is then compared to the actual output, $y^j$, in terms of squared difference. The following code computes the error of the $j$-th example in two ways: the first uses a for-loop for illustration, while the second employs matrix multiplication and is the one adopted in the actual implementation for efficiency.

Nex, Nfeat = XX.shape # examples, features
ww = np.array([0, -4]).reshape(-1, 1)

jj = np.random.randint(Nex) # pick a random example
y_ = 0
for kk in range(Nfeat):
    y_ += ww[kk,0]*XX[jj, kk]

(YY[jj, 0] - y_)**2
(np.dot(XX[jj,:], ww[:,0]) - YY[jj,0])**2

If the full matrix product between the input and parameter arrays is taken, we obtain a 2D array with a single column that stores the predictions over the whole set of examples; its shape is identical to that of the actual output, YY. The overall loss is the sum of every error contribution over the row axis of the examples, which is indexed with 0 in NumPy. Note that the implementation below also divides by the number of examples, i.e., it averages the loss over the m points. The final loss-function implementation is as simple as follows.

def lossFunction(XX, YY, ww):
    Npnt = XX.shape[0]
    J = np.sum((np.dot(XX, ww) - YY)**2, axis=0)/2/Npnt
    return J

lossFunction(XX, YY, ww)
array([ 80.93621425])

4. Gradient descent

We need the gradient of the loss function to apply the gradient-descent algorithm to the linear-regression problem. We calculate the $k$-th element of the gradient (Eq. (1)):

$$\frac{\partial J}{\partial \theta_k} = \frac{\partial}{\partial \theta_k}\frac{1}{2}\sum_{j=1}^{m} (y^j-\theta^T \cdot x^j)^2 $$

$$ = \frac{1}{2}\sum_{j=1}^{m} \frac{\partial}{\partial \theta_k}(y^j-\theta^T \cdot x^j)^2 $$

$$ = 2\frac{1}{2}\sum_{j=1}^{m} (y^j-\theta^T \cdot x^j)\frac{\partial}{\partial \theta_k}(y^j-\theta^T \cdot x^j) $$

$$ = -\sum_{j=1}^{m} (y^j-\theta^T \cdot x^j)\cdot x_k^j = \sum_{j=1}^{m} (\theta^T \cdot x^j - y^j)\cdot x_k^j $$

Since the vector format of the linear regression expands to a sum as:

$$ \theta^T \cdot x^j = \sum_{k=0}^n \theta_k\cdot x_k^j $$

the derivative of this vector product with respect to the parameter $\theta_k$ is $x_k^j$, so the derivative of the residual $(y^j-\theta^T \cdot x^j)$ with respect to $\theta_k$ is $-x_k^j$.

Keep in mind that the input feature associated with the first parameter $\theta_0$, which is the intercept, is always 1.
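As a quick sanity check of the gradient formula, we can compare the vectorized expression used in the code, $\sum_j (\theta^T \cdot x^j - y^j)\, x_k^j$, with a finite-difference approximation. This is a small sketch on arbitrary toy data (the arrays and seed here are illustrative, not part of the post's pipeline):

```python
import numpy as np

np.random.seed(0)
X = np.hstack((np.ones((5, 1)), np.random.randn(5, 1)))  # intercept column + one feature
Y = np.random.randn(5, 1)
w = np.array([[0.5], [-1.0]])

# Analytic gradient: sum over examples of (prediction - target) * x_k
grad = np.sum((np.dot(X, w) - Y) * X, axis=0).reshape(-1, 1)

# Loss J = 0.5 * sum of squared errors, as in the derivation
def J(w):
    return 0.5 * np.sum((np.dot(X, w) - Y) ** 2)

# Central finite differences, one parameter at a time
eps = 1e-6
grad_fd = np.zeros_like(w)
for k in range(w.shape[0]):
    dw = np.zeros_like(w)
    dw[k, 0] = eps
    grad_fd[k, 0] = (J(w + dw) - J(w - dw)) / (2 * eps)

print(np.allclose(grad, grad_fd, atol=1e-4))  # True
```

The two gradients agree to numerical precision, confirming the sign convention: the gradient points along $(\theta^T \cdot x^j - y^j)\, x_k^j$, which is exactly what the update step in the training code computes.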

Code-wise, a vectorized implementation of the parameter-update step is fundamental for fast computation. The input data are stored in a 2D array with m rows and n+1 columns; in the code example below, named X_, we have m=6 and n=1. The parameters are allocated in a 2D array, w_, with n+1 rows and one column. NumPy's dot operator returns the matrix product of the two 2D arrays, whose shape is (m, 1). We then subtract the ground-truth output, Y_, element-wise, and broadcast the resulting (m, 1) array, $\epsilon$, to perform the element-wise product with the input array, X_.

X_ = np.arange(12).reshape(-1,2)
Y_ = np.arange(6).reshape(-1,1)
w_ = np.arange(2).reshape(-1,1)
epsilon = np.dot(X_, w_) - Y_
print(X_)
[[ 0  1]
 [ 2  3]
 [ 4  5]
 [ 6  7]
 [ 8  9]
 [10 11]]

Broadcasting is a powerful NumPy feature that enables the element-wise product of two arrays that share one dimension while, in the other dimension, one of the two arrays has size 1. Two cases can occur, both leading to an (n, m) array:

  1. An (n, m) array with an (n, 1) array, where the second array is duplicated column-wise m times
  2. An (n, m) array with a (1, m) array, where the second array is duplicated row-wise n times
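Both cases can be sketched with small toy arrays (these values are arbitrary examples):

```python
import numpy as np

A = np.arange(6).reshape(2, 3)    # shape (2, 3)
col = np.array([[10], [20]])      # shape (2, 1): duplicated column-wise 3 times
row = np.array([[1, 2, 3]])       # shape (1, 3): duplicated row-wise 2 times

print((A * col).shape)  # (2, 3)
print((A * row).shape)  # (2, 3)
print(A * col)
# [[  0  10  20]
#  [ 60  80 100]]
```

In both cases the size-1 dimension is virtually replicated to match the other array, without actually copying memory.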

The (m, 1) array, $\epsilon$, is the factor common to every parameter $\theta_k$. Since Eq. (1) states that the gradient element for $\theta_k$ involves its corresponding feature $x_k$, we use the complete X_ array to update the entire parameter vector $\theta$ at once.

print(epsilon.shape)
print(epsilon * X_)
print((epsilon * X_).shape)
(6, 1)
[[ 0  1]
 [ 4  6]
 [12 15]
 [24 28]
 [40 45]
 [60 66]]
(6, 2)

The sum operator in Eq. (1) runs over the examples, which translates to NumPy's sum operator over the 0-axis. However, this step returns a 1D array whose size equals the number of parameters to learn (n+1). We first need to reshape it into a 2D column array of shape (n+1, 1) before applying the update step with the learning rate and the current parameter estimate.

grad_ = np.sum((np.dot(X_, w_) - Y_) * X_, axis=0).reshape(-1, 1)
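The reshape is not cosmetic: without it, subtracting the 1D gradient from the (n+1, 1) parameter array broadcasts silently to the wrong shape instead of raising an error. A minimal illustration with the same toy arrays as above:

```python
import numpy as np

X_ = np.arange(12).reshape(-1, 2)
Y_ = np.arange(6).reshape(-1, 1)
w_ = np.arange(2).reshape(-1, 1)

grad_1d = np.sum((np.dot(X_, w_) - Y_) * X_, axis=0)  # shape (2,)

# (2, 1) minus (2,) broadcasts to (2, 2): a silent bug, not an error
print((w_ - 0.1 * grad_1d).shape)                     # (2, 2)
# With the reshape, the update keeps the expected column shape
print((w_ - 0.1 * grad_1d.reshape(-1, 1)).shape)      # (2, 1)
```

This kind of silent broadcast is a common NumPy pitfall, which is why the training code below reshapes the summed gradient before the update.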

5. Training

Here follows the complete code of the learning process, where the model parameters are updated Nepoch times. The parameters and the corresponding loss value at each step are stored in two lists and returned at the end of the function call.

def gradientDescent(XX, YY, ww, lr=0.1, Nepoch=1500):
    Npnt = XX.shape[0]
    Jevol, wevol = [], []
    for _ in range(Nepoch):
        Jevol.append(lossFunction(XX, YY, ww))
        wevol.append(ww.flatten())  # store the current parameter estimate
        ww = ww - lr/Npnt * np.sum((np.dot(XX, ww) - YY) * XX, axis=0).reshape(-1,1)
    return np.array(wevol), np.array(Jevol)

wOpt and Jopt are the optimal parameter values and the minimum loss, taken from the bottom of the evolution arrays. The final values of wOpt are close to the ones used to generate the data (the intercept is pulled towards -3.5 by the non-zero mean of the noise).

J = lossFunction(XX, YY, ww)
Nepoch, lr = 1500, 0.05
wEvol, Jevol = gradientDescent(XX, YY, ww, lr, Nepoch)
wOpt, Jopt = wEvol[-1,:], Jevol[-1]
print(wOpt)
[-3.40781857  4.97147063]
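As a sanity check, the same optimum can be obtained in closed form with NumPy's least-squares solver; gradient descent should land very close to it. This sketch regenerates data of the same kind with an assumed random seed (the seed and exact values are illustrative, not those of the post):

```python
import numpy as np

# Same synthetic setup as above; the noise mean is -0.5, so the
# fitted intercept sits near q - 0.5 = -3.5 rather than at q = -3
np.random.seed(42)
Npnts = 500
xx = np.linspace(0, 3, Npnts)
yy = 5 * xx - 3 + 0.5 * (np.random.randn(Npnts) - 1)

XX = np.vstack((np.ones_like(xx), xx)).T
wLstsq, *_ = np.linalg.lstsq(XX, yy.reshape(-1, 1), rcond=None)
print(wLstsq.flatten())  # roughly [-3.5, 5.0]
```

The closed-form solution exists for linear regression because the loss is quadratic; gradient descent is used here mainly to exercise the machine-learning machinery on a problem whose answer we can verify.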

The below figure compares the line with optimal parameters, wOpt, to the cloud of points used to train the model.

plt.figure(figsize=(10, 5))
plt.scatter(xx, yy, alpha=0.75)
plt.plot(xx, wOpt[0] + wOpt[1]*xx, 'r', lw=2)


6. Parameter evolution

Here we plot the evolution of the parameters $\theta$ over the Nepoch training steps. The initial and final values are depicted with the green and red points, respectively. The size of each intermediate blue point is proportional to the loss function: the smaller the point, the lower the loss and the better the model performance.

plt.figure(figsize=(10, 5))
plt.scatter(wEvol[:,0], wEvol[:,1], s=10+100*Jevol/np.max(Jevol), alpha=0.5, label='Weight evolution')
plt.plot(wEvol[0,0], wEvol[0,1], 'g', linestyle='none', marker='o', markersize=10, alpha=0.75, label='Initial weights')
plt.plot(wEvol[-1,0], wEvol[-1,1], 'r', linestyle='none', marker='o', markersize=10, alpha=0.75, label='Optimal weights')


In the following plot, we compare the evolution of the parameters $\theta$ with a 2D map of the loss function. To this end, we create a grid of every combination of $\theta_0$ and $\theta_1$ using meshgrid. The two 2D arrays are then flattened and vertically stacked to obtain the parameter grid, wmg, which contains $20*26=520$ two-element rows. We want to calculate the loss function for each pair of parameters; since lossFunction expects a (2, 1) array, we reshape the 1D array obtained by iterating along the wmg rows.

A Python list comprehension performs this process, returning a list of 520 loss values, one per grid point. The final step converts the list into an array and reshapes it into a 2D array whose shape matches that of the initial meshgrid arrays, w1mg and w2mg. The plot shows the parameter evolution on top of the contour plot of the 2D loss function, where the warmer the colour, the greater the loss.

step = 0.5
w1s = np.arange(-7, 3, step)
w2s = np.arange(-5, 8, step)
w1mg, w2mg = np.meshgrid(w1s, w2s)
wmg = np.vstack((w1mg.flatten(), w2mg.flatten())).T
w1s.shape, w2s.shape, wmg.shape
((20,), (26,), (520, 2))
w_ = np.array([-3.5, 5]).reshape(-1, 1)
[w_, lossFunction(XX, YY, w_)]
[array([[-3.5],
        [ 5. ]]), array([ 0.1389385])]
Jlist = [lossFunction(XX, YY, wmg[kk,:].reshape(-1, 1)) for kk in range(wmg.shape[0])]
Jmap = np.array(Jlist).reshape(-1, w1s.shape[0])
plt.figure(figsize=(10, 5))
plt.contourf(w1mg, w2mg, Jmap, alpha=0.5, cmap=plt.cm.jet)
plt.scatter(wEvol[:,0], wEvol[:,1], s=10+100*Jevol/np.max(Jevol), alpha=0.5, label='Weight evolution')
plt.plot(wEvol[-1,0], wEvol[-1,1], 'r', linestyle='none', marker='o', markersize=10, alpha=0.75, label='Optimal weights')
plt.plot(wEvol[0,0], wEvol[0,1], 'g', linestyle='none', marker='o', markersize=10, alpha=0.75, label='Initial weights')


This plot reports the trend of the logarithm of the loss function over the training steps.

plt.figure(figsize=(10, 5))
plt.plot(np.log(Jevol), lw=2)
plt.xlabel("training steps ($N_{epoch}$)")
plt.ylabel("Logarithm loss trend ($log(J_{evol})$)")


The final plot summarizes the post by visualizing the training process at the steps where the logarithm of the loss function first drops below each integer value. We create a list, idxs, that stores the Nrow = 6+1 indices of the Jevol array that correspond to such training steps, plus the final training step.

idxs = [np.where(np.log(Jevol)<kk)[0][0] for kk in range(4, -2, -1)] + [Jevol.shape[0]-1]
Nrow = len(idxs)

The chart has a matrix structure, with as many rows as training steps to analyse, Nrow, and two columns: the left one compares the data point cloud to the current model (red line) and reports the current loss (J value); the right one shows the trajectory of the model parameters within the 2D parameter space up to the current epoch.

plt.figure(figsize=(15, Nrow*5))
for kk, idx in enumerate(idxs):
    plt.subplot(Nrow, 2, 2*kk+1)
    plt.scatter(xx, yy, alpha=0.75)
    plt.plot(xx, wEvol[idx,0] + wEvol[idx,1]*xx, 'r', lw=2)
    plt.title("Current J value: {0:.2f}".format(Jevol[idx,0]))    
    plt.subplot(Nrow, 2, 2*kk+2)
    plt.contourf(w1mg, w2mg, Jmap, alpha=0.5, cmap=plt.cm.jet)
    plt.scatter(wEvol[:idx,0], wEvol[:idx,1], s=10+100*Jevol[:idx,0]/np.max(Jevol), alpha=0.5, label='Weight evolution')
    plt.plot(wEvol[idx,0], wEvol[idx,1], 'r', linestyle='none', marker='o', markersize=10, alpha=0.75, label='Current weights')
    plt.title("Current epoch: " + str(idx))




7. References

  1. CS229 notes
  2. Machine Learning at Coursera
  3. Maximum likelihood estimators and least squares
  4. An Introduction to Statistical Learning