1. Introduction

This post belongs to a new series of posts on a huge and popular topic in machine learning: fully connected neural networks.

The series scope is three-fold:

  1. visualize the model features and characteristics with schematic pictures and charts
  2. learn to implement the model with different levels of abstraction, given by the framework used
  3. have some fun with one of the hottest topics right now!

In the previous post, we gave some geometric insight into what occurs in a single neuron. In this post, we extend this geometric intuition to what occurs in a neural network.

2. Description

Let’s tackle the classification problem that we saw at the beginning of the previous post, where we have the 9-point grid. We first draw the first decision boundary, the green line. This line must be perpendicular to the segment joining point A and the origin, so the weight vector is of the form $(k, -k)$ for some $k > 0$. Since point A is $2\cdot\sqrt{2}$ away from the origin, as you can see in the chart below, we want the distance of the decision line from the origin to be $\frac{1}{4}\cdot 2\cdot\sqrt{2} = \frac{\sqrt{2}}{2}$, so that it lies exactly halfway between the blue points B and D and the red points C, E and G.

3. Narrow uncertain area

The uncertain area should be really narrow to see how the model behaves when it is super confident! The area width is $\beta = 2\cdot \frac{5}{w}$, where $w$ is the magnitude of the weight vector (the band where $|z| < 5$, i.e. where the sigmoid is not yet saturated). So if we choose $k=10$, the weight vector is $\vec{w} = (10, -10)$, its magnitude is $w = 10\cdot\sqrt{2}$ and the uncertain area width is $\sqrt{2}/2$. The line distance is then used to get the bias:

$$ \frac{|b|}{w} = \frac{|b|}{10\cdot\sqrt{2}} = \frac{\sqrt{2}}{2} \rightarrow |b| = 10\cdot\sqrt{2}\cdot\frac{\sqrt{2}}{2} = 10 $$
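As a quick sanity check, here is a minimal numpy sketch that verifies these quantities (the weight magnitude, the line distance from the origin and the band width):

import numpy as np

# first neuron: weight vector (k, -k) with k = 10 and bias 10
w = np.array([10, -10])
b = 10

w_norm = np.linalg.norm(w)        # magnitude: 10 * sqrt(2) ~ 14.142
line_distance = abs(b) / w_norm   # distance from the origin: sqrt(2)/2 ~ 0.707
band_width = 2 * 5 / w_norm       # uncertain area width beta: sqrt(2)/2 ~ 0.707

print(w_norm, line_distance, band_width)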

This step returns the input $z_1$ of the first activation function, which in turn gives $a_1$ as the output of the sigmoid. Any point on the right-hand side of the green area comes out with the first activation essentially equal to 1, $a_1 = 1$; any point on the left-hand side gets $a_1 = 0$; and any point that lies inside the green area gets a value between 0 and 1. In this case, however, we built the decision boundary so that no point lies in this uncertain area.

In a similar fashion, we build a second area, the purple one in the figure, which defines a second activation. Its purpose is to separate the central red dots from the south-east blue points.

We can use the same weights, because the direction has not changed, and we just need to flip the sign of the bias to place the line on the opposite side of the origin. The distance stays the same, because it is $|b|/w$.


Figure 1 - NN activation output space is narrow

Now we use the code below to build the grid of nine points (points A through I) and to transform them into the new 2D space via those two logistic regression functions.

The weights are shared for the two logistic regression transformations, $\vec{w} = (10, -10)$, and bias is $10$ for the first one and $-10$ for the second one.

We apply this first step as the affine transformation and feed it to the activation function, a sigmoid here, and we see that we get nearly all 0s or 1s as output coordinates. That makes sense, because we made the uncertain area narrow enough to be confident about the classification of every point.

In detail, points $A$, $B$, $D$ are on the left-hand side of both the green and purple areas. That’s why we get 0 for both the new coordinates, $a_1 = a_2 = 0$. Points $F$, $H$, $I$ are instead on the right-hand side of both the green and purple areas, so we get 1 for both the new coordinates.

We draw these two sets of points onto the right-hand side of the chart.

Any red point is instead on the right-hand side of the green area, thus we get 1 for $a_1$, and on the left-hand side of the purple area, thus we get 0 for $a_2$. It means that the new coordinates of points $C$, $E$, $G$ are $(1, 0)$.

Now we have two sets of points, 6 blue and 3 red, that we can separate pretty easily with just one line. This means we can now apply a single logistic regression, as we did in the previous cases, to separate these two regions.

import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt

# weights and biases
ww, bb1, bb2 = np.r_[10, -10].reshape(2, -1), 10, -10

# activation function
sigmoid = lambda xx: 1/(1+np.exp(-xx))

# grid points
PP = np.array([[x1, x2] for x2 in range(2, -3, -2) for x1 in range(-2, 3, 2)])

# affine transformation
z1 = np.dot(PP, ww) + bb1
z2 = np.dot(PP, ww) + bb2

# activation neurons
a1 = sigmoid(z1)
a2 = sigmoid(z2)

# point coordinates in the new 2D space
QQ = np.round(np.hstack((a1, a2)), decimals=4)

# print
for label, pcoord, qcoord in zip(list('ABCDEFGHI'), PP.tolist(), QQ.tolist()):
    print("Point {} = ({}, {}) is now located at ({}, {})".format(label, *pcoord, *qcoord))
Point A = (-2, 2) is now located at (0.0, 0.0)
Point B = (0, 2) is now located at (0.0, 0.0)
Point C = (2, 2) is now located at (1.0, 0.0)
Point D = (-2, 0) is now located at (0.0, 0.0)
Point E = (0, 0) is now located at (1.0, 0.0)
Point F = (2, 0) is now located at (1.0, 1.0)
Point G = (-2, -2) is now located at (1.0, 0.0)
Point H = (0, -2) is now located at (1.0, 1.0)
Point I = (2, -2) is now located at (1.0, 1.0)

We apply the same tricks as before.

The decision line must be parallel to the identity line (the $45°$ line), so the weight vector is again of the form $(k, -k)$ for some $k > 0$. Since the red points are $\sqrt{2}/2$ away from the line crossing the blue points, we want the decision line distance from that line to be half of it, $d = \sqrt{2}/4$. The uncertain area should be narrow enough not to cross any of those points, so its width can be set equal to that distance, $\beta = \sqrt{2}/4$.
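Mirroring the earlier derivation, the width requirement fixes $k = 20$, and the bias magnitude follows from the required distance of the line from the blue diagonal (which passes through the origin):

$$ \beta = 2\cdot\frac{5}{w} = \frac{10}{20\cdot\sqrt{2}} = \frac{\sqrt{2}}{4}, \qquad \frac{|b|}{w} = \frac{|b|}{20\cdot\sqrt{2}} = \frac{\sqrt{2}}{4} \rightarrow |b| = 20\cdot\sqrt{2}\cdot\frac{\sqrt{2}}{4} = 10 $$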

The final transformation is:

$$ \vec{w}\cdot \vec{a} + b = (20, -20)\cdot \vec{a} - 10 $$

Whatever lies on the left-hand side of the yellow area is associated with $y=0$, and whatever lies on the right-hand side with $y=1$.

We are able to properly classify blue/red points as belonging to class zero/one, respectively. Since we are pretty confident about that, the model loss would be nearly 0 and accuracy would be 100%.
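As a quick check (a minimal sketch reusing QQ and sigmoid from the code above), we can apply this final transformation directly to the transformed coordinates:

# output layer: weights (20, -20) and bias -10
vv, cc = np.r_[20, -20].reshape(2, -1), -10

# final probability P(y=1) for each of the nine points
yy = sigmoid(np.dot(QQ, vv) + cc)

for label, prob in zip(list('ABCDEFGHI'), yy.ravel()):
    print("Point {}: P(y=1) = {:0.5f}".format(label, prob))

The blue points come out with $P(y=1)$ very close to 0 and the red points with $P(y=1)$ very close to 1, which is exactly the over-confident behaviour described above.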

The point here is that we force the system to really overfit these 9 points. The model would always be prone to overfitting, for any new set of points, as we can infer from the quite large weights in all three affine transformations (10 for the first two, 20 for the last one).

4. Let’s relax the uncertain area a bit

What if the model needs to be less confident? What if we reduce the weights’ magnitude? The next step is to analyse what happens if we draw the same decision boundaries, so the two green and purple lines don’t change, but make the uncertain area much wider by relaxing the weights (keeping them small).
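With the relaxed weights used in the code below, $\vec{w} = (2, -2)$ and $b = \pm 2$, the line distance $|b|/w = 2/(2\cdot\sqrt{2}) = \sqrt{2}/2$ is unchanged, while the band becomes much wider than the grid spacing:

$$ \beta = 2\cdot\frac{5}{w} = \frac{10}{2\cdot\sqrt{2}} = \frac{5}{\sqrt{2}} \approx 3.5 $$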


Figure 2 - NN activation output space is wider

In this new case, point A is still on the left-hand side of both the green and purple areas, so its new coordinates are still approximately $a_1 = a_2 = 0$.

There is a new situation for B and D. They are on the left-hand side of the green decision line, so we still believe they belong to class zero of the first logistic regression, but they also lie in the uncertain area, so we are no longer 100% sure about that. The same happens for $F$, $H$ and the three red points.

If we apply the same methodology as before, but just with different weights, we end up with new coordinates for all the points but $A$. Still, we are able to linearly separate them.

That’s why the problem can again be solved in the same fashion, while we try to prevent the model from overfitting.

We can solve the problem, since we can definitely find a line that separates those points, and that can be achieved with different magnitudes of weights.

Every point of the grid now lies in the yellow area, so each of them is assigned a probability of belonging to one of the two classes, ranging between 0 and 1, but none of them is strictly 0 or 1.

It means that the accuracy would be 100% because we are properly classifying all of those points, but the loss is not 0 anymore.

In other words, the model now returns a probability between 0% and 50% for blue points to belong to class one (in the previous case, it gave 0%) and between 50% and 100% for red points to belong to class one (it previously gave 100%).

The model accuracy is 100% since probability $P(y=1)$ is less than 50% for blue points and $P(y=1)$ is greater than 50% for red points, but loss is not 0 since $P(y=1)>0$ for blue points and $P(y=1)<1$ for red points.

# new set of weights and biases
ww, bb1, bb2 = np.r_[2, -2].reshape(2, -1), 2, -2

# affine and activation transformation
z1 = np.dot(PP, ww) + bb1
z2 = np.dot(PP, ww) + bb2
a1 = sigmoid(z1)
a2 = sigmoid(z2)

# point coordinates in the new 2D space
RR = np.round(np.hstack((a1, a2)), decimals=4)

# print
for label, pcoord, qcoord in zip(list('ABCDEFGHI'), PP.tolist(), RR.tolist()):
    print("Point {} = ({}, {}) is now located at ({:0.3f}, {:0.3f})".format(label, *pcoord, *qcoord))
Point A = (-2, 2) is now located at (0.003, 0.000)
Point B = (0, 2) is now located at (0.119, 0.003)
Point C = (2, 2) is now located at (0.881, 0.119)
Point D = (-2, 0) is now located at (0.119, 0.003)
Point E = (0, 0) is now located at (0.881, 0.119)
Point F = (2, 0) is now located at (0.998, 0.881)
Point G = (-2, -2) is now located at (0.881, 0.119)
Point H = (0, -2) is now located at (0.998, 0.881)
Point I = (2, -2) is now located at (1.000, 0.998)
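The post does not fix the relaxed output-layer weights explicitly, so the values below, $(4, -4)$ with bias $-2$, are just one hypothetical choice that keeps the same decision line ($a_1 - a_2 = 1/2$) with a smaller magnitude. Here is a minimal sketch of the accuracy and loss computation, reusing RR and sigmoid from above:

# assumed relaxed output layer: same decision line, smaller weights (hypothetical values)
vv, cc = np.r_[4, -4].reshape(2, -1), -2

# true labels: 1 for the red points (C, E, G), 0 for the blue points
yy_true = np.array([0, 0, 1, 0, 1, 0, 1, 0, 0])

# predicted probability P(y=1) for each point
yy_prob = sigmoid(np.dot(RR, vv) + cc).ravel()

# accuracy is still 100%, but the binary cross-entropy loss is no longer close to 0
accuracy = np.mean((yy_prob > 0.5) == yy_true)
loss = -np.mean(yy_true*np.log(yy_prob) + (1 - yy_true)*np.log(1 - yy_prob))

print("accuracy = {:0.0%}, loss = {:0.3f}".format(accuracy, loss))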

5. Final transformation

The final step of this post is to try to apply a simple mathematical function of the two inputs and use this additional feature to classify the same grid of points.

That feature takes the absolute difference of the two inputs and returns 1 if that absolute difference is less than 1, and 0 otherwise.
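Here is a minimal sketch of this feature, assuming that “the two inputs” are the original coordinates $x_1, x_2$ of the grid (reusing PP from the first code block):

# hand-crafted feature: f = 1 if |x1 - x2| < 1, else 0
ff = (np.abs(PP[:, 0] - PP[:, 1]) < 1).astype(int)

# thresholding at f = 1/2 assigns the red points (C, E, G) to class 1 and the blue points to class 0
for label, f in zip(list('ABCDEFGHI'), ff):
    print("Point {}: f = {} -> class {}".format(label, f, int(f > 0.5)))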

In the below picture, we see how all the points have been transformed into the new 1D space (right side) where we just have one feature. All the points lie at either the $f=0$ or the $f=1$ location.

The classification task becomes extremely easy since $f=\frac{1}{2}$ separates the two sets of points. The model is 100% confident if the yellow area width is something like $\beta = \frac{1}{2}$.


Figure 3 - NN activation output space as features for the next layer