1. Introduction

This post belongs to a new series of posts related to a huge and popular topic in machine learning: fully connected neural networks.

The series scope is three-fold:

  1. visualize the model features and characteristics with schematic pictures and charts
  2. learn to implement the model with different levels of abstraction, given by the framework used
  3. have some fun with one of the hottest topics right now!

In this post, we give some geometric insight into what happens inside a single neuron. Note that, if the activation is a sigmoid function, the neuron performs a logistic regression on its inputs. In the following post, we extend this geometric intuition to a neural network.

2. What is a neural network

We start by analyzing a toy problem and understanding why logistic regression (LR) is not suitable for it and why we need more powerful functions to solve it. Below there is a chart showing nine points. Different colours mean different classes, so this is a binary classification problem, because we have only two colours, blue and red. It is easy to see that it is impossible to separate the blue and red regions with a single line, and this is why we need a more complex function: we have to carve out a narrow area between the two blue regions and assign it to the red class. To do that, we need to combine multiple logistic regressions.


Figure 1 - NN neuron intuition
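
To make the setup concrete, below is a minimal sketch of a dataset with the same structure; the coordinates are illustrative placeholders, not the actual points of Figure 1.

```python
import numpy as np

# Hypothetical stand-in for the nine points of Figure 1 (not the exact
# coordinates): the red class sits in a narrow band between two blue regions,
# so no single straight line can separate the two colours.
X = np.array([
    [-2.0, -2.0], [-2.0, 0.0], [-2.0, 2.0],   # left blue region
    [ 0.0, -2.0], [ 0.0, 0.0], [ 0.0, 2.0],   # red band in the middle
    [ 2.0, -2.0], [ 2.0, 0.0], [ 2.0, 2.0],   # right blue region
])
y = np.array([0, 0, 0, 1, 1, 1, 0, 0, 0])     # 0 = blue, 1 = red

# A single logistic regression draws one line w.x + b = 0: whichever w and b
# we pick, one blue region always ends up on the same side as the red band.
```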

The main idea is to see how far we can push past this limit by simply combining more of these functions. If you want to understand more about logistic regression, please see my previous series.

Now we are going to focus on how to interpret LR geometrically and on what we need to understand in order to combine multiple regressions together.

3. Decision boundary

Geometrically, we can interpret this function as a simple classifier that draws a line, called the decision boundary, which splits the entire domain into two sub-areas, one for each class. If we interpret the weights as a vector, this decision line is perpendicular to the direction given by that vector.

We can have a look at the chart below: there are two weights, one for each input, $x_1$ and $x_2$, in the 2D domain. The weight vector $(w_1, w_2)$ fixes the orientation of the decision line, labelled $r$ in the chart, since $r$ is perpendicular to it.


Figure 2 - Decision boundary

How does a combination of weights and bias define the decision line? Mathematically, we know that the line defining the decision boundary is:

$$ z = w\cdot \vec{x} + b = w_1\cdot x_1 + w_2\cdot x_2 + b = 0 $$

This is the set of points where we don't know how to classify the input. But the more $z$ grows above 0, the more confident we are that the input belongs to the first class, and the more it drops below 0, the more confident we are that it belongs to the other class.
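
As a quick sketch of this rule, the snippet below evaluates $z = w\cdot \vec{x} + b$ for a few points and classifies them by the sign of $z$; the values of $w$, $b$ and the inputs are arbitrary assumptions, not taken from the figures.

```python
import numpy as np

# Illustrative weights and bias (not taken from the post's figures).
w = np.array([2.0, -1.0])
b = 0.5

def z(x, w, b):
    """Affine part of the neuron: z = w . x + b."""
    return np.dot(w, x) + b

for x in [np.array([1.0, 1.0]), np.array([-1.0, 0.5]), np.array([0.0, 0.5])]:
    score = z(x, w, b)
    side = "class 1" if score > 0 else "class 0" if score < 0 else "on the boundary"
    print(x, "->", round(score, 3), side)
```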

Now we draw a line $s$, perpendicular to the decision line and passing through the origin $(0, 0)$, whose equation is:

$$ s: x_2 = \frac{w_2}{w_1}\cdot x_1 $$

You can verify that a vector along $s$ is $\big(1, \frac{w_2}{w_1}\big)$ and that $s$ is perpendicular to $r$, since the normal vector of $r$, $(w_1, w_2)$, is proportional to $\big(1, \frac{w_2}{w_1}\big)$.

We get a system of two equations in $x_1$ and $x_2$. Solving it with a few simple steps, we find that the intersection of the two lines gives us a point $P$ that lies exactly on our decision line and whose distance from the origin is the minimum possible.

We end up with:

$$ r\cap s: x_1 = -\frac{w_1\cdot b}{w^2} $$

$$ x_2 = -\frac{w_2\cdot b}{w^2} $$

We combine these two coordinates into the point $P$ and realize that its direction is given by the weight unit vector $\vec{u_w} = \vec{w}/w$ and its distance from the origin by the ratio between the bias and the weight magnitude, $b/w$:

$$ P = (x_1, x_2) = -\frac{b}{w^2}\cdot \vec{w} = -\frac{b}{w}\cdot \vec{u_w} $$
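
As a small numerical check of this closed form, assuming arbitrary illustrative values for $w$ and $b$, the formula for $P$ should coincide with the point obtained by solving the system $r \cap s$ directly:

```python
import numpy as np

# Illustrative values, not taken from the post's figures.
w = np.array([2.0, -1.0])
b = 0.5
w_sq = np.dot(w, w)                        # w^2 = w1^2 + w2^2

# Closed form derived above: P = -(b / w^2) * (w1, w2)
P_formula = -b * w / w_sq

# Same point obtained by solving the 2x2 linear system
#   r: w1*x1 + w2*x2 = -b
#   s: w2*x1 - w1*x2 = 0      (x2 = (w2/w1)*x1, rearranged)
A = np.array([[w[0],  w[1]],
              [w[1], -w[0]]])
P_solved = np.linalg.solve(A, np.array([-b, 0.0]))

print(P_formula, P_solved)                 # both ≈ [-0.2, 0.1]
assert np.allclose(P_formula, P_solved)
```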

We can also determine the distance of $P$ from the origin by using the two coordinates of $P$ and the Euclidean formula:

$$ d = \sqrt{\frac{(w_1\cdot b)^2+(w_2\cdot b)^2}{w^4}} = \sqrt{\frac{b^2\cdot w^2}{w^4}} = \sqrt{\frac{b^2}{w^2}} = \frac{|b|}{w} $$
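
Again with illustrative values of $w$ and $b$, we can verify numerically that the Euclidean distance of $P$ from the origin matches $|b|/w$:

```python
import numpy as np

# Illustrative values, not taken from the post's figures.
w = np.array([2.0, -1.0])
b = 0.5

w_norm = np.linalg.norm(w)                 # weight magnitude ||w||
P = -b * w / w_norm**2                     # closest point on w.x + b = 0

d_euclidean = np.linalg.norm(P)            # distance of P from the origin
d_formula = abs(b) / w_norm                # closed form derived above

print(d_euclidean, d_formula)              # both ≈ 0.2236
assert np.isclose(d_euclidean, d_formula)
```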

It means that, for a given bias, increasing the weight magnitude $w$ shifts the decision line toward the origin; if we instead fix the weight magnitude and increase the bias, the decision line moves away from the origin. The direction is always defined by the weight vector.

4. Affine transformation

Here we see how to draw the decision line. The weight vector is perpendicular to the decision line, and along its direction there are two points at distance $\frac{|b|}{w}$ from the origin: $P(b<0)$ on the right-hand side, when the bias is negative, and $P(b>0)$ on the left-hand side, when the bias is positive.


Figure 3 - Affine transformation

Another property, shown in the next figure, is that if we collect as many points as we want on a line $s_1$ parallel to the decision boundary (i.e., perpendicular to the weight direction), say $x^{(1)}$, $x^{(2)}$, $x^{(3)}$, and feed them to the affine transformation (a geometric projection), we end up with the same output. If we instead take the yellow points $x^{(4)}$, $x^{(5)}$, $x^{(6)}$ that lie on $s_2$ and feed them to the same transformation, we again get a single common projected value.

It means that what our logistic regression is basically doing is projecting every point of a given line parallel to the decision boundary onto the same value: it squishes the 2-dimensional space into a 1-dimensional one.
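
Here is a short sketch of this squishing property, with arbitrary $w$, $b$ and offsets: three points taken on the same line parallel to the decision boundary all map to the same scalar $z$.

```python
import numpy as np

# Illustrative values, not taken from the post's figures.
w = np.array([2.0, -1.0])
b = 0.5

# Unit vector along the decision boundary (perpendicular to the weight vector).
direction = np.array([w[1], -w[0]]) / np.linalg.norm(w)

base = np.array([1.0, 0.0])                               # arbitrary anchor point
points = [base + t * direction for t in (-1.0, 0.0, 2.5)] # three points on one parallel line

zs = [np.dot(w, x) + b for x in points]
print(np.round(zs, 6))                                    # three identical values
assert np.allclose(zs, zs[0])
```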

That is the idea behind the decision line. One step is still missing: if we apply the activation function, say a sigmoid, to the output of the affine transformation, we obtain the complete output of a network layer.
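
Putting the two steps together, a minimal single-neuron forward pass looks like the sketch below; the weights, bias and input are again illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    """Full single-neuron output: sigmoid applied to the affine transformation."""
    return sigmoid(np.dot(w, x) + b)

# Illustrative values, not taken from the post's figures.
w = np.array([2.0, -1.0])
b = 0.5
print(neuron(np.array([1.0, 1.0]), w, b))   # ≈ 0.92, leaning toward class 1
```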

5. Uncertainty area

Analyzing the sigmoid function with respect to the input it receives, we see that it squishes everything into $(0, 1)$: the output is essentially zero or one for most of the input domain and moves continuously from 0 to 1 only in a really narrow area.


Figure 4 - 2D visualization of the sigmoid function

In practice, we can assume the function returns 0 for inputs less than -5, increases continuously from 0 to 1 in the range from -5 to 5, and returns 1 for inputs greater than 5. In the above chart, there are three lines:

$$ r: w\cdot \vec{x} + b = -5 \quad \text{red} $$

$$ y: w\cdot \vec{x} + b = 0 \quad \text{yellow} $$

$$ g: w\cdot \vec{x} + b = +5 \quad \text{green} $$

Line $r$ delimits the area where the model is very confident in assigning class 0. The decision boundary, $y$, is the yellow line where we are completely uncertain about the two classes (50-50% chances). The third line, $g$, delimits the area where we are quite confident that the points belong to the other class.
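
The confidence attached to these three lines follows directly from the sigmoid values at $z = -5$, $0$ and $+5$, as this short check shows:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Sigmoid output on the three lines of Figure 4.
for z in (-5.0, 0.0, 5.0):
    print(z, "->", round(sigmoid(z), 4))
# -5.0 -> 0.0067   (very confident in class 0)
#  0.0 -> 0.5      (completely uncertain)
#  5.0 -> 0.9933   (very confident in class 1)
```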

The point of this chart is to show the yellow stripe, which can be narrower or wider depending on the model parameters, where we are less sure about the classification.

Inside it, we can still say that a point belongs to class 0 if it is on the left-hand side of the yellow line and to class 1 if it is on the other side, but the confidence in this statement is lower.

The yellow area is confined between the two lines $r$ and $g$, whose distance is twice the distance between the central line and either of the outer ones, i.e., $2\cdot \frac{5}{w}$.

This means that, if the weight magnitude increases, the area in which we are somewhat uncertain shrinks and becomes really narrow. If the weights get closer to 0, the yellow area widens, i.e., the model becomes less confident. This tradeoff can be managed with a proper regularization technique, which usually keeps the weights as close as possible to 0, preventing them from becoming extremely large and, therefore, unreliable when generalizing to unseen data.
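
As a sketch of this effect, assuming an arbitrary base weight vector, the width of the uncertainty stripe, $2\cdot\frac{5}{w} = \frac{10}{w}$, shrinks as the weight magnitude grows:

```python
import numpy as np

def stripe_width(w):
    """Perpendicular distance between w.x + b = -5 and w.x + b = +5."""
    return 10.0 / np.linalg.norm(w)

# Scale an illustrative base weight vector up and down.
for scale in (0.5, 1.0, 5.0, 20.0):
    w = scale * np.array([2.0, -1.0])
    print(f"||w|| = {np.linalg.norm(w):7.3f}  ->  stripe width = {stripe_width(w):7.3f}")

# Larger ||w||  -> narrower stripe -> sharper, overconfident decisions.
# Smaller ||w|| -> wider stripe    -> softer, less confident predictions.
```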

From this geometric interpretation we see why we want to keep the weights small: it prevents the neural network from building extremely narrow, overconfident decision boundaries, which translate into overfitting on new data points. As soon as a new point from one class shifts slightly into the other class's decision area, the model immediately switches to the other class, since there is no uncertain (yellow) area in between. This behaviour reflects the inability of the model to generalize to new scenarios.