## 1. Problem formulation

We want to create a simplified representation (model) of real/world from available data. The model architecture can be seen as a box that takes some quantities as input, performs some internal computation and returns some other quantities as output. The inputs are also referred to as features or predictors of the problem to solve, since they contain valuable information that the model should exploit to come up with the correct outcome. Two major problems are present, according to the output format:

1. Classification: the output is a discrete variable that can assume integer values within a predefined limited set, where the set dimension is the number of classes. If the number of classes is two, it is a binary classification problem, where the outcome can belong to either one class or the other one. One example is spam detection, where the outcome can belong to either the spam class or the ham. If more than two classes are present, it is a multinomial (or multi-class) classification problem. Some examples are object recognition, where the outcome gives the probability that the input (an image) contains a specific object which the model has been trained for, topic identification, where the model selects the topic of the input text within a defined list.
2. Regression: the output is a continuous variable that can assume real values. Some examples are house price prediction, electricity price estimation and Flight delay forecast.

## 2. The hello-world case for Machine Learning

The problem we are going to study and analyze in this post is the very introductive example of the Machine Learning algorithm field. I would say it is the Hello world of ML. I chose to get started talking about ML with this very easy problem so it would be easier for you to focus on the ML implementation and analysis.

We (in this contest we refers to the human designers) need to define the architecture and give some boundaries to it (model structure, see Fig. 1). The model structure reduces the searching space to the machine. The machine is going to adopt the degrees of freedom we assign. It will then choose the optimal configuration within the assigned boundaries. We are both meta-designers and teachers. The machine can only generate some instances that belong to the family structure we have defined. If our design concept is internally flawless and robust, the machine will always be finding something useful. We are also teachers. We say the machine how well/bad it is performing within the assigned task. That’s why we need a loss function (see Fig. 1). The machine can use that function at any time to measure its performance. Its specific goal is to minimize that loss function as much as possible by changing its parameters, i.e., what we set as degrees of freedom to the machine. Saying that the machine aims at improving its score or performance is equivalent to saying it aims at minimizing the loss function. How can the machine change its parameters to minimize the loss function? That is the learning algorithm that we need to state to guide the learning process (see Fig. 1). In other words, the learning recipe guides the machine to change to improve its score.

Figure 1 shows the scheme of the main points of a Machine-learning framework. The top chart illustrates the interaction between data collected from the real-world, the designer and the machine. The designer takes care of three tasks: model structure, score (loss function) and the learning recipe. The machine uses the tools provided by the designer to improve its performance to represent the real-world scenario.

Figure 1 - Machine learning general structure

## 3. Loss function

We present three ways to define the same loss function:

1. Energy-based definition, where the energy equilibrium of a mechanical system that consists of several mass-concentrated points are connected to a rigid bar by means of a spring.
2. Maximum likelihood estimation, where a statistical approach is used.
3. Least-squared error, where the residual definition is adopted.

### 3.1 Energy-based loss function

For the sake of visualization, we use a 2D plot to show the correlation of a 1D input, x, and the continuous output, y. Let assume the model output shape (a line) represents a bar and each data point is a mass connected to the bar via a vertical spring. The energy stored in the spring is proportional to the spring stiffness and the squared displacement:

$$E_j = \frac{1}{2}\cdot k_j\cdot \delta x_j^2$$

where the displacement is the vertical distance between the mass (data point) and its projection to the bar (see Fig.2).

Figure 2 - Energy-based loss function scheme

The overall energy content of the system is the sum of the contribution of each spring. The system tends to reach equilibrium by minimizing its energy content. The equilibrium is identified by the least energy possible configuration. If every point is equally relevant, the stiffness value is constant for every spring (for simplicity we can set $k_j = 1$ for each $j$). Otherwise, we can force the system either to be extremely precise to represent one point by increasing its associated spring’s stiffness or to ignore one point by reducing the stiffness down to 0.

### 3.2 MLE-based loss function and LSME

The correlation between the output $y_j$ and the input $x_j$ can be formulated as follows:

$$y_j = \theta^T \cdot x_j + \epsilon_j$$

where $\epsilon_j$ is the error related to point $P_j$ that includes either random noise or effects that are not captured by the selected predictors in the $x$ term. If the error values are independently and identically distributed with a normal distribution ($\epsilon_j \mathtt{\sim} N(0, \sigma^2)$), we can state that the density distribution is as follows:

$$p(\epsilon_j) = \frac{1}{\sqrt{2\pi}\cdot \sigma}\cdot \exp^{-\frac{\epsilon_j^2}{2\cdot\sigma^2}}$$

Figure 3 - Likelihood-based loss function scheme

By combining the latter equation with the error-based linear regression expression (1), the output density given inputs and parameter estimation for point $P_j$ is:

$$p(y_j | x_j; \theta) = \frac{1}{\sqrt{2\pi}\cdot \sigma}\cdot \exp^{-\frac{(y_j-\theta^T \cdot x_j)^2}{2\cdot\sigma^2}}$$

The probability of the dataset is given by the combined contributions of every data point. Since inputs and outputs are fixed, we can select the best values of parameter vector $\theta$ that maximize the likelihood of the dataset:

$$\max\limits_{\theta} L(\theta) \quad\text{where}\quad L(\theta) = p(\hat{y} | \hat{x}; \theta)$$

where $\hat{x} \in R^{m, n}$ and $\hat{y} \in R^{m, 1}$ are the matrix and the vector containing the dataset inputs and outputs, $m$ is the number of data points, $n$ is the dimension of the predictors. Since the data points are independent from each other, the likelihood function can be re-written as:

$$L(\theta) = \Pi_{j=1}^{m} p(y_j | x_j; \theta)$$

$$L(\theta) = \Pi_{j=1}^{m} \frac{1}{\sqrt(2\pi)\cdot \sigma}\cdot \exp^{-\frac{(y_j-\theta^T \cdot x_j)^2}{2\cdot\sigma^2}}$$

Maximizing $L$ is equivalent to maximizing $l(\theta) = log(L(\theta))$ due to the strictly increasing behavior of the log function.

$$l(\theta) = log \Pi_{j=1}^{m} \frac{1}{\sqrt{2\pi}\cdot \sigma}\cdot \exp^{-\frac{(y_j-\theta^T \cdot x_j)^2}{2\cdot\sigma^2}}$$

$$\rightarrow = \sum_{j=1}^{m} \log \frac{1}{\sqrt{2\pi}\cdot \sigma}\cdot \exp^{-\frac{(y_j-\theta^T \cdot x_j)^2}{2\cdot\sigma^2}}$$

$$\rightarrow = C - \frac{1}{2\cdot\sigma^2}\sum_{j=1}^{m} (y_j-\theta^T \cdot x_j)^2$$

where $C = m\cdot\log\frac{1}{\sqrt{2\pi}\cdot \sigma}$ is a constant value.

Maximizing $l$ is equivalent to minimizing the loss function obtained with the least-squared error:

$$LSE = \frac{1}{2}\sum_{j=1}^{m} (y_j-\theta^T \cdot x_j)^2$$

Under the previous probabilistic assumptions on the data, least-squared error minimization corresponds to finding the maximum likelihood estimate of $\theta$.

### 3.3 Least squared error

This method aims at finding the curve that represents a set of points at best by minimizing the sum of the squares of the residuals, where the residual is the difference between the data point and the corresponding projection on the identified curve. The error is squared to obtain a continuous and differentiable loss function. This property, however, could lead to undesirable fitting results due to the presence of outliers (points that are far away from the major concentration of data points). The residual is calculated using the vertical projection of the data point to the curve, in order to simplify the formulation.

Gradient descent algorithm is used to optimize the loss function, i.e., to find the parameter values that minimize it. From calculus theory, we exploit that the minimum of a continuous function can be identified by solving the following problem $\nabla_\theta J = 0$. In other words, we need to find the parameter values $\theta$ that let the gradient of the function J to be 0. The gradient is the vector perpendicular to the isolines of the surface described by J. The gradient is the 0-vector at every minimum, maximum or saddle point of the surface. Let us analyse the following example:

$$J = \omega_1^2 + \omega_2^2 + 5$$

This is a paraboloid in a 3D space, or an infinite collection of circles centred in the origin and whose radius is $\sqrt{J-5}$. That implies that the smallest circle has 0 radius for $J=5$. Every circle is obtained by slicing the 3D space with a horizontal plane $J = k$, where $k$ is the height of the plane. The gradient of that function is a vector where every element is the variation of the function with respect to the corresponding parameter, as follows:

$$\nabla_{\theta} J = (\frac{\partial J}{\partial \omega_1}, \frac{\partial J}{\partial \omega_2})$$

In our example, the gradient is:

$$\nabla J = 2\cdot(\omega_1, \omega_2)$$

We draw this vector field in conjunction with the isolines. It is 0 if and only if $(\omega_1, \omega_2) = (0, 0)$. Since any point in the neighbouring area of $(0, 0)$ leads to higher J values, we conclude it is a minimum. Since machine-learning problems cannot be analytically solved, we need to iterate to gradually get closer to the minimum point by changing the parameters.

The gradient descent algorithm changes the parameter vector values at each epoch, k (or iteration), using the gradient direction and magnitude. The actual change in $\theta$ is controlled also by the learning rate $\alpha$, as follows:

$$\theta^{k+1} = \theta^k - \alpha \nabla_{\theta} J(\theta)$$

The parameter update rule can be explicitly written for each element as follows:

$$\theta_j^{k+1} = \theta_j^k - \alpha \frac{\partial J(\theta)}{\partial \theta_j}$$

### 4.1 Gradient field of a 2D surface function

The figure compares the gradient (vector field) to the isolines of the J function. In any point of the chart, the colour gives the intensity of the J function, so that the colder it is, the lower the J value is (see the colorbar in the right-hand side of the chart). The arrow in the same point instead shows the direction of the gradient on which the function increases. The arrow length is related to the J increment, which means a longer arrow indicates a steeper increment of the J surface.

The isolines are shown using the contourf command, while the gradient field with quiver command provided by the Matplotlib library.

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
w1s = np.linspace(-10, 10, 50+1)
w2s = np.linspace(-5, 5, 25+1)
w1mg, w2mg = np.meshgrid(w1s, w2s)
J = w1mg**2 + w2mg**2 + 5

plt.figure(figsize=(15, 6))
plt.contourf(w1mg, w2mg, J, alpha=0.75)
plt.colorbar()
qq = plt.quiver(w1s, w2s, 2*w1mg, 2*w2mg)
plt.xlabel("$\omega_1$")
plt.ylabel("$\omega_2$")
plt.axis('equal')
plt.show()