1. Introduction

This post belongs to a new series dedicated to a huge and popular topic in machine learning: fully connected neural networks.

The scope of the series is threefold:

  1. visualize the model features and characteristics with schematic pictures and charts
  2. learn to implement the model with different levels of abstraction, given by the framework used
  3. have some fun with one of the hottest topics right now!

In this new post, we are going to analyze the hyperparameter (HP) space for a multi-class classification problem in Keras. For each of the following pairs of HPs, we study their combined impact on the model loss:

  1. activation function and hidden layer size
  2. activation function and network depth
  3. hidden layer size and network depth (full network size)
  4. optimizer and learning rate
  5. optimizer and batch size
  6. optimizer and hidden layer size

The whole code to create a synthetic dataset and train a neural network model with any of the four libraries used in this series (scikit-learn, Keras, TensorFlow and PyTorch) is wrapped into a Python class, trainFCNN(), which can be found in my Github repo.
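As a rough usage sketch, the same problem can be trained with each backend as below. The argument names are taken from the snippets later in this post, while the module path and the mapping of the abbreviations are my assumptions.

from trainFCNN import trainFCNN  # assumed module path in the repo

tnn = trainFCNN(nb_pnt=2500, dataset='circles')
for lib in ['sk', 'ks', 'tf', 'pt']:  # assumed: scikit-learn, Keras, TensorFlow, PyTorch
    tnn.train(nb_epochs=100, dims=[8], activation='relu', lib=lib)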

2. Hyperparameters

Here we define the list of HPs that usually have an impact on the model performance:

  1. nb_hidNeurons: the number of neurons in hidden layer j, specified as the integer at index j of the dims attribute (e.g., dims=[16, 16] means two hidden layers of 16 neurons each).
  2. depths: the number of hidden layers, i.e., the length of dims.
  3. optimizers: the optimization algorithm, which can be any of SGD, Adam, RMSProp or AdaGrad.
  4. learnRates: the learning rate.
  5. activations: the activation function, which can be any of sigmoid, relu or tanh.
  6. batchSizes: the batch size.

Here is the list of HPs with the corresponding sets of values used for the analysis.

activations = ['sigmoid', 'tanh', 'relu']
optimizers = ['sgd', 'adam', 'rmsprop', 'adagrad']
learnRates = [1, 5, 10]  # scaled by 1e-3 when passed to train()
nb_hidNeurons = [2, 4, 8, 16]
depths = [1, 2, 3]
epochs = [50, 150, 250]
batchSizes = [50, 100, 250]
libs = ['sk', 'ks', 'tf', 'pt']  # scikit-learn, Keras, TensorFlow, PyTorch
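The plotting snippets below also rely on a few standard imports and on two helper lists, colors and markers, which are not defined in this post. A minimal setup that makes them runnable could look like this (the specific colors and line styles are my own choice):

import itertools
from copy import deepcopy

import matplotlib.pyplot as plt

# assumed helpers: one color per value of the first HP, one line style per value of the second
colors = ['b', 'g', 'r', 'c', 'm', 'y']
markers = ['-', '--', ':', '-.']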

Create the tnn instance for the circles dataset.

tnn = trainFCNN(nb_pnt=2500, dataset='circles')
tnn.plotPoints()

[Figure: scatter plot of the circles dataset produced by tnn.plotPoints()]

3. Activation and hidden layer size

The first comparison concerns the impact of the hidden layer size and the activation function. In general, when the layer is not large enough, the optimizer can get stuck (the loss stays almost constant). The activation function can help either to reach a better result or to reach the same result faster, as is the case for relu (red lines) with respect to tanh (green) and sigmoid (blue).
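For reference, the three activations compared here are the standard ones; a minimal NumPy sketch (not taken from trainFCNN) is:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))  # saturates for large |x|, so gradients can vanish

def tanh(x):
    return np.tanh(x)            # zero-centered, but it also saturates

def relu(x):
    return np.maximum(0, x)      # gradient is 1 for x > 0, so it does not saturate

The non-saturating gradient of relu is the usual explanation for its faster convergence in plots like the one below.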

hp1s = activations
hp2s = nb_hidNeurons
Nhp2 = len(hp2s)
mdls = []
for hp1, hp2 in itertools.product(hp1s, hp2s):
    tnn.train(nb_epochs=100, dims=[hp2], activation=hp1, lib='ks')
    mdls.append(deepcopy(tnn))
descrKeys = ['units', 'act']
plt.figure(figsize=(15, 8))
for kk, tnn in enumerate(mdls):
    col, mark = colors[kk // Nhp2], markers[kk % Nhp2]
    plt.plot(tnn.lossHistory, label=tnn.mdlDescription(descrKeys), lw=2, ls=mark, color=col, alpha=.75)
plt.grid()
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.xlabel('# of epochs')
plt.ylabel('model loss')
plt.show()

[Figure: model loss vs. number of epochs for each (activation, hidden layer size) combination]

4. Activation and NN depth

The next comparison concerns the impact of the network depth and the activation function. Again, when the network is not large enough, the optimizer can get stuck (the loss stays almost constant). The activation function can help either to reach a better result or to reach the same result faster, as is the case for relu (red lines) and tanh (green) with respect to sigmoid (blue).

A nice result also concerns the network depth: the deeper the network, the worse the final loss. However, this outcome is not general; it is just the effect of combining this specific architecture with other non-optimized HPs, such as the learning rate and the layer size.
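To get a feeling for what the full network size means here, we can count the weights of each configuration. This is a small illustrative helper that assumes 2 input features and, for the sake of the example, 2 output classes (the actual class may use different sizes):

def count_params(dims, n_in=2, n_out=2):
    # weights + biases of a fully connected net with layer sizes n_in -> dims -> n_out
    sizes = [n_in] + list(dims) + [n_out]
    return sum(a * b + b for a, b in zip(sizes[:-1], sizes[1:]))

for depth in depths:  # depths = [1, 2, 3]
    print(depth, count_params([4] * depth))  # 22, 42, 62 parameters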

hp1s = activations
hp2s = depths
Nhp2 = len(hp2s)
mdls = []
for hp1, hp2 in itertools.product(hp1s, hp2s):
    tnn.train(nb_epochs=150, dims=[4]*hp2, activation=hp1, lib='ks', lr=0.01)
    mdls.append(deepcopy(tnn))
descrKeys = ['units', 'act']
plt.figure(figsize=(15, 8))
for kk, tnn in enumerate(mdls):
    col, mark = colors[kk // Nhp2], markers[kk % Nhp2]
    plt.plot(tnn.lossHistory, label=tnn.mdlDescription(descrKeys), lw=2, ls=mark, color=col, alpha=.75)
plt.grid()
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.xlabel('# of epochs')
plt.ylabel('model loss')
plt.show()

[Figure: model loss vs. number of epochs for each (activation, depth) combination, 4 neurons per hidden layer]

Let’s see what happens for a larger layer size, say 8 neurons per hidden layer. With relu, the deepest and shallowest networks converge to the same result. With tanh, the mid-depth net ranks better than the shallow one, while the deep one gets stuck. With sigmoid, the deep net outperforms the other two by a wide margin.

mdls = []
for hp1, hp2 in itertools.product(hp1s, hp2s):
    tnn.train(nb_epochs=150, dims=[8]*hp2, activation=hp1, lib='ks', lr=0.01)
    mdls.append(deepcopy(tnn))
plt.figure(figsize=(15, 8))
for kk, tnn in enumerate(mdls):
    col, mark = colors[kk // Nhp2], markers[kk % Nhp2]
    plt.plot(tnn.lossHistory, label=tnn.mdlDescription(descrKeys), lw=2, ls=mark, color=col, alpha=.75)
plt.grid()
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.xlabel('# of epochs')
plt.ylabel('model loss')
plt.show()

[Figure: model loss vs. number of epochs for each (activation, depth) combination, 8 neurons per hidden layer]

5. Hidden layer size and NN depth

The next comparison concerns the combined impact of the network depth and the hidden layer size (i.e., the full network size).

The main result is that larger networks tend to converge faster to the optimal solution, while the smallest ones may not converge at all.

hp1s = nb_hidNeurons
hp2s = depths
Nhp2 = len(hp2s)
mdls = []
for hp1, hp2 in itertools.product(hp1s, hp2s):
    tnn.train(nb_epochs=150, dims=[hp1]*hp2, activation='relu', lib='ks')
    mdls.append(deepcopy(tnn))
descrKeys = ['units']
plt.figure(figsize=(15, 8))
for kk, tnn in enumerate(mdls):
    col, mark = colors[kk // Nhp2], markers[kk % Nhp2]
    plt.plot(tnn.lossHistory, label=tnn.mdlDescription(descrKeys), lw=2, ls=mark, color=col, alpha=.75)
plt.grid()
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.xlabel('# of epochs')
plt.ylabel('model loss')
plt.show()

[Figure: model loss vs. number of epochs for each (hidden layer size, depth) combination, relu activation]

6. Optimizer and learning rate

The next comparison concerns the impact of the optimizer and the learning rate. Note that the learnRates values are scaled by 1e-3 in the call below, so the actual rates tried are 1e-3, 5e-3 and 1e-2.

The main result is that Adam helps the network converge faster, and that the learning rate requires a tradeoff: too small and convergence is slow, too large and the loss oscillates, so an intermediate value reaches the optimal solution earlier.
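Inside the class, the opt string and the lr value presumably end up in a Keras optimizer; one plausible mapping (an illustration, not the actual trainFCNN code) is:

from tensorflow import keras

def make_optimizer(opt, lr):
    # map the optimizer names used in this post to Keras optimizer classes
    opts = {'sgd': keras.optimizers.SGD,
            'adam': keras.optimizers.Adam,
            'rmsprop': keras.optimizers.RMSprop,
            'adagrad': keras.optimizers.Adagrad}
    return opts[opt](learning_rate=lr)

# e.g. make_optimizer('adam', 5e-3) for the middle learning rate tried below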

hp1s = optimizers
hp2s = learnRates
Nhp2 = len(hp2s)
mdls = []
for hp1, hp2 in itertools.product(hp1s, hp2s):
    tnn.train(nb_epochs=150, dims=[16, 16], activation='relu', opt=hp1, lr=hp2*1e-3, lib='ks')
    mdls.append(deepcopy(tnn))
descrKeys = ['units', 'lib', 'optimizer', 'learning_rate']
plt.figure(figsize=(15, 8))
for kk, tnn in enumerate(mdls):
    col, mark = colors[kk // Nhp2], markers[kk % Nhp2]
    plt.plot(tnn.lossHistory, label=tnn.mdlDescription(descrKeys), lw=2, ls=mark, color=col, alpha=.75)
plt.grid()
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.xlabel('# of epochs')
plt.ylabel('model loss')
plt.show()

[Figure: model loss vs. number of epochs for each (optimizer, learning rate) combination]

7. Optimizer and batch size

The next comparison concerns the impact of the optimizer and the batch size.

The main result is that the batch size should not be too large. This outcome makes sense, since a very large batch size reduces both the stochasticity of the gradient updates and the number of updates performed per epoch.
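The snippet below scales nb_epochs inversely with the batch size, so that nb_epochs * batchSize stays at 150 * 50 = 7500. Here is the arithmetic behind that choice, assuming the 2500-point dataset created earlier:

import math

N = 2500  # nb_pnt used for the circles dataset
for bs in batchSizes:  # [50, 100, 250]
    nb_epochs = int(150 * batchSizes[0] / bs)  # 150, 75, 30
    updates_per_epoch = math.ceil(N / bs)      # 50, 25, 10
    print(bs, nb_epochs, updates_per_epoch, nb_epochs * updates_per_epoch)  # totals: 7500, 1875, 300

Note that with this scaling the larger batches also perform far fewer gradient updates overall.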

hp1s = optimizers
hp2s = batchSizes
Nhp2 = len(hp2s)
mdls = []
for hp1, hp2 in itertools.product(hp1s, hp2s):
    tnn.train(nb_epochs=int(150*batchSizes[0]/hp2), dims=[16, 16], activation='relu', opt=hp1, batchSize=hp2, lib='ks')
    mdls.append(deepcopy(tnn))
descrKeys = ['optimizer', 'epochs', 'batchSize']
plt.figure(figsize=(15, 8))
for kk, tnn in enumerate(mdls):
    col, mark = colors[kk // Nhp2], markers[kk % Nhp2]
    plt.plot(tnn.lossHistory, label=tnn.mdlDescription(descrKeys), lw=2, ls=mark, color=col, alpha=.75)
plt.grid()
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.xlabel('# of epochs')
plt.ylabel('model loss')
plt.show()

[Figure: model loss vs. number of epochs for each (optimizer, batch size) combination]

8. Optimizer and hidden layer size

The next comparison concerns the impact of the optimizer and the hidden layer size. The main result is that a larger hidden layer speeds up the model convergence.

hp1s = optimizers
hp2s = nb_hidNeurons
Nhp2 = len(hp2s)
mdls = []
for hp1, hp2 in itertools.product(hp1s, hp2s):
    tnn.train(nb_epochs=100, dims=[hp2], activation='relu', opt=hp1, lib='ks')
    mdls.append(deepcopy(tnn))
descrKeys = ['units', 'optimizer']
plt.figure(figsize=(15, 8))
for kk, tnn in enumerate(mdls):
    col, mark = colors[kk // Nhp2], markers[kk % Nhp2]
    plt.plot(tnn.lossHistory, label=tnn.mdlDescription(descrKeys), lw=2, ls=mark, color=col, alpha=.75)
plt.grid()
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.xlabel('# of epochs')
plt.ylabel('model loss')
plt.show()

[Figure: model loss vs. number of epochs for each (optimizer, hidden layer size) combination]