Recently, I have been reading a new [book]1 by S. Prince titled “Understanding Deep Learning.” While reading it, I made some notes and practiced with concepts that were described in great detail by the author. Having no prior experience in deep learning, I was fascinated by how clearly the author explains the concepts and main terms.

This post is:

  1. a collection of key notes from the first seven chapters of the book that I have found useful;
  2. a numpy-only implementation of a deep neural network with configurable layer sizes, trained using SGD.

General terms

The Universal Approximation Theorem proves that for any continuous function, there exists a network that can approximate this function to any specified precision.

Shallow Neural Networks (SNN) approximate the required function using piecewise linear functions of the input.

The Curse of Dimensionality is a well-known problem for Particle Filters. Neural networks suffer from it as well: as the input dimensionality grows, the number of linear regions increases rapidly, so a much larger number of parameters is needed to obtain a good approximation of the underlying function.

Layers

Neural networks are often described in terms of layers:

  1. input layer;
  2. hidden layer;
  3. output layer.

When data is passed through the neural network, the values at the hidden units before the activation function (e.g., ReLU) is applied are termed pre-activations. The values after the activation function is applied are called activations.

Here is what a [ReLU]2 hidden layer does for a 1-D function in an SNN (a numpy sketch follows after the list):

  1. Computes $D$ linear functions of the input; these are the pre-activations.
  2. ReLU clamps each pre-activation so that its range becomes $[0, +\infty)$; these are the activations.
  3. Each activation is then multiplied by a weight, and all of them are summed together with a bias, which controls the height.
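
To make this concrete, here is a minimal numpy sketch of such a shallow network for a 1-D input; the number of hidden units $D$ and all the weights below are arbitrary values chosen only for illustration:

import numpy as np

D = 3                                    # number of hidden units (illustrative)
theta_0 = np.array([-0.2, 0.4, 0.1])     # offsets of the D linear functions
theta_1 = np.array([1.0, -2.0, 0.5])     # slopes of the D linear functions
phi_0 = 0.3                              # output bias, controls the height
phi = np.array([0.7, -1.5, 2.0])         # weights applied to the activations

def shallow_net(x):
    pre_activations = theta_0 + theta_1 * x           # step 1: D linear functions of the input
    activations = np.maximum(0.0, pre_activations)    # step 2: ReLU clamps them to [0, +inf)
    return phi_0 + phi @ activations                  # step 3: weighted sum plus bias

print(shallow_net(0.5))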

Deep neural networks

Since the approximation power of an SNN depends on the number of hidden units, for some high-dimensional or complex functions the required number of hidden units is impractically large. Deep networks produce many more linear regions for the same number of parameters and, as a result, have better descriptive power. For example, the maximum number of linear regions for an SNN with 6 hidden units is 7, whereas a Deep Neural Network (DNN) consisting of two layers with 3 hidden units each can create up to 9 linear regions.
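
As a rough empirical illustration of the shallow case, the hedged sketch below counts the linear regions of a randomly initialized 1-D SNN with 6 hidden units by detecting where the ReLU on/off pattern changes; with random weights the count is at most $D + 1 = 7$ and may come out lower:

import numpy as np

rng = np.random.default_rng(0)
D = 6
theta_0, theta_1 = rng.normal(size=(D, 1)), rng.normal(size=(D, 1))

x = np.linspace(-3, 3, 20001).reshape(1, -1)
pre_activations = theta_0 + theta_1 @ x         # shape (D, N)
pattern = pre_activations > 0                   # which units are "on" at each x
# a new linear region starts wherever at least one unit switches on or off
changes = np.any(pattern[:, 1:] != pattern[:, :-1], axis=0)
print("linear regions:", changes.sum() + 1)     # at most D + 1 = 7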

Note: Deep networks can create an extremely large number of linear regions, but these contain complex dependencies and symmetries.

Hyperparameters

  • width - the number of hidden units in each layer ($D$);
  • depth - the number of hidden layers ($K$);
  • capacity - the total number of hidden units.

Matrix notation

$$ \begin{split} \mathbf{h}_1 &= \mathbf{a}[\boldsymbol{\beta}_0 + \boldsymbol{\Omega}_0 \mathbf{x}] \\\\ \mathbf{h}_2 &= \mathbf{a}[\boldsymbol{\beta}_1 + \boldsymbol{\Omega}_1 \mathbf{h}_1] \\\\ &~~\vdots \\\\ \mathbf{h}_{K} &= \mathbf{a}[\boldsymbol{\beta}_{K-1} + \boldsymbol{\Omega}_{K-1} \mathbf{h}_{K-1}] \\\\ \mathbf{y} &= \boldsymbol{\beta}_K + \boldsymbol{\Omega}_K \mathbf{h}_{K} \end{split} $$

where $\mathbf{x}$ is the input vector, $\boldsymbol{\beta}_k$ is a bias vector whose size equals the number of hidden units in the layer it feeds into, and $\boldsymbol{\Omega}_k$ is the corresponding weight matrix.

One general equation for the whole neural network is:

$$ \mathbf{y} = f[\mathbf{x}, \boldsymbol{\phi}] , $$

where $\boldsymbol{\phi}$ are the parameters of the model, comprising all the weight matrices and bias vectors $\boldsymbol{\phi} = \{\boldsymbol{\beta}_k, \boldsymbol{\Omega}_k\}_{k=0}^K$.
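
This notation translates almost directly into code. Below is a minimal sketch of the forward pass $\mathbf{y} = f[\mathbf{x}, \boldsymbol{\phi}]$; `betas` and `omegas` are hypothetical lists holding $\{\boldsymbol{\beta}_k, \boldsymbol{\Omega}_k\}_{k=0}^K$, and ReLU plays the role of $\mathbf{a}[\cdot]$:

import numpy as np

def forward(x, betas, omegas):
    # betas[k] and omegas[k] map a layer of width D_k to a layer of width D_{k+1}
    h = x
    for beta, omega in zip(betas[:-1], omegas[:-1]):
        h = np.maximum(0.0, beta + omega @ h)     # h_{k+1} = a[beta_k + omega_k @ h_k]
    return betas[-1] + omegas[-1] @ h             # y = beta_K + omega_K @ h_K

# example with a 2-D input, two hidden layers of width 3 and a 1-D output (arbitrary sizes)
rng = np.random.default_rng(1)
sizes = [2, 3, 3, 1]
betas = [rng.normal(size=(d, 1)) for d in sizes[1:]]
omegas = [rng.normal(size=(d_out, d_in)) for d_in, d_out in zip(sizes[:-1], sizes[1:])]
print(forward(rng.normal(size=(2, 1)), betas, omegas))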

Fitting

The two most popular options for NN training are:

  • Stochastic Gradient Descent (SGD) adds randomness at any given iteration. The mechanism is simple: the algorithm chooses a random subset of data, known as a minibatch, and computes the gradient from this training subset. The batches are usually drawn without replacement. An alternative interpretation of SGD is that it computes the gradient of a different loss function at each iteration; the loss function depends on both the model and the training data and hence will differ for each randomly selected batch. In this view, SGD performs deterministic gradient descent on a constantly changing loss function.

  • Adam (adaptive moment estimation) applies moments, i.e., exponential smoothing, to both the estimated gradient and the normalizing term (the squared gradient). A minimal sketch of both update rules is shown after this list.
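
Here is a minimal sketch of both update rules for a single parameter vector `theta`; `grad(theta, batch)` is a hypothetical function returning the minibatch gradient, and the step sizes and decay rates below are typical values rather than prescribed ones:

import numpy as np

def sgd_step(theta, grad, batch, alpha=0.01):
    # vanilla SGD: step against the gradient computed on the current minibatch
    return theta - alpha * grad(theta, batch)

def adam_step(theta, grad, batch, state, t, alpha=0.001, beta_1=0.9, beta_2=0.999, eps=1e-8):
    g = grad(theta, batch)
    state["m"] = beta_1 * state["m"] + (1 - beta_1) * g          # smoothed gradient
    state["v"] = beta_2 * state["v"] + (1 - beta_2) * g ** 2     # smoothed squared gradient
    m_hat = state["m"] / (1 - beta_1 ** t)                       # bias correction for early steps
    v_hat = state["v"] / (1 - beta_2 ** t)
    return theta - alpha * m_hat / (np.sqrt(v_hat) + eps)        # normalized, smoothed step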

Initialization

Usually the biases are initialized to zero, and the weights are drawn from a normal distribution with zero mean and variance $\sigma_\Omega^2$. He initialization keeps the variance of the pre-activations roughly the same from one layer to the next. To achieve this, the variance is set to:

$$ \sigma_\Omega^2 = \frac{2}{D_h}, $$

where $D_h$ is the dimension of the layer the weights are applied to (the previous layer). The details can be found [here]3.
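
A minimal sketch of He initialization for one layer, with widths chosen only for the check, followed by a quick empirical confirmation that the pre-activation variance is roughly preserved through the layer:

import numpy as np

rng = np.random.default_rng(2)
D_h, D_out = 512, 512                        # layer widths (illustrative)
sigma_sq = 2.0 / D_h                         # He initialization: variance 2 / D_h

omega = rng.normal(0.0, np.sqrt(sigma_sq), size=(D_out, D_h))
beta = np.zeros((D_out, 1))

f_prev = rng.normal(size=(D_h, 1))           # pre-activations of the previous layer
f_next = beta + omega @ np.maximum(0.0, f_prev)
print(np.var(f_prev), np.var(f_next))        # both should be of the same order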

Code example

Let’s consider the following neural network $\mathbf{f}[\mathbf{x}, \boldsymbol{\phi}]$ with input $\mathbf{x}$, parameters $\boldsymbol{\phi}$ and two hidden layers $\mathbf{h}_1, \mathbf{h}_2$:

[figure: the example neural network with two hidden layers $\mathbf{h}_1$ and $\mathbf{h}_2$]

The model parameters are $\boldsymbol{\phi} = \{\boldsymbol{\beta}_0, \boldsymbol{\Omega}_0, \boldsymbol{\beta}_1, \boldsymbol{\Omega}_1, \boldsymbol{\beta}_2, \boldsymbol{\Omega}_2\}$.

Computing derivatives

The derivative tells us how the loss changes if we make a small change to the parameters. Optimization algorithms use this information to update the parameters so that the loss becomes smaller.

Each layer multiplies its weights by the activations coming from the previous layer, so any change to a weight changes the activations of the hidden units that follow. During the forward pass, the activations are stored so that the gradients can be computed later.

Once all the data have moved forward, we can calculate the loss and roll this loss back to each and every parameter in our network. This step is known as the backward pass.

In the following chart, you can see the forward and backward passes for our example neural network. First, we sequentially calculate the actual values of $\mathbf{f}_k$ and $\mathbf{h}_k$, obtaining the loss value at the end.

[figure: forward and backward passes for the example neural network]

In our example, let's use the least-squares loss $\mathcal{L}(\mathbf{f}_2, \mathbf{y}_i) = (\mathbf{f}_2 - \mathbf{y}_i)^2$, whose derivative is well known. Having the numerical value of the derivative of the loss, we can roll it back to all the parameters using the chain rule. For example:

$$ \frac{\partial \mathcal{L}}{\partial\mathbf{h}_1} = \frac{\partial \mathbf{f}_1}{\partial\mathbf{h}_1} \left( \frac{\partial \mathbf{h}_2}{\partial\mathbf{f}_1} \frac{\partial \mathbf{f}_2}{\partial\mathbf{h}_2} \frac{\partial \mathcal{L}}{\partial\mathbf{f}_2} \right). $$

Please note that since matrix multiplication is not commutative, this is the only correct order. The term in brackets is computed in the previous step and can be directly reused.
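
Expressed in code, one such backward step might look like the hedged sketch below; the sizes are illustrative (widths of 4 for the hidden quantities, 1 for the output), and each matrix mirrors one Jacobian from the equation above:

import numpy as np

rng = np.random.default_rng(3)
omega_1, beta_1 = rng.normal(size=(4, 4)), rng.normal(size=(4, 1))
omega_2, beta_2 = rng.normal(size=(1, 4)), rng.normal(size=(1, 1))
h1, y_i = rng.normal(size=(4, 1)), rng.normal(size=(1, 1))

# forward pass for this fragment of the network
f1 = beta_1 + omega_1 @ h1
h2 = np.maximum(0.0, f1)
f2 = beta_2 + omega_2 @ h2

# backward pass, innermost terms first
dloss_df2 = 2 * (f2 - y_i)                              # derivative of the least-squares loss
dh2_df1 = np.diag((f1 > 0).flatten().astype(float))     # derivative of ReLU: 0/1 diagonal mask
bracket = dh2_df1 @ omega_2.T @ dloss_df2               # the reusable term in brackets
dloss_dh1 = omega_1.T @ bracket                         # shape (4, 1), as expected
print(dloss_dh1.flatten())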

Actual code

Now, let’s define our neural network layers and the general neural network class:

import numpy as np

class HiddenLayer:
    def __init__(self, n_units, input_size, learning_rate):
        # He initialization: variance 2 / D_h, where D_h is the width of the incoming layer
        self.omega = np.random.normal(0, np.sqrt(2 / input_size), size=(n_units, input_size))
        self.beta = np.zeros((n_units, 1))
        self.alpha = learning_rate

    def forward(self, input):
        self.h = np.clip(input, 0, None)            # activations: ReLU of the incoming pre-activations
        self.f = self.beta + self.omega @ self.h    # this layer's pre-activations (the output, for the last layer)
        return self.f

    def backward(self, dloss):
        # local derivatives of f = beta + omega @ h with respect to beta, omega and h
        self.df_dbeta = np.eye(len(self.beta))
        self.df_domega = np.copy(self.h.T)
        self.df_dh = np.copy(self.omega.T)

        self.dloss_dbeta = self.df_dbeta @ dloss
        self.dloss_domega = dloss @ self.df_domega
        self.dloss_dh = self.df_dh @ dloss

        # immediate per-sample SGD update
        self.omega -= self.alpha * self.dloss_domega
        self.beta -= self.alpha * self.dloss_dbeta

class InputLayer:
    def __init__(self, n_units, input_size, learning_rate):
        # He initialization with the input dimension as D_h
        self.omega = np.random.normal(0, np.sqrt(2 / input_size), size=(n_units, input_size))
        self.beta = np.zeros((n_units, 1))
        self.alpha = learning_rate

    def forward(self, input):
        self.input = np.asarray(input).reshape((-1, 1))   # column vector
        self.f = self.beta + self.omega @ self.input      # no ReLU is applied to the raw input
        return self.f

    def backward(self, dloss):
        self.df_dbeta = np.eye(len(self.beta))
        self.df_domega = np.copy(self.input.T)

        self.dloss_dbeta = self.df_dbeta @ dloss
        self.dloss_domega = dloss @ self.df_domega

        self.omega -= self.alpha * self.dloss_domega
        self.beta -= self.alpha * self.dloss_dbeta

class NN:
    def __init__(self, input_size, layers_sizes, output_size, learning_rate):
        self.input_layer = InputLayer(layers_sizes[0], input_size, learning_rate)
        self.hidden_layers = []
        for l_input, l_output in zip(layers_sizes[0:], layers_sizes[1:]):
            self.hidden_layers.append(HiddenLayer(l_output, l_input, learning_rate))
        self.output_layer = HiddenLayer(output_size, layers_sizes[-1], learning_rate)

    def forward(self, input):
        o = self.input_layer.forward(input)
        for l in self.hidden_layers:
            o = l.forward(o)
        o = self.output_layer.forward(o)
        return o

    def backward(self, predicted_value, true_value):
        dloss = 2 * (predicted_value - true_value)   # derivative of the least-squares loss
        self.output_layer.backward(dloss)
        dfprev_dhprev = self.output_layer.df_dh
        
        for l in self.hidden_layers[::-1]:
            dhprev_df = np.diag((l.f > 0).flatten()).astype(int)   # ReLU derivative: 0/1 diagonal mask
            dloss = dhprev_df @ dfprev_dhprev @ dloss
            l.backward(dloss)
            dfprev_dhprev = l.df_dh

        dhprev_df = np.diag((self.input_layer.f > 0).flatten()).astype(int)
        dloss = dhprev_df @ dfprev_dhprev @ dloss
        self.input_layer.backward(dloss)

    def train(self, input, true_value):
        p = self.forward(input)
        self.backward(p, true_value)
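
Before training, a quick sanity check of the backward pass can be reassuring. The hedged sketch below assumes the classes above: it sets the learning rate to zero, so that backward stores the gradients without moving the parameters, and compares one analytic gradient against a central finite difference:

# finite-difference check of a single weight gradient (zero learning rate keeps parameters fixed)
nn_check = NN(1, (4, 4), 1, learning_rate=0.0)
x0, y0 = np.array([[0.3]]), np.array([[1.0]])

p = nn_check.forward(x0)
nn_check.backward(p, y0)                          # fills dloss_domega on every layer
analytic = nn_check.input_layer.dloss_domega[0, 0]

eps = 1e-6
nn_check.input_layer.omega[0, 0] += eps
loss_plus = (nn_check.forward(x0) - y0).item() ** 2
nn_check.input_layer.omega[0, 0] -= 2 * eps
loss_minus = (nn_check.forward(x0) - y0).item() ** 2

numeric = (loss_plus - loss_minus) / (2 * eps)
print(analytic, numeric)                          # should agree closely (away from ReLU kinks)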

It’s time to define our first function $f(x) = \sin(10 x) + 0.75$ and train our NN to approximate it:

import numpy as np
import matplotlib.pylab as plt
%matplotlib inline
%config InlineBackend.figure_format='retina'

# our function that will be approximated by our NN
f = lambda x: np.sin(x * 10) + 0.75

# NN structure - 3 hidden layers, each having 24 units; input and output are 1-D
learning_rate = 0.01
nn = NN(1, (24, 24, 24), 1, learning_rate)

# generate 10000 samples and train NN
for b in range(0, 10000):
    x = np.random.uniform(-1, 1, size = (1, 1))
    y = f(*x)
    nn.train(x, y)

# now let's get true values and predictions for some range
x = np.linspace(-1, 1, 100)
y_predict = np.zeros_like(x)
for i, xx in enumerate(x):
    y_predict[i] = nn.forward(xx).item()   # forward returns a (1, 1) array

# visualize 
fig, ax = plt.subplots(1, 1, figsize = (6, 3))
ax.plot(x, f(x), label = "True function")
ax.plot(x, y_predict, label = "NN approximation")
ax.set(xlabel='$x$', ylabel='$y$')
ax.grid()
ax.legend(loc='upper left')

[plot: the true 1-D function and the NN approximation]

Let’s do the same for the second function, which is 2d:

# our function that will be approximated by our NN
f = lambda x, y: 25 * x + 17.8 * y + 0.25 * x * y + x * x * 15.0 - y * y * 5.0 - 25 * (x < 0.0 and y < 0.0)

# NN structure - 2 hidden layers with 12 and 24 units; input is 2-D, output is 1-D
learning_rate = 0.001
nn = NN(2, (12, 24), 1, learning_rate)

# generate 10000 samples and train NN
for b in range(0, 10000):
    x = np.random.uniform(-1, 1, size = (2, 1))
    y = f(*x)
    nn.train(x, y)

# now let's get true values and predictions for some range
x, y = np.arange(-1, 1, 0.05), np.arange(-1, 1, 0.05)
y_predict = np.zeros((x.shape[0], y.shape[0]))
y_true = np.zeros((x.shape[0], y.shape[0]))
for i, xx in enumerate(x):
    for j, yy in enumerate(y):
        y_predict[i, j] = nn.forward([xx, yy]).item()
        y_true[i, j] = f(xx, yy)

# visualize 
fig, ax = plt.subplots(1, 2, figsize = (6, 3))
# transpose so that x runs along the horizontal axis and y along the vertical axis
ax[0].imshow(y_true.T, extent=[x.min(), x.max(), y.min(), y.max()], origin="lower", cmap='cool', vmin=np.min(y_true), vmax=np.max(y_true))
ax[0].set(xlabel='$x$', ylabel='$y$')
ax[0].set_title("True function")
ax[1].imshow(y_predict.T, extent=[x.min(), x.max(), y.min(), y.max()], origin="lower", cmap='cool', vmin=np.min(y_true), vmax=np.max(y_true))
ax[1].set(xlabel='$x$', ylabel='$y$')
ax[1].set_title("NN approximation")
fig.tight_layout()

[plot: the true 2-D function and the NN approximation side by side]

Bingo!

A great [video]4 on how neural networks learn to approximate any function is also worth watching.

References