Notes on backpropagation

In optimization and machine learning applications, gradient descent is the widely used tool for finding model parameters. It allows us to find the minimum (or maximum) of a target function w.r.t. the parameters, in other words, to minimize the discrepancy between the model and the data.

However, to use this method the gradient has to be computed. The first problem is that if our function has a complex form, differentiating it is quite tricky. The second is that gradient calculations might be computationally demanding, especially if we have a large number of parameters: we have to find the gradient w.r.t. each of them. To ease the calculations and improve their efficiency, the backpropagation algorithm can be used. From an algorithmic point of view it allows us to:

  • split the differentiation into simple computational steps;
  • reuse the intermediate calculation results.

Chain rule: example

Let’s start with an example. We have the following function:

$$ f(x, \boldsymbol{w}, y) = \left(w_0 + w_1 x - y\right)^2 $$

To minimize it, we need to follow the negative of the gradient (since the gradient points in the direction of growth).
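
In other words, at every step the parameters are updated with the standard gradient-descent rule, where $\eta$ is the step size:

$$ \boldsymbol{w} \leftarrow \boldsymbol{w} - \eta \, \nabla_{\boldsymbol{w}} f(x, \boldsymbol{w}, y). $$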

To do so we have to find its derivatives with respect to the vector $\boldsymbol{w}$. In this case it is possible to differentiate directly, but for educational purposes let’s utilize the chain rule to compute the gradient of $f$ w.r.t. $\boldsymbol{w}$. We first split the original expression $f$ into elementary operations:

$$ \begin{align} &a(w_1, x) &&= w_1 x ,&& D_{w_1}a = \frac{\partial a}{\partial w_1} = x & D_{x}a=\frac{\partial a}{\partial x} = w_1\\\\ &b(a, w_0) &&= a + w_0 ,&& D_{a}b = \frac{\partial b}{\partial a} = 1 & D_{w_0}b=\frac{\partial b}{\partial w_0} = 1\\\\ &c(b, y) &&= b - y ,&& D_{b}c = \frac{\partial c}{\partial b} = 1 & D_{y}c=\frac{\partial c}{\partial y} = -1\\\\ &d(c) &&= c^2 ,&& D_{c}d = \frac{\partial d}{\partial c} = 2c \end{align} $$

and our target function can be written as the composition of elementary functions:

$$ f(x, \boldsymbol{w}, y) = d\left(c\left(b\left(a\left(w_1, x\right), w_0\right), y\right)\right). $$

Their derivatives are known, and the chaining of the gradient expressions can be done using multiplication:

$$ \frac{\partial f}{\partial w_1} = \frac{\partial f}{\partial d} \frac{\partial d}{\partial c} \frac{\partial c}{\partial b} \frac{\partial b}{\partial a} \frac{\partial a}{\partial w_1}, $$

noting that $d$ and $f$ are the same function, so $\frac{\partial f}{\partial d} = 1$, and expanding:

$$ \frac{\partial f}{\partial w_1} = \frac{\partial d}{\partial c} \frac{\partial c}{\partial b} \frac{\partial b}{\partial a} \frac{\partial a}{\partial w_1} = 2 c x = 2 (b - y) x = 2 (w_0 + w_1x - y) x $$
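
If sympy is available (an assumption, it is not used anywhere else in these notes), the hand-derived expression can be cross-checked symbolically:

import sympy as sp

w0, w1, x, y = sp.symbols('w0 w1 x y')
f = (w0 + w1 * x - y) ** 2

# the difference between sympy's derivative and the hand-derived expression
# simplifies to zero, confirming df/dw1 = 2 * (w0 + w1*x - y) * x
print(sp.simplify(sp.diff(f, w1) - 2 * (w0 + w1 * x - y) * x))  # 0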

Computational graph

The visual representation of the computation sequence is shown in the following graph, where:

  • ➡️ the forward pass (🟢 - green color) computes the output values from the inputs;
  • ⬅️ the backward pass (🔴 - red color) uses the chain rule to compute the gradients; on the diagram we show the numeric values of the negative gradient.

(Figure: computational graph for $f(x, \boldsymbol{w}, y) = (w_0 + w_1 x - y)^2$, with forward-pass values in green and negative gradients in red.)

After completion of the forward pass we have the inputs and outputs of all the nodes (shown in green). Now we want to compute the gradient, which shows the sensitivity of the target function $f$ to the parameters $\{w_0, w_1\}$. Since the local gradients of every node are also known, we can compute the gradient of the target function w.r.t. any node by simple multiplication. For example, if we want to compute $\frac{\partial f}{\partial c}$ we note that we know the input $c = -3$, the derivative of $c^2$, and $\frac{\partial f}{\partial d} = 1$. Chaining the results we obtain:

$$ \frac{\partial f}{\partial c} = \frac{\partial f}{\partial d}\frac{\partial d}{\partial c} = 1 \cdot 2 \cdot c = 1 \cdot 2 \cdot (-3) = -6. $$

After performing this step we know how much a change in node $c$ affects node $d$: that is exactly the information given by the derivative. The gradient shows the direction in which a function grows fastest, but during the optimization we want to step towards its minimum, so we move in the direction of the negative gradient. Let’s look at the numbers: $d$ wants $c$ to increase with a force of $6$ (we negate the numbers because we are looking for the minimum). Does it make sense? Yes: the current value of $c$ is $-3$, and if it becomes higher, $d$ decreases. Continuing this logic, $y$, for example, should decrease (since $-\frac{\partial f}{\partial y} = -6$) when the other parameters are fixed. This also makes sense: if $y$ increases, it makes our target $f = (b - y)^2~|~b = 4$ larger, while what we want to achieve is the opposite: we want to minimize $f$.
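
A quick numeric check of this intuition (a standalone snippet, separate from the code below):

def f(w0, w1, x, y):
    return (w0 + w1 * x - y) ** 2

w0, w1, x, y = -2, 2, 3, 7
print(f(w0, w1, x, y))          # 9, the current value of the target
print(f(w0, w1, x, y + 0.1))    # larger: increasing y increases f
print(f(w0, w1, x, y - 0.1))    # smaller: stepping along -df/dy decreases f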

A nice and short explanation of the gradient-flow process is given in [1]:

To compute the gradient of some scalar $z$ with respect to one of its ancestors $x$ in the graph, we begin by observing that the gradient with respect to $z$ is equal to 1. We can compute the gradient with respect to each parent of $z$ in the graph by multiplying the current gradient by the Jacobian of the operation that produced $z$. We continue multiplying by Jacobians, traveling backward through the graph in this way until we reach $x$.

The forward and backward passes for this case can be written in a few lines of code:

x, y = 3, 7
w0, w1 = -2, 2

# forward pass
a = x * w1
b = a + w0
c = b - y
d = c ** 2

# backward pass (chain rule, starting from df/dd = 1)
df_dc = 2 * c          # d = c**2   -> dd/dc = 2*c
df_db = df_dc          # c = b - y  -> dc/db = 1
df_da = df_db          # b = a + w0 -> db/da = 1
df_dw1 = df_da * x     # a = w1 * x -> da/dw1 = x
df_dw0 = df_db         # b = a + w0 -> db/dw0 = 1

print(f'df/dw0 = {df_dw0}, df/dw1 = {df_dw1}')
df/dw0 = -6, df/dw1 = -18

Note that even in such a simple case some intermediate results are reusable: we do not compute anything twice (for example, $\frac{\partial f}{\partial c}$ is computed once and then reused as a cached value).

Patterns

According to [2], it is useful to keep in mind an intuitive interpretation of some operations:

  • add - takes the upstream gradient and distributes it equally to all of its inputs;
  • max - routes the upstream gradient unchanged to the input with the largest value (only that one);
  • multiply - the local gradient w.r.t. each input is the value of the other input, so the upstream gradient is routed to each input scaled by the other input’s value (see the sketch below).
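
As a quick numeric illustration of these patterns (a standalone sketch, independent of the classes defined below), with dz denoting the upstream gradient arriving at a node:

dz = 2.0

# add: f = x + y -> both inputs receive dz unchanged
dx_add, dy_add = dz, dz

# max: f = max(x, y) -> only the larger input receives dz, the other gets zero
x, y = 3.0, 5.0
dx_max, dy_max = (dz, 0.0) if x > y else (0.0, dz)

# multiply: f = x * y -> each input receives dz scaled by the other input
dx_mul, dy_mul = y * dz, x * dz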

Implementation

Thinking about the actual implementation we note that to calculate the final gradients we need to:

  • correctly define the forward and backward traversal order, which can be done using a graph;
  • store the outputs of the nodes, because the local gradients depend on their values;
  • know the derivative of each node w.r.t. its inputs.

First, let’s define the nodes that perform the calculations, store the results and can calculate the gradient of the output w.r.t. the input:

from abc import ABC, abstractmethod
from numpy import exp

class Operation(ABC):
    """Base class for node operations.
    
    On forward pass the inherited class calculates the output of the 
    node and nulls the local gradient for future calculations.
    
    """
    
    def __init__(self, inputs):
        self.inputs = inputs
        self.output = None
        self.local_grad = 0.
        
    def forward(self, *args):
        """Calculates the node output given inputs"""
        self.local_grad = 0.
        self._forward(*args)
        
    @abstractmethod
    def _forward(self, *inputs):
        pass
    
    @abstractmethod
    def backward(self, dz):
        """Returns the local gradient."""
        pass


class Multiply(Operation):
    def _forward(self, x, y):
        self.x = x
        self.y = y
        self.output = x * y
        return self.output
    
    def backward(self, dz):
        dx = self.y * dz
        dy = self.x * dz
        return [dx, dy]


class Add(Operation):
    def _forward(self, x, y):
        self.output = x + y
        return self.output
    
    def backward(self, dz):
        return [dz, dz]


class Sub(Operation):
    def _forward(self, x, y):
        self.output = x - y
        return self.output
    
    def backward(self, dz):
        return [dz, -dz]


class Square(Operation):
    def _forward(self, x):
        self.x = x
        self.output = x * x
        return self.output
    
    def backward(self, dz):
        return [2 * self.x * dz]


class Sigmoid(Operation):
    def _forward(self, x):
        self.output = 1. / (1. + exp(-x))
        return self.output
    
    def backward(self, dz):
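        # sigma'(x) = sigma(x) * (1 - sigma(x)); we reuse the stored forward output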
        return [(1 - self.output) * self.output * dz]


class Div(Operation):
    def _forward(self, a, b):
        self.a = a
        self.b = b
        self.output = a / b
        return self.output
    
    def backward(self, dz):
        return [1 / self.b * dz, - self.a / self.b**2 * dz]


class Inv(Operation):
    def _forward(self, a):
        self.a = a
        self.output = 1 / a
        return self.output
    def backward(self, dz):
        return [-1. / self.a**2 * dz]
    
class Input(object):
    """Class for input values"""
    def __init__(self, value):
        self.output = value
        self.local_grad = 0.0
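
Each node can also be exercised on its own. As a quick sanity check (not required for what follows), a Multiply node stores its inputs on the forward pass and returns the local gradients scaled by the upstream gradient dz on the backward pass:

mul = Multiply(inputs=('w1', 'x'))
mul.forward(2., 3.)            # stores the operands and computes output = 6.0
print(mul.output)              # 6.0
print(mul.backward(dz=1.))     # [3.0, 2.0] -> d(xy)/dx = y, d(xy)/dy = x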

Now let’s define the ComputationalGraph class, which is responsible for chaining the nodes and running the forward and backward passes over the graph:

class ComputationalGraph:
    """Computational graph: stores computing nodes, gradients and inputs.
    
    The computational graph is used to calculate the output of each node
    on forward pass and to calculate the local gradients of each node with
    respect to the node inputs.
    """
    
    def __init__(self):
        self.operations = {}
        self.inputs = {}
        self.nodes = {}
        
    def add_inputs(self, **kwargs):
        for name, value in kwargs.items():
            self.inputs[name] = Input(value)
            self.nodes[name] = self.inputs[name]
            
    def add_computation_node(self, name, operation, inputs):
        self.operations[name] = operation(inputs)
        self.nodes[name] = self.operations[name]
        
    def _get_traverse_order(self, root):
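        # Breadth-first walk from the root. For the graphs used in this post
        # (every operation feeds exactly one consumer) each operation appears
        # after its consumer, which is the order backward() relies on.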
        route = []
        queue = [root]
        while len(queue):
            prev = queue.pop(0)
            if prev not in self.inputs:
                route.append(prev)
            if prev in self.operations:
                queue.extend(self.operations[prev].inputs)
        return route
    
    def forward(self, root):
        for name in reversed(self._get_traverse_order(root)):
            node = self.nodes[name]
            node.forward(*[self.nodes[i].output for i in node.inputs])
            
        for inp in self.inputs:
            self.inputs[inp].local_grad = 0.0
            
    def backward(self, root):
        self.operations[root].local_grad = 1.
        for name in self._get_traverse_order(root):
            local_grad = self.nodes[name].local_grad
            inputs = self.nodes[name].inputs
            gradients_wrt_input = self.nodes[name].backward(local_grad)
            # roll back to inputs
            for parent, g in zip(inputs, gradients_wrt_input):
                self.nodes[parent].local_grad += g
                
    def get_gradient(self, node):
        if node in self.nodes:
            return self.nodes[node].local_grad

Now we can do the real job: let’s calculate the gradient of the following function w.r.t. its inputs, $x$ and $y$:

$$ f(x, y) = \frac{x + \sigma(y)}{\sigma(x) + (x + y)^2}. $$
cg = ComputationalGraph()
cg.add_inputs(x = 2., y = -4.)
cg.add_computation_node('sigy', Sigmoid, ('y',))
cg.add_computation_node('num', Add, ('x', 'sigy'))
cg.add_computation_node('sigx', Sigmoid, ('x',))
cg.add_computation_node('xpy', Add, ('x', 'y'))
cg.add_computation_node('sq', Square, ('xpy',))
cg.add_computation_node('denom', Add, ('sigx', 'sq'))
cg.add_computation_node('f', Div, ('num', 'denom'))

cg.forward('f')
cg.backward('f')

dx = cg.get_gradient('x')
dy = cg.get_gradient('y')
print(f'df/dx = {dx}, df/dy = {dy}')
df/dx = 0.5348320870757595, df/dy = 0.342460382923052
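
These numbers can be cross-checked with a simple finite-difference approximation (a standalone sketch using only the standard library):

from math import exp

def f(x, y):
    def sig(t):
        return 1. / (1. + exp(-t))
    return (x + sig(y)) / (sig(x) + (x + y) ** 2)

eps = 1e-6
x0, y0 = 2., -4.
print((f(x0 + eps, y0) - f(x0 - eps, y0)) / (2 * eps))  # ~0.53483
print((f(x0, y0 + eps) - f(x0, y0 - eps)) / (2 * eps))  # ~0.34246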

And verify the result on the example that we studied at the beginning:

$$ f(x, \boldsymbol{w}, y) = \left(w_0 + w_1 x - y\right)^2. $$
cg = ComputationalGraph()
cg.add_inputs(x = 3., y = 7., w0 = -2, w1 = 2)
cg.add_computation_node('w1_times_x', Multiply, ('w1', 'x'))
cg.add_computation_node('prediction', Add, ('w0', 'w1_times_x'))
cg.add_computation_node('l', Sub, ('prediction', 'y'))
cg.add_computation_node('f', Square, ('l',))

cg.forward('f')
cg.backward('f')

dw0 = cg.get_gradient('w0')
dw1 = cg.get_gradient('w1')
print(f'df/dw0 = {dw0}, df/dw1 = {dw1}')
df/dw0 = -6.0, df/dw1 = -18.0

The result is exactly the same. As a final step, let’s use this scheme to find the minimum of the following function:

$$ f(x, y) = x^2 + y^2 + 400x + 8y $$
cg = ComputationalGraph()
cg.add_inputs(x = 0., y = 0., a = 400., b = 8.)
cg.add_computation_node('sqx', Square, ('x',))
cg.add_computation_node('sqy', Square, ('y',))
cg.add_computation_node('ax', Multiply, ('a', 'x'))
cg.add_computation_node('by', Multiply, ('b', 'y'))
cg.add_computation_node('sqxpsqy', Add, ('sqx', 'sqy'))
cg.add_computation_node('axpby', Add, ('ax', 'by'))
cg.add_computation_node('f', Add, ('sqxpsqy', 'axpby'))

for i in range(0, 50):
    cg.forward('f')
    cg.backward('f')

    dx = cg.get_gradient('x')
    dy = cg.get_gradient('y')
    cg.inputs['x'].output -= dx * 0.1
    cg.inputs['y'].output -= dy * 0.1
    
print('Minimum:', cg.inputs['x'].output, cg.inputs['y'].output)
Minimum: -199.9971455046146 -3.9999429100922916
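
This agrees with the analytic minimum: setting the gradient to zero gives $2x + 400 = 0$ and $2y + 8 = 0$, i.e. $x = -200$ and $y = -4$.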

All done! The minimum was found correctly; it is time to assess it visually.

import matplotlib.pylab as plt
import numpy as np
%matplotlib inline
%config InlineBackend.figure_format='retina'

x, y = np.meshgrid(np.arange(-800, 800), np.arange(-400, 400))
z = x**2 + y**2 + 400*x + 8*y 
fig, ax = plt.subplots(1, 1, figsize = (6, 3))
ax.imshow(z, extent=[x.min(), x.max(), y.min(), y.max()], origin="lower", cmap='cool')
ax.scatter([cg.inputs['x'].output], [cg.inputs['y'].output], c = 'red', label='Min')
ax.set(xlabel='$x$', ylabel='$y$')
ax.legend()

(Figure: heatmap of $f(x, y)$ with the found minimum marked in red.)

The code is available here: https://github.com/mikoff/blog_projects/tree/master/auto_differentation

References