Backpropagation for dummies

Backpropagation is a method of training an Artificial Neural Network. If you are reading this post, you already have an idea of what an ANN is. However, lets take a look at the fundamental component of an ANN- the artificial neuron.

The figure shows the working of the ith neuron (lets call it $N_i$) in an ANN. Its output (also called its activation) is $a_i$. The jth neuron $N_j$ provides one of the inputs to $N_i$.

How does $N_i$ produce its own activation?

1. $N_i$ has stored in itself, a weightage assigned to each of its inputs. Lets say that the weightage assigned by $N_i$ to $N_j$ is $w_{i, j}$. As the first step, $N_i$ computes a weighted sum of all its inputs. Lets call it $in_i$. Therefore,

$in_i = \sum_{k \in Inputs(N_i)} w_{i, k}a_k$ … (1)

where $Inputs(N_i)$ corresponds to all neurons that provide an input to $N_i$.

2. This sum, called the input function, is then passed on to $N_i$‘s activation function. It is called so, because it computes the activation that will be provided by $N_i$ to the units it in turn acts as an input to. Therefore,

$a_i = g(in_i)$ … (2)

Usually, the activation functions of all neurons in an ANN are the same, so we just write $g$ instead of $g_i$.

There are many options for $g$, such as linear functions, rectified linear functions, hyperbolic tan, etc. For the sake of this tutorial, we will use the sigmoid function (I’ll explain why later). So we have

$g(x) = \frac{1}{1 + e^{-x}}$ … (3)

Thus, every neuron in an ANN – except the ones in the lowest-level input layer, and bias neurons – takes in the activation of all its inputs, and provides its own activation as an output.

Once a neural network has been designed, it is the job of the training process to ensure that the individual weightages $w_{i, j}$ given by each neuron to each of its inputs is set just right, so that the whole network gives the right output. Backpropagation is one of the ways to optimize those weights.

Whenever we talk of optimization, it is necessary to properly define what we are optimizing and how we are doing it.

What are we trying to optimize?

The error function. This function defines how much the output of our model (in this case an ANN) defers from the required output. For the sake of this tutorial, we will use the Mean-Squared Error(MSE) function. For example, let $N_o$ be one of the output neurons of our ANN. For a given training sample, let the expected output be $t_o$. However, $N_o$ provides an output equal to its activation value, that is $a_o$. Therefore, the error as defined by MSE would be

$E = \frac{1}{2} (t_o - a_o)^2$ … (4)

Since $a_o$ by its definition depends on the weightages given by $N_o$ to its inputs, the ultimate valye of $E$ depends on those weightages too.

It is important to remember that every training sample will result in a different error value. It is backpropagation’s goal to minimize (bring as close to zero as possible) the value of $E$ for all training samples on an average. The way this is done, as mentioned before, is by fine-tuning the values of the weights $w_{i, j}$.

But how does backpropagation fine tune these weightage values? By using a technique called Gradient Descent.

Consider a function as follows:

$f(x) = 3x + 4$

The gradient of the function $f$ with respect to the variable $x$ defines how much the value of $f$ will change with a unit increase/decrease in the value of $x$. For the above example, the answer to that is 3. Basically, you are differentiating $f$ with respect to $x$. Generally speaking, a function may be dependent on a variety of parameters, so you take a partial derivative when computing the gradient.

Consider you are on a hill like the one shown in the image below:

The red regions correspond to places of higher function value, while the blue regions correspond to low values. You want to reach the lowest point possible. If you are standing on the red hill, and you can only see a small part of the terrain around you, what will you do? You will naturally take steps in the direction in which the terrain slopes down. With each step you take, you will scan your immediate neighbourhood (which is visible to you), and go in the direction that shows the steepest descent.

In mathematical terms, this ‘direction’ is nothing but the gradient of your height (the target function you are trying to minimize), with respect to your location (x, y) (which are the parameters you are fine tuning). The overall direction in which you move, will be dependent on the gradients with respect to both dimensions, i.e. $\frac{\partial height}{\partial x}$ and $\frac{\partial height}{\partial y}$. That’s pretty much what gradient descent does.

Its called so, because you are descending from a higher value of the target function to a lower value, by following the direction that corresponds to the lowest gradient (and consequently the steepest decrease in function value).

One thing to note is that you would always want to move in the direction shown by the negation of the gradient. Consider the above example of $f(x) = 3x + 4$ . The gradient here is 3, which is positive. So, with every unit increase in $x$, $f(x)$ would increase by 3. Therefore, to reduce $f(x)$, you would decrease the value of $x$ . The opposite is true in scenarios where you want to maximize the target function. In those cases, you move in the direction of the gradient itself.

Coming back to the context of ANN training, backpropagation reduces the error function $E$ by taking help of the gradients with respect to individual weights, $\frac{\partial E}{\partial w_{i, j}}$.

How does backpropagation employ gradient descent?

First off, it is necessary to keep in mind that backpropagation tries to optimize the inter-neuron weights, and not the neurons themselves. Their functionality is pretty much decided when you decide the activation function.

Secondly, the error function value $E$ depends on the current values of the weights and the training sample in question, apart from the definition of the activation function. So to compute the error with respect to a training sample, you first feed it as an input to the network. You will then receive the appropriate output(s) from the output neuron(s). This part is known as  ‘forward propagation‘, since you are going from the input layer to the output layer (through the hidden layers if any). If you have multiple output neurons, the error values corresponding to each will affect the weights appropriately.

Then, you use the network output(s) along with the expected value(s) to calculate the errors using Equation 4 given above. Once the error values are known, the weights are updated using the gradients. This updating occurs first for the out-most weights, and then for lower layers. The intuition behind this, is the fact that the errors with respect to the outmost neurons are known first, and you derive the errors of the inner neurons from them. Each layer’s errors are calculated from the errors of the neurons they provide input to. That is why the second phase is called ‘backward propagation‘, or back-propagation.

So as a first step, lets try to understand how we will calculate the gradient $\frac{\partial E}{\partial w_{i, j}}$ for a given training sample. Remember that  $w_{i, j}$ is the weightage given by $N_i$ to $N_j$.

Using the chain rule in Leibniz’s form,

$\frac{\partial E}{\partial w_{i, j}} = \frac{\partial E}{\partial in_i} \frac{\partial in_i}{\partial w_{i, j}}$ … (5)

From Equations 1 and 5, and using basic differentiation,

$\frac{\partial E}{\partial w_{i, j}} = a_j \frac{\partial E}{\partial in_i}$ … (6)

From the basic definition of the activation function (and the appropriate current weight values), you already know the value of $a_j$. Now consider the second term in the above product. Using the chain rule again on it, we get

$\frac{\partial E}{\partial in_i} = \frac{\partial E}{\partial a_i} \frac{\partial a_i}{\partial in_i}$ … (7)

This is where the whole advantage of using the sigmoid activation function comes into play. The derivative of a sigmoid function with respect to its input, corresponds to a nice expression. Given equation 3, you can easily verify that

$\frac{dg}{dx} = g(x)(1 - g(x))$ … (8)

From Equations 7 and 8,

$\frac{\partial E}{\partial in_i} = a_i (1 - a_i) \frac{\partial E}{\partial a_i}$ … (9)

Putting the information in Equations 6 and 9 together,

$\frac{\partial E}{\partial w_{i, j}} = a_j a_i (1 - a_i) \frac{\partial E}{\partial a_i}$ … (10)

Look at the last term. There are two possible cases for $N_i$ ($N_i$ cannot be an input neuron, since it is receiving input from another neuron) :

I. $N_i$ is an output neuron. In this case, the second term on the RHS is pretty simple, from Equation 4-

$\frac{\partial E}{\partial a_i} = - (t_i - a_i)$ … (11)

where $t_i$ is the expected output from $N_i$ (for the current training sample).

2. $N_i$ is an inner neuron. Consider all neurons $N_k$, for whom $N_i$ acts an an input. Since we are propagating backward, you can therefore assume that we know the values $\frac{\partial E}{\partial a_k}$ for all of them. In fact, since the output of $N_i$ affects the outputs of all these $N_k$ s, the expression for  $\frac{\partial E}{\partial a_i}$ will involve all of them.

Using simple chain rule,

$\frac{\partial E}{\partial a_i} = \sum_{k} \frac{\partial in_k}{\partial a_i} \frac{\partial E}{\partial in_k}$ … (12)

From Equations 12, 1 and 9,

$\frac{\partial E}{\partial a_i} = \sum_{k} w_{k, i} a_k (1 - a_k) \frac{\partial E}{\partial a_k}$ … (13)

Equations 10, 11 and 13 pretty much define the gradient values $\frac{\partial E}{\partial w_{i, j}}$ for all weights in a given ANN. So how are the weights actually updated?

Using the following rule:

The change caused by a given training sample, at time instance ‘t’, for the weight $w_{i, j}$ is given by

$\Delta w_{i, j}^{t} = - \epsilon \frac{\partial E}{\partial w_{i, j}} + \alpha \Delta w_{i, j}^{t - 1}$

Remember me mentioning that we always need to move in the direction of the gradient’s negation? Thats why the minus sign for the first term. The value $\epsilon$ is called the learning rate. We should never be too hasty in changing the weight’s value too drastically based on the gradient with respect to one training sample. Therefore, the gradient value is multiplied by the learning rate (usually well below 0.5) before causing the required change in the weightage value. If the learning rate is too low, the algorithm takes too long to train. On the other hand, if its too high, the algorithm keeps ‘bouncing around’ in the search space. Imagine you are running towards point B from point A. But you run so fast that you cant stop in time at point B, but you actually end up far ahead of it. Then you again run in the opposite direction towards B, once again crossing it and stopping ahead because of your high speed. Thats pretty much what happens if your learning rate is too large.

A variant of backpropagation, called resilient propagation(RPROP) improves on this classic method by having custom deltas with respect to each weightage in a network. It only uses the gradients as references to determine which direction to move in. Therefore, you don’t have to provide the learning rate as a parameter to resilient propagation.

Now look at the second term in the RHS above. The value $\alpha$ is called the momentum. Sometimes, gradient descent tends to get stuck in local optima in the search space. A local optimum is a point where the value is low compared to the region around it, but not all of the domain overall. The momentum term causes the current weight update to be dependent on the last weight update (and therefore not be drastically different), thus preventing getting stuck in local optima in most cases. If it is indeed a global optimum, the search will eventually come back to the required point as the momentum will go on reducing. Nesterov Momentum is another way of optimizing the use of momentum in gradient descent.

How frequently are the weights updated?

There are 3 answers to this:

1. Online Mode

In this mode, the weights are updated after every training sample. In other words, the forward propagation as well as the backward propagation phases are run for each training sample, every single time. This mode is appropriate if you are receiving your training data sequentially. However, if your training set is large, this mode of training can be pretty time-consuming. Moreover, it is not appropriate to update weights after every sample (in most cases), due to the possible presence of outliers.

2. Batch Mode

In this mode, the weights are updated after accumulating gradients with respect to all samples in the training set. In other words, the forward propagation is run with respect to all samples, and the backward phase is run just once with respect to the accumulated results of the entire batch. This is done for multiple iterations (obviously). The batch mode of training is thus more resistant to variance in the training dataset.

3. Stochastic Mode

The stochastic mode adds randomization to the mix. In the mini-batch method of stochastic learning, small sets of samples are chosen randomly from the dataset, and training is done in batch mode over these mini-batches. Stochastic mode with mini-batches provides a nice trade-off between the pros and cons of online and batch modes of training. Moreover, since you are randomly choosing the mini batches (instead of organizing your dataset into pre-defined subsets), you avoid getting stuck in local optima. This is mainly because you are providing weight updates with respect to a potentially different subset of training samples during every iteration.

EDIT 14/01/2016

Xavier’s initialization

Usually, the weights in a Neural Network are initialized to random values prior to training. This is okay if your network is shallow. But for deep learning applications, where the number of hidden layers may be 3 or more, a wrong selection of initial weights can cause convergence to sub-optimal models.

For example, assume that you are using a sigmoid activation function and your initial weights are pretty small (in the 0-1 range). If you manually calculate the activation values in a neural network across 1-2 layers, you will realize that almost all neurons in the topmost layer give activations around 0.5 (the mean value for the sigmoid function). This phenomenon is called ‘saturation’. Its described well in this paper by Glorot and Bengio.

The solution to this problem, is Xavier’s initialization.  This technique tries to optimize the generation of initial weights, by applying simple statistics over the number of input and output connections corresponding to a neuron. A good intuitive explanation of the method is given at Andy’s blog. As a note to my future self, here’s the TensorFlow implementation of Xavier’s algorithm.