Weekly Review: 12/16/2017

Hebbian Learning in Neural Networks

One major difference between human and machine learning is the way we retain important aspects of our knowledge, as we gather more data. All throughout our life, we keep enforcing those concepts/facts which help us most in our day-to-day activities. ML algorithms are much less selective – if you train a NN to perform task A and then retrain it to perform task B, the parameters for A will be forgotten ‘uniformly’.

This paper tries to apply the Hebbian-learning principle of ‘Neurons that fire together, wire together’ to NNs. Its done as follows:

  1. While training for task A, the algorithm measures the ‘importance‘ of a parameter as the gradient of the L2 norm of the output-error. In essence, this quantifies the absolute change in output for a small change in the param-value.
  2. Now, while re-training for a different task, the importance values computed in the above step are used as regularization parameters. This penalizes changes to the important params for task A, even while training for task B – as a result, the NN still performs decently at task A even after being re-trained.

Distributed Tensorflow

This article gives a very good overview of distributed-computing mechanisms in TensorFlow. The main problem being tackled is the sharing of parameters/variables across different machines. Some take-aways:

  1. tf.Session, by itself, is like an isolated execution engine.
  2. However, multiple sessions can be made to share variables using tf.train.Servers, which are grouped together into ‘clusters’. All servers in the same cluster share variable values (this is done using namespaces).
  3. With tf.device, if you have multiple devices each with their own process, you can choose which one holds the original copy of a Variable.
  4. Each server is responsible for building its own Graph though – the elements of the graph can include process-specific params, as well as global ones shared across servers.

The post gives multiple small examples, as well as one cumulative piece detailed multiple feature – do take a look!

UNsupervised Image-to-Image Translation by Nvidia

Another good paper from NIPS2017. The problem addressed here is that of unsupervised image-to-image translation, also shortened as UNIT. Consider the conversion of a street-photo in sunny weather, to the same street on a rainy day.

If it was supervised, we would have pairs of photos of the same streets, in sunny & rainy weather. But such data is hard to come by, especially in the quantities needed for deep learning. So what if you just have a bunch of sunny street photos, and a set of rainy ones? (with no ‘common’ street). This is basically the problem being solved here.

Nvidia uses a VAE-GAN for this purpose, with some twists. Consider the sunny/rainy example from above:

  1. The raw image is first converted to a latent vector by a Variational Autoencoder.
  2. This latent vector is then given as input to the Generative part of a GAN.

The twist is that the last few layers of the VAE, and the first few layers of the GAN are shared by both sunny-to-rainy and rainy-to-sunny networks. Why so?

You can intuitively see that the VAE is converting the raw pixels into a vector encoding basic attributes of the street – irrespective of the weather. While the first few layers of the VAE deal with raw pixel data, the higher ones understand abstract street-attributes (which are weather-independent). As a result, the latter layers get shared.

The same logic is applied (but in reverse), in keeping the first few layers of the Generative network common.

Heres a video of the method to give you a taste of the results obtained. They are surprisingly good!

Fake AI porn (NSFW)

A Redditor going by the name of ‘deepfake’ uses Tensorflow-based deep learning to paste celebrity faces onto pornstar bodies in videos. While the results are not perfect, they are good enough to cause concerns over consent.

(To give you an example of the progress that has made in video manipulation, take a look at Face2Face – They use a video of some celebrity, and combine it with actions by a user on their live feed, to generate a video of the celebrity doing the same.).

Backpropagation for dummies

Backpropagation is a method of training an Artificial Neural Network. If you are reading this post, you already have an idea of what an ANN is. However, lets take a look at the fundamental component of an ANN- the artificial neuron.


The figure shows the working of the ith neuron (lets call it N_i) in an ANN. Its output (also called its activation) is a_i. The jth neuron N_j provides one of the inputs to N_i.

How does N_i produce its own activation?

1. N_i has stored in itself, a weightage assigned to each of its inputs. Lets say that the weightage assigned by N_i to N_j is w_{i, j}. As the first step, N_i computes a weighted sum of all its inputs. Lets call it in_i. Therefore,

in_i = \sum_{k \in Inputs(N_i)} w_{i, k}a_k … (1)

where Inputs(N_i) corresponds to all neurons that provide an input to N_i.

2. This sum, called the input function, is then passed on to N_i‘s activation function. It is called so, because it computes the activation that will be provided by N_i to the units it in turn acts as an input to. Therefore,

a_i = g(in_i) … (2)

Usually, the activation functions of all neurons in an ANN are the same, so we just write g instead of g_i.

There are many options for g, such as linear functions, rectified linear functions, hyperbolic tan, etc. For the sake of this tutorial, we will use the sigmoid function (I’ll explain why later). So we have

g(x) = \frac{1}{1 + e^{-x}} … (3)


Thus, every neuron in an ANN – except the ones in the lowest-level input layer, and bias neurons – takes in the activation of all its inputs, and provides its own activation as an output.

Once a neural network has been designed, it is the job of the training process to ensure that the individual weightages w_{i, j} given by each neuron to each of its inputs is set just right, so that the whole network gives the right output. Backpropagation is one of the ways to optimize those weights.

Whenever we talk of optimization, it is necessary to properly define what we are optimizing and how we are doing it.

What are we trying to optimize?

The error function. This function defines how much the output of our model (in this case an ANN) defers from the required output. For the sake of this tutorial, we will use the Mean-Squared Error(MSE) function. For example, let N_o be one of the output neurons of our ANN. For a given training sample, let the expected output be t_o. However, N_o provides an output equal to its activation value, that is a_o. Therefore, the error as defined by MSE would be

E = \frac{1}{2} (t_o - a_o)^2 … (4)

Since a_o by its definition depends on the weightages given by N_o to its inputs, the ultimate valye of E depends on those weightages too.

It is important to remember that every training sample will result in a different error value. It is backpropagation’s goal to minimize (bring as close to zero as possible) the value of E for all training samples on an average. The way this is done, as mentioned before, is by fine-tuning the values of the weights w_{i, j}.

But how does backpropagation fine tune these weightage values? By using a technique called Gradient Descent.

What is Gradient Descent?

Consider a function as follows:

f(x) = 3x + 4

The gradient of the function f with respect to the variable x defines how much the value of f will change with a unit increase/decrease in the value of x. For the above example, the answer to that is 3. Basically, you are differentiating f with respect to x. Generally speaking, a function may be dependent on a variety of parameters, so you take a partial derivative when computing the gradient.

Consider you are on a hill like the one shown in the image below:

Gradient_descent_1The red regions correspond to places of higher function value, while the blue regions correspond to low values. You want to reach the lowest point possible. If you are standing on the red hill, and you can only see a small part of the terrain around you, what will you do? You will naturally take steps in the direction in which the terrain slopes down. With each step you take, you will scan your immediate neighbourhood (which is visible to you), and go in the direction that shows the steepest descent.

In mathematical terms, this ‘direction’ is nothing but the gradient of your height (the target function you are trying to minimize), with respect to your location (x, y) (which are the parameters you are fine tuning). The overall direction in which you move, will be dependent on the gradients with respect to both dimensions, i.e. \frac{\partial height}{\partial x} and \frac{\partial height}{\partial y} . That’s pretty much what gradient descent does.

Its called so, because you are descending from a higher value of the target function to a lower value, by following the direction that corresponds to the lowest gradient (and consequently the steepest decrease in function value).

One thing to note is that you would always want to move in the direction shown by the negation of the gradient. Consider the above example of f(x) = 3x + 4 . The gradient here is 3, which is positive. So, with every unit increase in x, f(x) would increase by 3. Therefore, to reduce f(x), you would decrease the value of x . The opposite is true in scenarios where you want to maximize the target function. In those cases, you move in the direction of the gradient itself.

Coming back to the context of ANN training, backpropagation reduces the error function E by taking help of the gradients with respect to individual weights, \frac{\partial E}{\partial w_{i, j}} .

How does backpropagation employ gradient descent?

First off, it is necessary to keep in mind that backpropagation tries to optimize the inter-neuron weights, and not the neurons themselves. Their functionality is pretty much decided when you decide the activation function.

Secondly, the error function value E depends on the current values of the weights and the training sample in question, apart from the definition of the activation function. So to compute the error with respect to a training sample, you first feed it as an input to the network. You will then receive the appropriate output(s) from the output neuron(s). This part is known as  ‘forward propagation‘, since you are going from the input layer to the output layer (through the hidden layers if any). If you have multiple output neurons, the error values corresponding to each will affect the weights appropriately.

Then, you use the network output(s) along with the expected value(s) to calculate the errors using Equation 4 given above. Once the error values are known, the weights are updated using the gradients. This updating occurs first for the out-most weights, and then for lower layers. The intuition behind this, is the fact that the errors with respect to the outmost neurons are known first, and you derive the errors of the inner neurons from them. Each layer’s errors are calculated from the errors of the neurons they provide input to. That is why the second phase is called ‘backward propagation‘, or back-propagation.

So as a first step, lets try to understand how we will calculate the gradient \frac{\partial E}{\partial w_{i, j}} for a given training sample. Remember that  w_{i, j} is the weightage given by N_i to N_j.

Using the chain rule in Leibniz’s form,

\frac{\partial E}{\partial w_{i, j}} = \frac{\partial E}{\partial in_i} \frac{\partial in_i}{\partial w_{i, j}} … (5)

From Equations 1 and 5, and using basic differentiation,

\frac{\partial E}{\partial w_{i, j}} = a_j \frac{\partial E}{\partial in_i} … (6)

From the basic definition of the activation function (and the appropriate current weight values), you already know the value of a_j. Now consider the second term in the above product. Using the chain rule again on it, we get

\frac{\partial E}{\partial in_i} = \frac{\partial E}{\partial a_i} \frac{\partial a_i}{\partial in_i} … (7)

This is where the whole advantage of using the sigmoid activation function comes into play. The derivative of a sigmoid function with respect to its input, corresponds to a nice expression. Given equation 3, you can easily verify that

\frac{dg}{dx} = g(x)(1 - g(x))  … (8)

From Equations 7 and 8,

\frac{\partial E}{\partial in_i} = a_i (1 - a_i) \frac{\partial E}{\partial a_i} … (9)

Putting the information in Equations 6 and 9 together,

\frac{\partial E}{\partial w_{i, j}} = a_j a_i (1 - a_i) \frac{\partial E}{\partial a_i} … (10)

Look at the last term. There are two possible cases for N_i (N_i cannot be an input neuron, since it is receiving input from another neuron) :

I. N_i is an output neuron. In this case, the second term on the RHS is pretty simple, from Equation 4-

\frac{\partial E}{\partial a_i} = - (t_i - a_i) … (11)

where t_i is the expected output from N_i (for the current training sample).

2. N_i is an inner neuron. Consider all neurons N_k, for whom N_i acts an an input. Since we are propagating backward, you can therefore assume that we know the values \frac{\partial E}{\partial a_k} for all of them. In fact, since the output of N_i affects the outputs of all these N_k s, the expression for  \frac{\partial E}{\partial a_i} will involve all of them.

Using simple chain rule,

\frac{\partial E}{\partial a_i} = \sum_{k} \frac{\partial in_k}{\partial a_i} \frac{\partial E}{\partial in_k} … (12)

From Equations 12, 1 and 9,

\frac{\partial E}{\partial a_i} = \sum_{k} w_{k, i} a_k (1 - a_k) \frac{\partial E}{\partial a_k} … (13)

Equations 10, 11 and 13 pretty much define the gradient values \frac{\partial E}{\partial w_{i, j}} for all weights in a given ANN. So how are the weights actually updated?

Using the following rule:

The change caused by a given training sample, at time instance ‘t’, for the weight w_{i, j} is given by

\Delta w_{i, j}^{t} = - \epsilon \frac{\partial E}{\partial w_{i, j}} + \alpha \Delta w_{i, j}^{t - 1}

Remember me mentioning that we always need to move in the direction of the gradient’s negation? Thats why the minus sign for the first term. The value \epsilon is called the learning rate. We should never be too hasty in changing the weight’s value too drastically based on the gradient with respect to one training sample. Therefore, the gradient value is multiplied by the learning rate (usually well below 0.5) before causing the required change in the weightage value. If the learning rate is too low, the algorithm takes too long to train. On the other hand, if its too high, the algorithm keeps ‘bouncing around’ in the search space. Imagine you are running towards point B from point A. But you run so fast that you cant stop in time at point B, but you actually end up far ahead of it. Then you again run in the opposite direction towards B, once again crossing it and stopping ahead because of your high speed. Thats pretty much what happens if your learning rate is too large.

A variant of backpropagation, called resilient propagation(RPROP) improves on this classic method by having custom deltas with respect to each weightage in a network. It only uses the gradients as references to determine which direction to move in. Therefore, you don’t have to provide the learning rate as a parameter to resilient propagation.

Now look at the second term in the RHS above. The value \alpha is called the momentum. Sometimes, gradient descent tends to get stuck in local optima in the search space. A local optimum is a point where the value is low compared to the region around it, but not all of the domain overall. The momentum term causes the current weight update to be dependent on the last weight update (and therefore not be drastically different), thus preventing getting stuck in local optima in most cases. If it is indeed a global optimum, the search will eventually come back to the required point as the momentum will go on reducing. Nesterov Momentum is another way of optimizing the use of momentum in gradient descent.

How frequently are the weights updated?

There are 3 answers to this:

1. Online Mode

In this mode, the weights are updated after every training sample. In other words, the forward propagation as well as the backward propagation phases are run for each training sample, every single time. This mode is appropriate if you are receiving your training data sequentially. However, if your training set is large, this mode of training can be pretty time-consuming. Moreover, it is not appropriate to update weights after every sample (in most cases), due to the possible presence of outliers.

2. Batch Mode

In this mode, the weights are updated after accumulating gradients with respect to all samples in the training set. In other words, the forward propagation is run with respect to all samples, and the backward phase is run just once with respect to the accumulated results of the entire batch. This is done for multiple iterations (obviously). The batch mode of training is thus more resistant to variance in the training dataset.

3. Stochastic Mode

The stochastic mode adds randomization to the mix. In the mini-batch method of stochastic learning, small sets of samples are chosen randomly from the dataset, and training is done in batch mode over these mini-batches. Stochastic mode with mini-batches provides a nice trade-off between the pros and cons of online and batch modes of training. Moreover, since you are randomly choosing the mini batches (instead of organizing your dataset into pre-defined subsets), you avoid getting stuck in local optima. This is mainly because you are providing weight updates with respect to a potentially different subset of training samples during every iteration.

EDIT 14/01/2016

Xavier’s initialization

Usually, the weights in a Neural Network are initialized to random values prior to training. This is okay if your network is shallow. But for deep learning applications, where the number of hidden layers may be 3 or more, a wrong selection of initial weights can cause convergence to sub-optimal models.

For example, assume that you are using a sigmoid activation function and your initial weights are pretty small (in the 0-1 range). If you manually calculate the activation values in a neural network across 1-2 layers, you will realize that almost all neurons in the topmost layer give activations around 0.5 (the mean value for the sigmoid function). This phenomenon is called ‘saturation’. Its described well in this paper by Glorot and Bengio.

The solution to this problem, is Xavier’s initialization.  This technique tries to optimize the generation of initial weights, by applying simple statistics over the number of input and output connections corresponding to a neuron. A good intuitive explanation of the method is given at Andy’s blog. As a note to my future self, here’s the TensorFlow implementation of Xavier’s algorithm.