Weekly Review: 10/28/2017

This was a pretty busy week with a lot going on, but I finally seem to be settling into my new role!

The Aerial Robotics course is almost over, with a week to go. There hasn't been much coding in this course, but that was to be expected since it was more about PID control theory and quadrotor dynamics. I am particularly interested in the Capstone/'final' project for this course, which would involve building an autonomous robot on a Raspberry Pi.

Anyway, on to the interesting tidbits from this week:

AlphaGo Zero

Google's DeepMind recently announced a new version of their AI-based Go player, AlphaGo Zero. What makes this one so special is that it breaks the common notion of intelligent systems requiring a LOT of data to produce decent results. AlphaGo Zero was only provided the basic rules of Go, and it learned everything else by playing against itself. Oh, and by the way, AlphaGo Zero beats AlphaGo, the previous champion at the game. This is indeed a landmark in demonstrating the power of good old RL.

Read this article for a basic overview, and their paper in Nature for a detailed explanation. Brushing up on Monte Carlo Tree Search would certainly help.

Word Mover’s Distance

Given a good word embedding such as Word2Vec, it is not very difficult to compute the semantic distance between individual terms. However, when it comes to bigger blocks of text, a simple 'average' over term-embeddings isn't good enough for computing their relative distances.

In such cases, the Word Mover's Distance, inspired by the Earth Mover's Distance, provides a better solution. It matches each term in one document to the semantically closest term(s) in the other, and then computes the average effort required to 'rephrase' one text in the words of the other. Click on the article link for a detailed explanation.
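If you want to play with it, the gensim library ships a wmdistance helper on its word-vector models (it needs an optimal-transport backend such as pyemd installed). A rough sketch, assuming you have the pre-trained GoogleNews vectors lying around (the file name below is just an assumption):

    # Hedged sketch: Word Mover's Distance via gensim's wmdistance.
    # Assumes pre-trained word vectors are available locally (path is illustrative).
    from gensim.models import KeyedVectors

    vectors = KeyedVectors.load_word2vec_format(
        'GoogleNews-vectors-negative300.bin', binary=True)

    doc1 = "obama speaks to the media in illinois".split()
    doc2 = "the president greets the press in chicago".split()

    # Minimal cumulative 'travel' cost to move the embedded words of doc1
    # onto the embedded words of doc2 -- smaller means semantically closer.
    print(vectors.wmdistance(doc1, doc2))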

Robots generalizing from simulations

OpenAI posted a blog article about how they trained a robot entirely through simulations. This means that the robot received no real-world sensor data during the training phase, yet was able to perform basic tasks in deployment after some calibration.

During the simulations, they used dynamics randomization to alter basic traits of the environment. This data was then fed to an LSTM to understand the settings and goals. A key insight from this work is Hindsight Experience Replay. Quoting the article, "Hindsight Experience Replay (HER), allows agents to learn from a binary reward by pretending that a failure was what they wanted to do all along and learning from it accordingly. (By analogy, imagine looking for a gas station but ending up at a pizza shop. You still don't know where to get gas, but you've now learned where to get pizza.)"

Concurrency in Go

If you are a Go programmer, take a look at this old (but good) talk on concurrency patterns and constructs in the language.

Generalization Bounds in Machine Learning

The Generalization Gap for an ML system is defined as the difference between the training error and the generalization error. The Generalization Bound tries to put a bound on this value, based on probability theory. Read this article for a detailed mathematical explanation.
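To make the 'bound' part concrete, here is the classical Hoeffding-style bound for a finite hypothesis class (one of the simpler results such articles build up to, stated here from memory rather than from the linked article): with probability at least 1 - \delta over a training set of N i.i.d. samples, every hypothesis h in \mathcal{H} satisfies

R(h) \leq \hat{R}(h) + \sqrt{\frac{\ln|\mathcal{H}| + \ln(2/\delta)}{2N}}

where R(h) is the generalization error and \hat{R}(h) the training error. The gap shrinks as the training set grows, and widens as the hypothesis class gets richer.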


Weekly Review: 10/21/2017

It's been a long while since I last posted, but for good reason! I was busy shifting base from Google's Hyderabad office to their new location in Sunnyvale. This is my first time in the USA, so there is a lot to take in and process!

Anyway, I am now working on Google's Social-Search and Ranking team. At the same time, I am also doing Coursera's Robotics Specialization to learn a subject I have never really touched upon. Be warned if you ever decide to give it a try: their very first course, titled Aerial Robotics, involves a lot of linear algebra and physics. Since I last did all this in my freshman year of college, I am just about getting the weeks done!

My plate is already full with a lot of ToDos, and I feel bad for not posting, so I found a middle ground: I will try, to the best of my ability, to post one article each weekend about the random/new interesting things I read over the course of the week. This is partly for my own reference later on, since I have found myself going back to my posts quite a few times to revisit a concept I wrote about. So here goes:

Eigenvectors & Eigenvalues

Anything 'eigen' has confused me for a while now, mainly because I never understood the intuition behind the concept. The highest-rated answer to this Math-Stackexchange question did the job: every square matrix represents a linear transformation. The eigenvectors roughly describe how the transformation orients the results (or the directions of maximum change), while the corresponding eigenvalues describe the amount of distortion caused in those directions.
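A quick way to see this numerically is to let numpy do the work (a small sketch with an arbitrary 2x2 matrix; nothing here is specific to the linked answer):

    # Sketch: eigenvectors are the directions a matrix only stretches, never rotates.
    import numpy as np

    A = np.array([[3.0, 1.0],
                  [0.0, 2.0]])    # an arbitrary square matrix, i.e. a linear transformation

    eigenvalues, eigenvectors = np.linalg.eig(A)

    for value, vector in zip(eigenvalues, eigenvectors.T):
        # A @ v points in the same direction as v, just scaled by the eigenvalue
        print(np.allclose(A @ vector, value * vector))   # True for every pair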

Transfer Learning

Machine Learning currently specializes in utilizing data from a certain {Task, Domain} combo (e.g., Task: recognize dogs in photos, Domain: photos of dogs) to learn a function. However, when this same function/model is used on a different but related task (recognize foxes in photos) or a different domain (photos of dogs taken at night), it performs poorly. This article discusses Transfer Learning, a method for applying knowledge learned in one setting to problems in different ones.

Dynamic Filters

The filters used in Convolutional Neural Network layers usually have fixed weights at a certain layer, for a given feature map. This paper from the NIPS conference discusses the idea of layers that change their filter weights depending on the input. The intuition is this: even though a filter is trained to look for a specialized feature within a given image, the orientation/shape/size of the feature might change with the image itself. This is especially true while analysing data such as moving objects within videos. A dynamic filter will then be able to adapt to the incoming data, and efficiently recognise the intended features in spite of distortions.


An introduction to Bayesian Belief Networks

A Bayesian Belief Network (BBN), or simply Bayesian Network, is a statistical model used to describe the conditional dependencies between different random variables.

BBNs are chiefly used in areas like computational biology and medicine for risk analysis and decision support (basically, to understand what caused a certain problem, or the probabilities of different effects given an action).

Structure of a Bayesian Network

A typical BBN looks something like this:


The example shown, 'Burglary-Alarm', is one of the most quoted ones in texts on Bayesian theory. Let's look at the structural characteristics one by one. We will delve into the numbers/tables later.

Directed Acyclic Graph (DAG)

We obviously have one node per random variable.

Directed: The connections/edges denote cause->effect relationships between pairs of nodes. For example, Burglary->Alarm in the above network indicates that the occurrence of a burglary directly affects the probability of the Alarm going off (and not the other way round). Here, Burglary is the parent, while Alarm is the child node.

Acyclic: There cannot be a cycle in a BBN. In simple English, a variable A cannot depend on its own value – directly, or indirectly. If this was allowed, it would lead to a sort of infinite recursion which we are not prepared to deal with. However, if you do realize that an event happening affects its probability later on, then you could express the two occurrences as separate nodes in the BBN (or use a Dynamic BBN).

Parents of a Node

One of the biggest considerations while building a BBN is to decide which parents to assign to a particular node. Intuitively, they should be those variables which most directly affect the value of the current node.

Formally, this can be stated as follows: “The parents of a variable X (parents(X)) are the minimal set of ancestors of X, such that all other ancestors of X are conditionally independent of X given parents(X)“.

Let's take this step by step. First off, there has to be some sort of a cause-effect relationship between Y and X for Y to be one of the ancestors of X. In the shown example, the ancestors of Mary Calls are Burglary, Earthquake and Alarm.

Now consider the two ancestors Alarm and Earthquake. The only way an Earthquake would affect Mary Calls is if the Earthquake causes the Alarm to go off, leading to Mary Calls. Suppose someone told you that the Alarm has in fact gone off. In this case, it does not matter what led to the Alarm ringing, since Mary will react based on the stimulus of the Alarm itself. In other words, Earthquake and Mary Calls become conditionally independent if you know the exact value of Alarm.

Mathematically speaking, P(Mary Calls|Alarm,Earthquake) == P(Mary Calls|Alarm).

Thus, parents(X) are those ancestors which do not become conditionally independent of X given the value of some other ancestor. If they do, then the resultant connection would actually be redundant.

Disconnected Nodes are Conditionally Independent

Based on the directed connections in a BBN, if there is no way to go from a variable X to Y (or vice versa), then X and Y are conditionally independent. In the example BBN, pairs of variables that are conditionally independent are {Mary Calls, John Calls} and {Burglary, Earthquake}.

It is important to remember that ‘conditionally independent’ does not mean ‘totally independent’. Consider {Mary Calls, John Calls}. Given the value of Alarm (that is, whether the Alarm went off or not), Mary and John each have their own independent probabilities of calling. However, if you did not know about any of the other nodes, but just that John did call, then your expectation of Mary calling would correspondingly increase.

Mathematics behind Bayesian Networks

BBNs provide a mathematically correct way of assessing the effects of different events (or nodes, in this context) on each other. And these assessments can be made in either direction – not only can you compute the most likely effects given the values of certain causes, but also determine the most likely causes of observed events.

The numerical data provided with the BBN (by an expert or some statistical study) that allows us to do this is:

  1. The prior probabilities of variables with no parents (Earthquake and Burglary in our example).
  2. The conditional probabilities of any other node given every value-combination of its parent(s). For example, the table next to Alarm defines the probability that the Alarm will go off given whether an Earthquake and/or a Burglary has occurred.

In case of continuous variables, we would need a conditional probability distribution.

The biggest use of Bayesian Networks is in computing revised probabilities. A revised probability defines the probability of a node given the values of one or more other nodes as a fact. Let's take an example from the Burglary-Alarm BBN.

Suppose we want to calculate the probability that an earthquake occurred, given that the alarm went off, but there was no burglary. Essentially, we want P(Earthquake|Alarm,\sim Burglary). Simplifying the nomenclature a bit, P(E|A,\sim B).

Here, you can say that the Alarm going off (A) is evidence, the knowledge that the Burglary did not happen (\sim B) is context, and the Earthquake occurring (E) is the hypothesis. If you knew nothing else, P(E) = 0.002, from the diagram. However, with the context and evidence in mind, this probability gets changed/revised. Hence, it's called 'computing revised probabilities'.

A version of Bayes Theorem states that

P(X|YZ) = \frac{P(X|Z)P(Y|XZ)}{P(Y|Z)} …(1)

where X is the hypothesis, Y is the evidence, and Z is the context. The numerator on the RHS denotes the probability that X and Y both occur given Z, which is a subset of the probability that at least Y occurs given Z, irrespective of X.

Using (1), we get

P(E|A, \sim B) = \frac{P(E|\sim B)P(A|\sim B, E)}{P(A|\sim B)} …(2)

Since E and B are independent phenomena without knowledge of A,

P(E|\sim B) = P(E) = 0.002 …(3)

From the table for A,

P(A|\sim B, E) = 0.29 …(4)

Finally, using the Total Probability Theorem,

P(A| \sim B) = P(E) P(A| E, \sim B) + P(\sim E) P(A| \sim E, \sim B) …(5)

This is nothing but the average of P(A| E, \sim B) and P(A| \sim E, \sim B), weighted by P(E) and P(\sim E) respectively.

Substituting values in (5),

P(A| \sim B) = 0.002 * 0.29 + 0.998 * 0.001 = 0.001578  …(6)

From (2), (3), (4), & (6), we get

P(E|A, \sim B) = 0.367
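If you would rather sanity-check these numbers than trust my arithmetic, the whole calculation fits in a few lines of Python (using only the probabilities quoted above):

    # Sketch: revised probability P(E | A, ~B) for the Burglary-Alarm network.
    p_E = 0.002                    # prior probability of an Earthquake
    p_A_given_E_notB = 0.29        # P(Alarm | Earthquake, no Burglary)
    p_A_given_notE_notB = 0.001    # P(Alarm | no Earthquake, no Burglary)

    # Total probability: P(A | ~B), equation (5)
    p_A_given_notB = p_E * p_A_given_E_notB + (1 - p_E) * p_A_given_notE_notB

    # Bayes theorem, equation (2): P(E | A, ~B) = P(E) * P(A | E, ~B) / P(A | ~B)
    print(p_E * p_A_given_E_notB / p_A_given_notB)   # ~0.367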

As you can see, the probability of the Earthquake actually increases if you know that the Alarm went off but a Burglary was not the cause of it. This should make sense intuitively as well. Which brings us to the final part –

The ‘Explain Away’ Effect

The Explain Away effect, commonly associated with BBNs, is a result of computing revised probabilities. It refers to the phenomenon where knowing that one cause has occurred, reduces (but does not eliminate) the probability that the other cause(s) took place.

Suppose that instead of knowing there has been no burglary, as in our example, you in fact knew that one had taken place. It also led to the Alarm going off. With this information in mind, your tendency to check out the 'earthquake' hypothesis reduces drastically. In other words, the burglary has explained away the alarm.

It is important to note that the probability of the other causes only gets reduced, but does NOT go down to zero. In a stroke of bad luck, both a burglary and an earthquake could have occurred, and either of the two stimuli could have led to the alarm ringing. To what extent you can 'explain away' a piece of evidence depends on the conditional probability distributions.

Residual Neural Networks as Ensembles

In a previous blog post, I had mentioned Residual Connections in the context of Google Neural Machine Translation. I was not completely familiar with the intuition behind Residual Networks then, so here's a short post on what I gathered after reading some literature.

Residual Networks first got attention after this paper – “Deep Residual Learning for Image Recognition” by some folks from Microsoft Research. They used what they called residual connections in Convolutional Neural Networks, obtaining 1st-place results at ILSVRC 2015. However, it turns out that their notion of why the networks achieved such great performance might have been flawed.

The original authors thought that the new connections allowed training of very deep neural networks (their state-of-the-art model had 152 layers) by ‘preserving’ gradient across layers during training. This, according to them, solved the Vanishing Gradient Problem. However, the paper “Residual Networks Behave Like Ensembles of Relatively Shallow Networks” challenges this, attempting to prove that Residual NNs actually behave like ensembles of neural networks!

Let's take this step-by-step.

First, let's establish some nomenclature. In a classic neural network, suppose the output from the nth layer is written as y_n. Mathematically, you can write:

y_n = f_n(y_{n-1}) …(1)

Here, f_n is a vectorial function that computes y_n from y_{n-1}. In a standard neural network, the mathematical computation performed by f is the same across all layers (like the ReLU). What differentiates any f_n from the other layers is the set of weights assigned to every element of y_{n-1} towards computing elements of y_n. These weights are of course learnt during training.

A typical layer in a Residual NN looks like this (taken from the original paper):


Based on our nomenclature, the expression for y_n now becomes:

y_n = f_n(y_{n-1}) + y_{n-1} …(2)

The difference, as you must have noticed, lies in the second term on the Right-Hand-Side. This term is what they call the identity skip (or short-cut) connection. 'Skip', because you are skipping the f_n computation, and 'identity' because you are not multiplying y_{n-1} by any set of weights before addition.

These identity short-cut connections are what make residual NNs special.
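As a toy sketch of equation (2) (not the original paper's convolutional block), a residual layer differs from a plain one only by the final addition:

    # Minimal sketch: y_n = f_n(y_{n-1}) + y_{n-1}.
    # f_n here is a toy fully-connected layer with ReLU; real residual nets use conv layers.
    import numpy as np

    def residual_layer(y_prev, W, b):
        f = np.maximum(0, W @ y_prev + b)   # f_n(y_{n-1}): weights + non-linearity
        return f + y_prev                   # identity skip: add the input back, unweighted

    y0 = np.random.randn(4)
    W, b = np.random.randn(4, 4), np.random.randn(4)
    print(residual_layer(y0, W, b).shape)   # (4,) -- same dimensionality as the input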

‘Unravelling’ a Residual Neural Network

The reference paper‘s biggest insight lies in the way they look at equation (2) – that is, as a recursive definition. They call this unravelling the NN.

Let's consider a simple residual NN with three layers. The input can be denoted as y_0, with the outputs from the three subsequent layers being y_1, y_2 and y_3 respectively. y_3 is the output of the NN as a whole.


Expanding equation (2) step-by-step, we would get

y_3 = f_3(y_2) + y_2 …(3)

y_3 = f_3(f_2(y_1) + y_1) + f_2(y_1) + y_1 …(4)

y_3 = f_3(f_2(f_1(y_0) + y_0) + f_1(y_0) + y_0) + f_2(f_1(y_0) + y_0) + f_1(y_0) + y_0 …(5)

Equation (5) essentially defines y_3 as the sum of outputs from every single layer (including y_0) of the neural network.

Contrast this with what we would see if it was a standard network:

y_3 = f_3(f_2(f_1(y_0))) …(6)

(Of course, the weights learnt would be different in each case.)

Consider a situation where y_0 and the weights for f_2 are such that f_2(y_1) turns out to be a vector of very low values.

If this happens in equation (6), you see that there is very little f_3 can do. Since there is only one term on the RHS, any input basically follows one path through the NN.

But now consider equation (5). If f_2(y_1) is close to a zero vector, we still get

y_3 = f_3(f_1(y_0) + y_0) + f_1(y_0) + y_0

It looks like the input signal bypassed the second layer completely on its way to the output! In fact, if we did not have f_2 at all, you would derive the above expression for the network. This flexibility (i.e. the ability to bypass any layer) is provided to the NN by the skip-connections.

And the ability to 'skip' is not just restricted to one layer per run – depending on the input and the weights at every level, any combination of them could be on/off for a given case. The name residual can be interpreted better now – residual can mean 'unused', which is pretty much what f_2 is in the above example! That is not the case for equation (6), since the network does not have the choice to ignore f_2's output in the single term on the RHS.
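You can check the unravelling numerically: with toy layers, the recursive form of equation (2) and the fully expanded equation (5) give exactly the same output (a small sketch, assuming random ReLU layers):

    # Sketch: recursive residual computation vs. its 'unravelled' expansion.
    import numpy as np

    np.random.seed(0)
    W = [np.random.randn(4, 4) for _ in range(3)]

    def f(n, y):                     # toy f_n: layer-specific weights + ReLU
        return np.maximum(0, W[n - 1] @ y)

    y0 = np.random.randn(4)

    # Recursive form: y_n = f_n(y_{n-1}) + y_{n-1}, applied three times
    y = y0
    for n in (1, 2, 3):
        y = f(n, y) + y

    # Fully unravelled form, equation (5): everything written in terms of y_0
    unravelled = (f(3, f(2, f(1, y0) + y0) + f(1, y0) + y0)
                  + f(2, f(1, y0) + y0)
                  + f(1, y0) + y0)

    print(np.allclose(y, unravelled))   # True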

Diagrammatically, you could ‘unravel’ the network like this:


Each ‘path’ in the above figure denotes some combination of the layers working on y_0 as it makes its way to the output. Every node denotes convergence (addition) of a set of paths, before passing on the sum to the next layer.

Ideally, every output would get significant contributions from every one of those paths. But that is not usually the case, as observed by the authors of the paper.

Now lets see what happens to the figure the moment you switch off f_2, cutting off all corresponding paths:


As you can see, even though f_2 makes no contributions, the input still makes its way through the remaining paths to generate output.

Intuitively, you can see why the authors think of this structure as an ensemble. At every single node (including the last one), there are multiple paths, any of which may or may not contribute to the overall output for a given scenario. You can say that every y_0 activates a unique set of paths in a residual NN.

And how many such paths are there? It's equivalent to the total number of cases resulting from every one of the layers being either on or off. Simple counting will tell you that it's 2^N, where N is the total number of layers in the NN.

Unusual properties of Residual NNs

Thinking of residual NNs as ensembles motivated the authors to conduct some experiments to test their hypothesis. Essentially, what they tried to do is see which properties of ensembles residual NNs satisfy:

1. Resilience to layer deletion: In most Neural Networks (the ones that resemble equation (6)), deleting a layer has disastrous results on the outputs – whatever the size of the overall network may be. And for good reason, as you are disrupting the one term that corresponds to the output.

But that is not the case for Residual Neural Networks. In fact, deleting 1 (or even 2 or 3) layers in large residual NNs introduces only around 6-7% of an error into the performance of the network! Moreover, deleting more and more layers actually has a pretty smooth (as in mathematically smooth) effect on the total error:


This is pretty close to how an ensemble behaves – deleting models from an ensemble does introduce error, but the increase is never drastic with respect to the number of models removed.

This can be explained easily by looking at equation (5). Even if you delete one layer from a residual NN, that still leaves 2^{N-1} terms on the RHS of the output. A bunch of them even yield the same result as they would have, with the layer in.

2. Shortening of effective paths: This is actually contradictory to what people first believed about residual NNs – that they promote deeper networks.

During training, the authors observed that the updates were not happening uniformly across all layers, as they would for normal NNs. Every training point would adjust the weights along a specific set of ‘paths’ as shown in the unravelled diagram. And most of these paths, even with 152-layer deep networks, were only 20-30 levels deep!

Even on-line, every input activated only a specific set of paths without taking significant contributions from every single layer and path in the network.

This is where the biggest revelation lies: Residual NNs work better not by increasing the effective paths, but actually reducing them! What works here is that every input has a chance to take its own unique set of paths to the output, without having to go through every single layer.

This is again similar to how ensembles are trained and run. During training, you won’t observe all smaller models in an ensemble getting significantly adjusted for every training point. On-line, there will always only be a subset of models that give a strong output for any input.


That's it for now! Do read the reference paper if you feel interested, or go through the original paper to see their usage in the image-processing scenario.

How Neural Networks generate Visual Art from inspiration

Since my last blog post on Google Translate, I have been reading the earlier articles on Google’s Research Blog. Their work on generative AI particularly caught my eye, where they have tried building models to create art/imagery using deep learning.

Announced back in 2015, DeepDream has fascinated a lot of people with its ability to interpret images in fascinating ways to ‘dream up’ complicated visuals where none exist.


Talking about creating beautiful pictures, we also had apps like Prisma and DeepForger that transformed user-given photos in the manner of famous/standard art-styles to create some stunning work.


In this post, I attempt to give an intuitive explanation for this paper: A Neural Algorithm of Artistic Style by Gatys, Ecker and Bethge. The aim of this work is pretty similar to what Prisma actually does, i.e. combining the content from one image with the artistic style of another to fabricate a new image. On the way we will also get some glimpse into how DeepDream works.

Convolutional Neural Networks

Before we delve into the creation of images, let's get a high-level understanding of how deep learning typically understands them. Convolutional Neural Networks (CNNs) are state-of-the-art when it comes to image analysis. Assuming you know what a basic Neural Network is, here's a simplified depiction of a Convolutional Network:


Layers 1 & 2 are what make CNNs special; the final ‘classifier’ is just a standard fully-connected network.

Both layer 1 and 2 are performing two different operations on the input:

  1. Convolution
  2. Pooling

In the Convolution step, we compute a set of Feature Maps using the previous layer. A Feature Map typically has the same dimensions as the input 'image', but there's a difference in the way its neurons are connected to the preceding layer. Each one is only connected to a small local area around its position (see image). What's more, the set of weights that every neuron uses is the same. This set of shared weights is also called a filter.


Intuitively, you can say that each node in the Feature Map is essentially looking for the same concept, but in a limited area. This gives CNNs a very powerful trait: the ability to detect features irrespective of their position in the actual image. Since every neuron is trained to detect the same entity (shared weights), one or the other will fire in case the corresponding object happens to be in the input – irrespective of the exact location. Also worth noting is the fact that neighboring neurons in the Map will analyze partially intersecting portions of the previous layer, so we haven't really done any hard 'segmentation'.
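Here is a toy sketch of how one feature map comes out of one shared 3x3 filter (no padding or training, just the sliding-window idea; real implementations are vectorized):

    # Toy sketch: one feature map from a single shared 3x3 filter (stride 1, no padding).
    import numpy as np

    def feature_map(image, kernel):
        h, w = image.shape
        k = kernel.shape[0]
        out = np.zeros((h - k + 1, w - k + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                # every output neuron looks at a small local patch of the input,
                # and all of them share the same weights (the filter)
                out[i, j] = np.sum(image[i:i + k, j:j + k] * kernel)
        return out

    image = np.random.rand(8, 8)
    vertical_edge_filter = np.array([[1.0, 0.0, -1.0]] * 3)
    print(feature_map(image, vertical_edge_filter).shape)   # (6, 6)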

In the set of Feature Maps at a particular level, each one looks for their own concept which they learnt during training. As you go higher and higher up the overall layers, these sets of Maps start looking for progressively higher-level objects. The first set (in the lowest layer) might look for lines/circles/curves, the next one might detect shapes of eyes/noses/etc, while the topmost layers will ultimately understand complete faces (an over-simplification, but you get the idea). Something like this:


Pooling – You can think of Pooling as a sort of compression operation. What we basically do is divide each Feature Map into a set of non-overlapping ‘boxes’ and replace each box with a representative based on the values inside it. This representative could either be the maximum value (called Max-Pooling) or the mean (called Average-Pooling). The intuition behind this step is to reduce noise and retain the most interesting parts of the data (or summarize it) to provide to the next layer. It also allows the future layers to analyze larger portions of the image without having to increase filter size.
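A 2x2 Max-Pooling step, for example, boils down to this (a small sketch):

    # Sketch: 2x2 max-pooling over non-overlapping boxes of a feature map.
    import numpy as np

    def max_pool(fmap, box=2):
        h, w = fmap.shape
        h, w = h - h % box, w - w % box                 # drop ragged edges for simplicity
        blocks = fmap[:h, :w].reshape(h // box, box, w // box, box)
        return blocks.max(axis=(1, 3))                  # one representative value per box

    fmap = np.random.rand(6, 6)
    print(max_pool(fmap).shape)   # (3, 3)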


Typical CNNs used in deep learning have multiple such Convolution + Pooling layers, each caring less and less about the actual pixel values and more about the general content of the image. Feature Maps at Layer N+1 will take inputs from all the compressed/pooled maps from Layer N in a typical scenario. Moreover, the number of Feature Maps at each layer is not a constant, and is usually decided by trial-and-error (as are most design decisions in Machine Learning).

Recreating the Content of an Image

Neural networks in general have a very handy property: the ability to work in reverse (well, sort of). Basically, "How do I change the current input so that it yields a certain output?". Let's see how.

Consider a CNN C, trained to recognize animals in input images. Given a genuine photo of a dog, the CNN might be able to classify it correctly by virtue of its convolutional layers and the final classifier. But now suppose I show it an image of just…clouds. Forget the final classifier, the intermediate layers are more interesting here. Since C was originally trained to look for features of animals, that is exactly what it will try to do here! It might interpret random clouds and shapes as animals/parts of animals – a form of artificial pareidolia (the psychological phenomenon of perceiving patterns where none exist).

You can actually visualize what a particular layer of the CNN interprets from the image. Suppose the original cloud-image was I_c:


Say at a certain level l of C, the Feature Maps gave an output  F_l based on I_c.

What we will do now, is provide C with a white-noise image I_n:


This sort-of works like a blank-slate for C, since it has no real information to interpret (though C can still ‘see’ patterns, but very very vaguely). Now, using the process of Gradient Descent, we can make C modify I_n so that it yields an output close to F_l at level l.

What it essentially does is iteratively shift the pixel values of I_n until its output at l is similar to that of I_c. One key point: even after the end of this process, I_n will not really become the same as I_c. Think about it – you have recreated I_c based on the CNN's interpretation of I_c, which involves a lot of intermediate convolutions and pooling. The higher the level l you choose for re-creating the image, the deeper the pareidolia based on the CNN's training – or the more 'abstract' the interpretations.

In fact, this is pretty similar to what DeepDream does for understanding what a deep CNN has ‘learnt’ from its training. The cloud image I showed earlier was indeed used with a CNN trained to recognize animals, leading to some pretty weird imagery:



Now, the paper we use as reference wants to recreate the content of an image pretty accurately, so how do we avoid such misinterpretation of shapes? The answer lies in the use of a powerful CNN trained to recognize a wide variety of objects – like the one developed by Oxford’s Visual Geometry Group (VGG) – VGGNet. VGGNet is freely available online, pre-trained and ready-made (Tensorflow example).

Recreating the Style of an Image

In the last section, we saw that the output from Feature Maps at a certain level (F_l) could be used as a ‘goal’ to recreate an image with conceptually similar content. But what about style or texture?

Intuitively speaking, the style of an image is not as much about the actual objects in it, but rather the co-occurrence of features/shape in the overall visual (Reference). This idea is quantified by the Gramian matrix with respect to the Feature Maps: G(F_l).

Suppose we have n different Feature Maps at level l of CNN C. G(F_l) is a matrix of dimensions n x n, with the element at position [i, j] being the inner product between Feature Maps i and j. Quoting an answer from this Stack-Exchange question, "the inner product between x and y is indicative of how much of y could be described using x". Essentially, in this case, it quantifies how similar the trends are between the numbers present in Feature Maps i and j ("do triangles and circles occur together in this image?").

Thus, G(F_l) is used as the Gradient-Descent ‘goal’ instead of F_l while re-creating the artistic style of a photo/image.
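As a small sketch, computing G(F_l) from a stack of n feature maps is just a matrix product of the flattened maps:

    # Sketch: Gram matrix of n feature maps at one layer.
    # G[i, j] is the inner product of (flattened) feature maps i and j.
    import numpy as np

    def gram_matrix(feature_maps):
        # feature_maps: array of shape (n, height, width)
        n = feature_maps.shape[0]
        flat = feature_maps.reshape(n, -1)
        return flat @ flat.T                  # (n, n) matrix of co-occurrence strengths

    maps = np.random.rand(16, 32, 32)         # e.g. 16 feature maps of size 32x32
    print(gram_matrix(maps).shape)            # (16, 16)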

The following stack shows style (not content) recreations of the Composition-VII painting by Kandinsky. As you go lower, the images are based on progressively higher/deeper layers of the CNN:


As you will notice, higher layers tend to reproduce more complex and detailed strokes from the original image. This could be attributed to the capture of more high-level details by virtue of feature-extraction and pooling in the Convolutional Network.

Combining Content and Style from two different Images

That brings us to the final part – combining the above two concepts to achieve something like this:


Gradient Descent always considers a target ‘error function’ to minimize while performing optimization. Given two vectors x and y, let this function be denoted by \Lambda(x, y).

Suppose you want to generate an image that has the content of image I_c in the style of image I_s. The white-noise image you start out with, is I_n. Let F^{I} be the output given by a certain set of feature maps based on image I.

Now, if you were only looking to recreate content from I_c, you would be minimizing:

\Lambda(F^{I_n}, F^{I_c})

If you were only interested in the style from I_s, you would minimize:

\Lambda(G(F^{I_n}), G(F^{I_s}))

Combining the two, you get a new function for minimizing:

\alpha*\Lambda(F^{I_n}, F^{I_c}) + \beta*\Lambda(G(F^{I_n}), G(F^{I_s}))

\alpha and \beta are basically the weightage you give to the content and style respectively.
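In (very) simplified code, the combined objective looks like the sketch below; the actual paper sums the style term over several layers with per-layer weights, which I am glossing over here:

    # Simplified sketch of the combined style-transfer objective.
    import numpy as np

    def gram(F):                               # F: (n, h, w) stack of feature maps
        flat = F.reshape(F.shape[0], -1)
        return flat @ flat.T

    def total_loss(F_In, F_Ic, F_Is, alpha, beta):
        content_term = np.sum((F_In - F_Ic) ** 2)              # Lambda(F^{I_n}, F^{I_c})
        style_term = np.sum((gram(F_In) - gram(F_Is)) ** 2)    # Lambda(G(F^{I_n}), G(F^{I_s}))
        return alpha * content_term + beta * style_term

    F_In, F_Ic, F_Is = (np.random.rand(8, 16, 16) for _ in range(3))
    print(total_loss(F_In, F_Ic, F_Is, alpha=1.0, beta=100.0))
    # Gradient descent then nudges the pixels of I_n (and hence F_In) to reduce this value.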

The tiles shown below depict output from the same convolutional layer, but with higher values of \alpha / \beta as you go to the right:


Pretty cool, isn’t it?

Understanding the new Google Translate


Google launched a new version of Translate in September 2016. Since then, there have been a few interesting developments in the project, and this post attempts to explain it all in as simple terms as possible.

The earlier version of Translate used Phrase-based Machine Translation, or PBMT. What PBMT does is break up an input sentence into a set of words/phrases and translate each one individually. This is obviously not an optimal strategy, since it completely misses out on the context of the overall sentence. The new Translate uses what Google calls Google Neural Machine Translation (GNMT), an improvement over a traditional version of NMT. Let's see how GNMT works at a high level:

The Encoder

Before you understand the encoder, you must understand what an LSTM (Long Short-Term Memory) cell is. It is basically a Neural Network with some concept of memory. An LSTM is generally used to 'learn' patterns in time-series/temporal data. At any given point, it accepts the latest input vector and produces the intended output using a combination of (the latest input + some 'context' regarding what it saw before):


In the above picture, x_t is the input at time t. h_{t-1} represents the context from t-1. If x_t has a dimensionality of d, h_{t-1} of dimensionality 2d is a concatenation of two vectors:

  1. The intended output by the same LSTM at the last time-step t-1 (the Short Term memory), and
  2. Another d-dimensional vector encoding the Long Term memory – also called the Cell State.

The second part is usually not of use for the next component in the architecture. It is instead used by the same LSTM for the following step. LSTMs are usually trained by providing them with a ton of example input-series with the expected outputs. This enables them to learn what parts of the input to retain/hold, and how to mathematically process x_t and h_{t-1} to come up with h_t. If you wish to understand LSTMs better, I recommend this blog post by Christopher Olah.
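For the curious, here is a bare-bones sketch of a single LSTM step with the standard gate equations (biases and batching omitted; this is the generic textbook cell, not GNMT's exact variant):

    # Bare-bones sketch of one LSTM step: (x_t, h_{t-1}, c_{t-1}) -> (h_t, c_t).
    # h is the short-term output, c is the long-term cell state.
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_t, h_prev, c_prev, W):
        z = np.concatenate([x_t, h_prev])      # latest input + previous context
        f = sigmoid(W['f'] @ z)                # forget gate: what to drop from the cell state
        i = sigmoid(W['i'] @ z)                # input gate: what new information to store
        o = sigmoid(W['o'] @ z)                # output gate: what to expose as h_t
        c_t = f * c_prev + i * np.tanh(W['c'] @ z)
        h_t = o * np.tanh(c_t)
        return h_t, c_t

    d = 4                                      # toy dimensionality
    W = {k: np.random.randn(d, 2 * d) for k in 'fioc'}
    h, c = lstm_step(np.random.randn(d), np.zeros(d), np.zeros(d), W)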

An LSTM can also be ‘unfolded’, as shown below:


Don't worry, they are copies of the same LSTM cell (hence the same training), each feeding its output to the next one in line. What this allows us to do is give in the entire set of input vectors (in essence, the whole time-series) all at once, instead of going step-by-step with a single copy of the LSTM.

GNMT’s encoder network is essentially a series of stacked LSTMs:


Each horizontal line of pink/green boxes is an 'unfolded' LSTM on its own. The above figure therefore has 8 stacked LSTMs in a series. The input to the whole architecture is the ordered set of tokens in the sentence, each represented in the form of a vector. Mind you, I said tokens – not words. What GNMT does in pre-processing is break up all words into tokens/pieces, which are then fed as a series to the neural network. This enables the framework to (at least partially) understand unseen complicated words. For example, suppose I say the word 'Pteromerhanophobia'. Even though you may not know exactly what it is, you can tell me that it is some sort of fear based on the token 'phobia'. Google calls this approach Wordpiece modeling. The break-up of words into tokens is done based on statistical learning (which group of tokens makes the most sense?) from a huge vocabulary in the training phase.

When you stack LSTMs, each layer learns a pattern in the time series fed to it by the earlier (lower) layer. As you go higher up the ladder, you see more and more abstract patterns from the data that was fed into the lowest layer. For example, the lowest layer might see a set of points and deduce a line, the next layer will see a set of lines and deduce a polygon, the next will see a set of polygons and learn an object, and so on… Of course, there is a limit to how many and in what way you should stack LSTMs together – more is not always better, since you will ultimately end up with a model that's too slow and difficult to train.

There are a few interesting things about this architecture shown above, apart from the stacking of LSTMs.


You will see that the second layer from the bottom is green in color. This is because the arrows – that is, the ordering of tokens in the sentence – are reversed for this layer. Which means that the second LSTM sees the entire sentence in reverse order. The reason for doing this is simple: when you look at a sentence as a whole, the 'context' for any word is not just contained in the words preceding it, but also in the words following it. The two bottom-most layers both see the raw sentence as input, but in opposite order. The third LSTM gets this bidirectional input from the first two layers – basically, a combination of the forward and backward context for any given word. Each layer from this point on learns higher-level patterns in the contextual meanings of words in the sentence.

You might also have noticed the ‘+’ signs that appear before providing inputs to the fifth layer and above. This is a form of Residual Learning. This is what happens from layer 5 onwards: For every layer N+1, the input is an addition of the output of layers N and N-1. Take a look at my post on Residual Neural Networks to get a better understanding of what this does.


Lastly, you can see the extra <2es> and </s> characters at the end of the input to the encoder. </s> represents ‘end of input’. <2es>, on the other hand, represents the Target Language – in this case, Spanish. GNMT does this unique thing where they provide the Target Language as input to the framework, to improve performance of Translate. More on this later.

Attention Module and the Decoder


The Encoder produces a set of ordered output-vectors (one for each token in the input). These are then fed into the Attention Module & Decoder framework. To a large extent, the Decoder is similar to the Encoder in design – stacked LSTMs and residual connections. Let's discuss the parts that are different.

I have already mentioned that GNMT considers the entire sentence as input, in every sense. However, it is intuitive to think that for every token the decoder produces, it should not give equal weightage to all vectors (tokens) in the input sentence. As you write out one part of the story, your focus should slowly drift to the rest of it. This work is done by the Attention Module. What the Attention Module gets as input is the complete output of the Encoder and the latest vector from the Decoder stack. This lets it 'understand' how much/what has already been translated, and it then directs the Decoder to shift attention to the other parts of the Encoder output.

The Decoder LSTM-stack keeps outputting vectors based on the input from the Encoder and directions from the Attention module. These vectors are given to the Softmax Layer. You can think of the Softmax Layer as a Probability distribution-generator. Based on the incoming vector from the topmost LSTM, the Softmax Layer assigns a probability to every possible output token (remember the target language was already provided to the Encoder, so that information has already been propagated). The token that gets the maximum probability is written out.
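The Softmax itself is a tiny piece of math; a sketch of how it turns the decoder's scores into a distribution over tokens:

    # Sketch: turning the decoder's output scores into a probability distribution over tokens.
    import numpy as np

    def softmax(scores):
        exp = np.exp(scores - scores.max())    # subtract the max for numerical stability
        return exp / exp.sum()

    scores = np.array([2.1, 0.3, -1.0, 4.5])   # one score per candidate output token
    probs = softmax(scores)
    print(probs.argmax())                      # the token that gets written out (index 3 here)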

The whole process stops once the Decoder/Softmax decides that the current token is </s> (or end-of-sentence). Note that the Decoder does not have to follow a number of steps equal to the output vectors from the Encoder, since it is paying weighted attention to all of those at every step of computation.

Overall, this is how you can visualize the  complete translation process:


Training & Zero-Shot Translation

The complete framework (Encoder+Attention+Decoder) is trained by providing it a huge collection of (input, translated) pairs of sentences. The architecture ‘knows’ the input language in a sense when it converts tokens from the incoming sentence to the appropriate vector format. The target language is provided as a parameter as well. The brilliance of deep-LSTMs lies in the fact that the neural network learns all of the computational stuff by itself, using a class of algorithms called Backpropagation/Gradient Descent.


Here's another amazing discovery made by the GNMT team: simply by providing the target language as an input to the framework, it is able to perform Zero-Shot Translation! What this basically means is: if during training you provide it examples of English->Japanese & English->Korean translations, GNMT automatically does Japanese->Korean reasonably well! In fact, this is the biggest achievement of GNMT as a project. The intuition: what the Encoder essentially produces is a form of interlingua (or universal language). Whenever I say 'dog' in any language, you end up thinking of a friendly canine – essentially, the concept of 'dog'. This 'concept' is what is produced by the Encoder, and it is irrespective of any language. In fact, some articles went so far as to say that Google's AI had invented a language of its own :-D.

Providing the target language as input allows GNMT to easily use the same neural network for training with any pair of languages, which in turn allows zero-shot translations. As a result, the new Translate gets closer than ever before to the way humans perform translations in their mind.

Here are some references if you want to read further on this subject 🙂 :

  1. First blog post about GNMT on the Google Research Blog. (Corresponding Research Paper)
  2. Second blog post about Zero-Shot Translations. This one made the biggest splash. (Corresponding Research Paper)
  3. A great NYTimes article that tells the story behind this Google Translate.

On Interpretable Models

Artificial Intelligence is everywhere today. And as intelligent systems get more ubiquitous, the need to understand their behavior becomes equally important. Maybe if you are developing an algorithm to recognize a cat-photo for fun, you don’t need to know how it works as long as it delivers the results. But if you have deployed a model to predict whether a person will default on a loan or not, and you use it to make your decisions, you better be sure you are doing the right thing – for practical, ethical AND legal reasons.

From Dictionary.com,

interpretability: to give or provide the meaning of; explain; explicate; elucidate


Why do models need to be interpretable?

The primary reason we need explainability in AI is to develop a sense of understanding and trust. Think about it – the only way you would ever delegate an important task to someone else is if you had a certain level of trust in their thought process. If, for instance, Siri makes an error in understanding your commands, that's fine. But now consider self-driven cars. The reason most people would not readily sit in a self-driven car within a city is that we cannot guarantee it will do the right thing in every situation. Interpretability is thus crucial for building trust towards models, especially in domains like healthcare, finance and the judicial system.

Interpretability is also important while debugging problems in a model's performance. These problems might be caused by the algorithm itself, or by the data used to train it. And you may not really observe these issues until you deploy a compound system that uses this model. Let's take the example of Google's Word2Vec. Word2Vec is currently one of the best algorithms for computing word-embeddings given a significant amount of text. It was originally trained on a 3 million-word corpus of Google News articles. In some research conducted by people from Boston University and Microsoft Research, they found a ton of hidden sexism in the word-embeddings generated from that dataset. For example, the framework came up with this particular analogy: "man : computer programmer :: woman : homemaker". Funny ain't it? This was not a problem with the algorithm itself, but with the way news articles are usually written. Quoting the source, "Any bias contained in word embeddings like those from Word2vec is automatically passed on in any application that exploits it.".

How do we increase interpretability of models?

There are two ways to promote interpretability when it comes to Machine Learning/AI systems: Transparency, and Post-Hoc Explainability. Algorithmic transparency would mean that you understand the way your model works on an intuitive level, with respect to the dataset you used for training. A Decision Tree, for example, is pretty transparent – in fact, you can use the paths from the root to every leaf node to decompose the Tree into the set of rules used for classification. But a deep Neural Network is not so transparent, for obvious reasons. Though you may understand linear algebra and back-propagation, you will typically not be able to make sense of the weights/biases learned by a deep-NN after training.

Transparency has two aspects: Decomposability, and Simultaneity. Decomposability would mean understanding each individual component of your model. In essence, there should not be a ‘black-box’ component of the system in your eyes. Simultaneity, on the other hand, indicates an understanding of how all these individual components work together as a whole. And the former does not necessarily imply the latter – consider an algorithm as simple as linear regression. You would probably know that if the weight with respect to a predictor is positive after training, it shows a direct proportionality to the target variable. Now, if you train a simple linear regression of Disease-risk vs Vaccination, you would most probably get a negative weight on the Vaccination variable. But if you now take tricky factors such as immunodeficiency or age (old-age or infancy) into the picture, the weight might take on a whole different value. In fact, as the number of predictor variables goes on increasing in regression, it gets more and more difficult to understand how your model will behave as a whole. And thus, the notion that a ‘simple’ model (linear regression) would be far easier to interpret than a ‘complex’ one (deep learning) is misleading.

Post-Hoc means ‘occurring after an event’. In the context of model transparency, post-hoc interpretation would mean an effort to understand its behavior after it has finished training, typically using some test inputs. Some models/algorithms inherently have the ability to ‘explain’ their behavior. For example, take k-NN classifiers. Along with the required output, you can hack (with minimal effort) the model to return the k-nearest neighbors as examples for scrutiny. This way, you get a good idea of the combination of properties that produce similar results by looking at the known training points.
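With scikit-learn, for instance, pulling out those neighbors is one extra call (a small sketch on the toy Iris dataset, not tied to any particular deployment):

    # Sketch: post-hoc 'explanation' of a k-NN prediction via the neighbors behind it.
    from sklearn.datasets import load_iris
    from sklearn.neighbors import KNeighborsClassifier

    data = load_iris()
    model = KNeighborsClassifier(n_neighbors=3).fit(data.data, data.target)

    query = data.data[:1]                               # one example to explain
    prediction = model.predict(query)
    distances, neighbor_ids = model.kneighbors(query)   # the k training points behind the answer

    print(prediction, data.target[neighbor_ids[0]])     # compare the answer with its 'evidence'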

Most algorithms don't have such easy post-hoc interpretability, though. In such cases, you have to use techniques such as visualization to understand how they behave/work. For instance, you could use a dimensionality reduction technique such as t-SNE to reduce vector points to 2-3 dimensions and visualize class 'regions' in 2D/3D space. Essentially, you are enabling easy visualization of higher-dimensional data by embedding it in a lower-dimensional space. Saliency maps are another technique, used to interpret deep neural networks. In Natural Language Processing, textual explanations are also being adopted. Since humans usually understand words better than raw numbers, providing text-based explanations makes sense. For example, in a system like LSI, you could 'understand' a word's embedding by (proportionately) looking at the words that strongly belong to the latent topic(s) it most relates to.

Conclusion and further reading

I did kind-of imply that interpretability is required so that we end up trusting automated systems as much as we trust humans. But as it turns out, it's not like human actions are perfectly explainable. There is a ton of research in psychology that clearly indicates that the motivations for our actions are not as clear as we ourselves tend to believe. The Illusion of Conscious Will by Daniel Wegner talks about how our decisions tend to be influenced by subconscious processes without us realizing it. Moreover, it seems contradictory to the ultimate aim of AI to avoid building models that we cannot 'understand'. If there will be machine intelligence smarter than us, the likelihood of us understanding it completely is pretty slim (Terminator, anyone?).

Here are a couple of links for you to look at, if you want to read more:

  1. EU’s General Data Protection Regulation (GDPR) makes interpretability a top priority.
  2. David Gunning’s summarization of the work on Explainable AI at the International Joint Conference on Artificial Intelligence.
  3. “Is Artificial Intelligence Permanently Inscrutable?” by Aaron Bornstein (Has some good discussion in the Comments section)