# Weekly Review: 12/03/2017

Missed a post last week due to the Thanksgiving long weekend :-). We had gone to San Francisco to see the city and try out a couple of hikes). Just FYI – strolling around SF is also as much a hike as any of the real trails at Mt Sutro – with all the uphill & downhill roads! As for Robotics, I am currently on Week 3 of the Mobility course, which is more of physics than ‘computer science’; its a welcome change of pace from all the ML/CS stuff I usually do.

Numenta – Secret to Strong AI

In this article, Numenta‘s cofounder discusses what we would need to push current AI systems towards general intelligence. He points out that many industry experts (including Jeff Bezos & Geoffrey Hinton) have opined that it would take far more than scaling up current intelligent systems, to achieve the next ‘big leap’.

Numenta’s goal as such is to take inspiration from the human brain (especially the neocortex) to design the next generation of machine intelligence. The article describes how the neocortex uses abstract ‘locations’ to understand sensory input and form mental representations. To read more of Numenta’s research, visit this page.

Transfer Learning

This article, though not presenting any ‘new findings’, is a fun-to-read introduction to Transfer Learning. It focusses on the different ways TL can be applied in the context of Neural Networks.

It provides examples of how pre-trained networks can be ‘retrained’ over new data by freezing/unfreezing certain layers during backpropagation. The blogpost also provides a bunch of useful links, such as this discussion on Stanford CS231.

Structured Deep Learning

This article motivates the need for embedding vectors in Deep Learning. One of the challenges of using SQL-ish data for deep learning, is the involvement of categorical attributes. The usual ways of dealing with such variables in ML is to use one-hot encodings, or find an integer representation for each possible value.

However, 1) one-hot encodings increase the memory footprint of a NN & 2) assigning integers to ordinal values implies a wrong meaning to neural networks, which are inherently continuous/numeric in nature. For example, Sunday=1 & Saturday=7 for a ‘week’ enum might lead the NN to believe that Sundays and Saturdays are very far apart, which is not usually true.

Hence, learning vectorial embeddings for ordinal attributes is perhaps the right way to go for most applications. While we usually know embeddings in the context of words (Word2Vec, LDA, etc), similar techniques can be used to other enum-style values as well.

Population-based Training

This blog-post by Deepmind presents a novel approach to coming up with the hyperparameters for Neural-Network training. It essentially brings in the methodology of Genetic Algorithms for designing optimal network architectures.

While standard hyperparameter-tuning methods perform some kind of random search, Population-based training (PBT) allows each candidate ‘worker’ to take inspiration from the best candidates in the current population (similar to mating in GAs) while allowing for random perturbations in parameters for exploration (a.la. GA mutations.)

# How Neural Networks generate Visual Art from inspiration

Since my last blog post on Google Translate, I have been reading the earlier articles on Google’s Research Blog. Their work on generative AI particularly caught my eye, where they have tried building models to create art/imagery using deep learning.

Announced back in 2015, DeepDream has fascinated a lot of people with its ability to interpret images in fascinating ways to ‘dream up’ complicated visuals where none exist.

Talking about creating beautiful pictures, we also had apps like Prisma and DeepForger that transformed user-given photos in the manner of famous/standard art-styles to create some stunning work.

In this post, I attempt to give an intuitive explanation for this paper: A Neural Algorithm of Artistic Style by Gatys, Ecker and Bethge. The aim of this work is pretty similar to what Prisma actually does, i.e. combining the content from one image with the artistic style of another to fabricate a new image. On the way we will also get some glimpse into how DeepDream works.

### Convolutional Neural Networks

Before we delve into creation of images, lets get a high-level understanding of how deep learning typically understands them. Convolutional Neural Networks (CNNs) are state-of-the-art when it comes to image analysis. Assuming you know what a basic Neural Network is, heres a simplified depiction of a Convolutional Network:

Layers 1 & 2 are what make CNNs special; the final ‘classifier’ is just a standard fully-connected network.

Both layer 1 and 2 are performing two different operations on the input:

1. Convolution
2. Pooling

In the Convolution step, we compute a set of Feature Maps using the previous layer. A Feature Map typically has the same dimensions as the input ‘image’, but there’s a difference in the way its neurons are connected to the preceding layer. Each one is only connected to a small local area around its position (see image). Whats more, the set of weights that every neuron uses is the same. This set of shared weights is also called a filter.

Intuitively, you can say that each node in the Feature Map is essentially looking for the same concept, but in a limited area. This gives CNNs a very powerful trait: the ability to detect features irrespective of their position in the actual image. Since every neuron is trained to detect the same entity (shared weights), one or the other will fire incase the corresponding object happens to be in the input – irrespective of the exact location. Also worth noting is the fact that neighboring neurons in the Map will analyze partially intersecting portions of the previous layer, so we haven’t really done any hard ‘segmentation’.

In the set of Feature Maps at a particular level, each one looks for their own concept which they learnt during training. As you go higher and higher up the overall layers, these sets of Maps start looking for progressively higher-level objects. The first set (in the lowest layer) might look for lines/circles/curves, the next one might detect shapes of eyes/noses/etc, while the topmost layers will ultimately understand complete faces (an over-simplification, but you get the idea). Something like this:

Pooling – You can think of Pooling as a sort of compression operation. What we basically do is divide each Feature Map into a set of non-overlapping ‘boxes’ and replace each box with a representative based on the values inside it. This representative could either be the maximum value (called Max-Pooling) or the mean (called Average-Pooling). The intuition behind this step is to reduce noise and retain the most interesting parts of the data (or summarize it) to provide to the next layer. It also allows the future layers to analyze larger portions of the image without having to increase filter size.

Typical CNNs used in deep learning have multiple such Convolution + Pooling layers, each caring lesser and lesser about the actual pixel values and more about the general content of the image. Feature Maps at Layer $N+1$ will take inputs from all the compressed/pooled maps from Layer $N$ in a typical scenario. Moreover, the number of Feature Maps at each layer is not a constant, and is usually decided by trial-and-error (as are most design decisions in Machine Learning).

### Recreating the Content of an Image

Neural networks in general have a very handy property: The ability to work in reverse (well, sort-of). Basically, “How do I change the current input so that it yields a certain output?“. Lets see how.

Consider a CNN $C$, trained to recognize animals in input images. Given a genuine photo of a dog, the CNN might be able to classify it correctly by virtue of its convolutional layers and the final classifier. But now suppose I show it an image of just…clouds. Forget the final classifier, the intermediate layers are more interesting here. Since $C$ was originally trained to look for features of animals, that is exactly what it will try to do here! It might interpret random clouds and shapes as animals/parts of animals – a form of artificial pareidolia (the psychological phenomenon of perceiving patterns where none exist).

You can actually visualize what a particular layer of the CNN interprets from the image. Suppose the original cloud-image was $I_c$:

Say at a certain level $l$ of $C$, the Feature Maps gave an output $F_l$ based on $I_c$.

What we will do now, is provide $C$ with a white-noise image $I_n$:

This sort-of works like a blank-slate for $C$, since it has no real information to interpret (though $C$ can still ‘see’ patterns, but very very vaguely). Now, using the process of Gradient Descent, we can make $C$ modify $I_n$ so that it yields an output close to $F_l$ at level $l$.

What it essentially does, is iteratively shift the pixel values of $I_n$ until its output at $l$ is similar to that of $I_c$. One key point: Even after the end of this process, $I_n$ will not really become the same as $I_c$. Think about it – you have recreated $I_c$ based on the CNN’s interpretation of $I_c$, which involves a lot of intermediate convolutions and pooling. The higher the level $l$ you choose for re-creating the image, the deeper the pareidolia based on the CNN’s training – or more ‘abstract’ the interpretations.

In fact, this is pretty similar to what DeepDream does for understanding what a deep CNN has ‘learnt’ from its training. The cloud image I showed earlier was indeed used with a CNN trained to recognize animals, leading to some pretty weird imagery:

Now, the paper we use as reference wants to recreate the content of an image pretty accurately, so how do we avoid such misinterpretation of shapes? The answer lies in the use of a powerful CNN trained to recognize a wide variety of objects – like the one developed by Oxford’s Visual Geometry Group (VGG) – VGGNet. VGGNet is freely available online, pre-trained and ready-made (Tensorflow example).

### Recreating the Style of an Image

In the last section, we saw that the output from Feature Maps at a certain level ($F_l$) could be used as a ‘goal’ to recreate an image with conceptually similar content. But what about style or texture?

Intuitively speaking, the style of an image is not as much about the actual objects in it, but rather the co-occurrence of features/shape in the overall visual (Reference). This idea is quantified by the Gramian matrix with respect to the Feature Maps: $G(F_l)$.

Suppose we have $n$ different Feature Maps at level $l$ of CNN $C$. $G(F_l)$ is a matrix of dimensions $n X n$, with the element at position $[i, j]$ being the inner product between Feature Maps $i$ and $j$. Quoting an answer from this Stack-Exchange question, “the inner product between $x$ and $y$ is indicative of how much of $y$ could be described using $x$“. Essentially, in this case, it quantifies how similar are the trends between the numbers present in Feature Maps $i$ and $j$ (“do triangles and circles occur together in this image?”).

Thus, $G(F_l)$ is used as the Gradient-Descent ‘goal’ instead of $F_l$ while re-creating the artistic style of a photo/image.

The following stack shows style (not content) recreations of the Composition-VII painting by Kandinsky . As you go lower, the images are based on progressively higher/deeper layers of the CNN:

As you will notice, higher layers tend to reproduce more complex and detailed strokes from the original image. This could be attributed to the capture of more high-level details by virtue of feature-extraction and pooling in the Convolutional Network.

### Combining Content and Style from two different Images

That brings us to the final part – combining the above two concepts to achieve something like this:

Gradient Descent always considers a target ‘error function’ to minimize while performing optimization. Given two vectors $x$ and $y$, let this function be denoted by $\Lambda(x, y)$.

Suppose you want to generate an image that has the content of image $I_c$ in the style of image $I_s$. The white-noise image you start out with, is $I_n$. Let $F^{I}$ be the output given by a certain set of feature maps based on image $I$.

Now, if you were only looking to recreate content from $I_c$, you would be minimizing:

$\Lambda(F^{I_n}, F^{I_c})$

If you were only interested in the style from $I_s$, you would minimize:

$\Lambda(G(F^{I_n}), G(F^{I_s}))$

Combining the two, you get a new function for minimizing:

$\alpha*\Lambda(F^{I_n}, F^{I_c}) + \beta*\Lambda(G(F^{I_n}), G(F^{I_s}))$

$\alpha$ and $\beta$ are basically the weightage you give to the content and style respectively.

The tiles shown below depict output from the same convolutional layer, but with higher values of $\alpha / \beta$ as you go to the right:

Pretty cool, isn’t it?

# Predicting Trigonometric Waves few steps ahead with LSTMs in TensorFlow

I have recently been revisiting my study of Deep Learning, and I thought of doing some experiments with Wave prediction using LSTMs. This is nothing new, just more of a log of some tinkering done using TensorFlow.

### The Problem

The basic input to the model is a 2-D vector – each number corresponding to the value attained by the corresponding wave. Each wave in turn is: (a constant + a sine wave + a cosine wave). The waves themselves have different magnitudes, initial phases and frequencies. The goal is to predict the values that will be attained a certain (I chose 23) steps ahead on the curve.

So first off, heres the wave-generation code:


##Producing Training/Testing inputs+output
from numpy import array, sin, cos, pi
from random import random

#Random initial angles
angle1 = random()
angle2 = random()

#The total 2*pi cycle would be divided into 'frequency'
#number of steps
frequency1 = 300
frequency2 = 200
#This defines how many steps ahead we are trying to predict
lag = 23

def get_sample():
"""
Returns a [[sin value, cos value]] input.
"""
global angle1, angle2
angle1 += 2*pi/float(frequency1)
angle2 += 2*pi/float(frequency2)
angle1 %= 2*pi
angle2 %= 2*pi
return array([array([
5 + 5*sin(angle1) + 10*cos(angle2),
7 + 7*sin(angle2) + 14*cos(angle1)])])

sliding_window = []

for i in range(lag - 1):
sliding_window.append(get_sample())

def get_pair():
"""
Returns an (current, later) pair, where 'later' is 'lag'
steps ahead of the 'current' on the wave(s) as defined by the
frequency.
"""

global sliding_window
sliding_window.append(get_sample())
input_value = sliding_window[0]
output_value = sliding_window[-1]
sliding_window = sliding_window[1:]
return input_value, output_value



Essentially, you just need to call get_pair to get an ‘input, output’ pair – the output being 23 time intervals ahead on the curve. Each have the NumPy dimensionality of [1, 2]. The first value ‘1’ means that the batch size is 1 – we will feed one input at a time while training/testing.

Now, I don’t pass the input directly into the LSTM. I try to improve the LSTM’s understanding of the input, by providing its first and second derivatives as well. So, if the input at time t is x(t), the derivative is x'(t) = (x(t) – x(t-1)). Following the analogy, x”(t) = (x'(t) – x'(t-1)). Here’s the code for that:


#Input Params
input_dim = 2

#To maintain state
last_value = array([0 for i in range(input_dim)])
last_derivative = array([0 for i in range(input_dim)])

def get_total_input_output():
"""
Returns the overall Input and Output as required by the model.
The input is a concatenation of the wave values, their first and
second derivatives.
"""
global last_value, last_derivative
raw_i, raw_o = get_pair()
raw_i = raw_i[0]
l1 = list(raw_i)
derivative = raw_i - last_value
l2 = list(derivative)
last_value = raw_i
l3 = list(derivative - last_derivative)
last_derivative = derivative
return array([l1 + l2 + l3]), raw_o



So the overall input to the model becomes a concatenated version of x(t), x'(t), x”(t). The obvious question to ask would be- Why not do this in the TensorFlow Graph itself? I did try it, and for some reason (which I don’t understand yet), there seems to seep in some noise into the Variables that act as memory units to maintain state.

But anyways, here’s the code for that too:


#Imports
import tensorflow as tf
from tensorflow.models.rnn.rnn import *

#Input Params
input_dim = 2

##The Input Layer as a Placeholder
#Since we will provide data sequentially, the 'batch size'
#is 1.
input_layer = tf.placeholder(tf.float32, [1, input_dim])

##First Order Derivative Layer
#This will store the last recorded value
last_value1 = tf.Variable(tf.zeros([1, input_dim]))
#Subtract last value from current
sub_value1 = tf.sub(input_layer, last_value1)
#Update last recorded value
last_assign_op1 = last_value1.assign(input_layer)

##Second Order Derivative Layer
#This will store the last recorded derivative
last_value2 = tf.Variable(tf.zeros([1, input_dim]))
#Subtract last value from current
sub_value2 = tf.sub(sub_value1, last_value2)
#Update last recorded value
last_assign_op2 = last_value2.assign(sub_value1)

##Overall input to the LSTM
#x and its first and second order derivatives as outputs of
#earlier layers
zero_order = last_assign_op1
first_order = last_assign_op2
second_order = sub_value2
#Concatenated
total_input = tf.concat(1, [zero_order, first_order, second_order])



If you have an idea of what might be going wrong, do leave a comment! In any case, the core model follows.

### The Model

So heres the the TensorFlow model:

1) The Imports:


#Imports
import tensorflow as tf
from tensorflow.models.rnn.rnn import *



2) Our input layer, as always, will be a Placeholder instance with the appropriate type and dimensions:


#Input Params
input_dim = 2

##The Input Layer as a Placeholder
#Since we will provide data sequentially, the 'batch size'
#is 1.
input_layer = tf.placeholder(tf.float32, [1, input_dim*3])



3) We then define out LSTM layer. If you are new to Recurrent Neural Networks or LSTMs, here are two excellent resources:

1. This blog post by Christopher Olah
2. This deeplearning.net post. It defines the math behind the LSTM cell pretty succinctly.

If you like to see implementation-level details too, then heres the relevant portion of the TensorFlow source for you.

Now the LSTM layer:


##The LSTM Layer-1
#The LSTM Cell initialization
lstm_layer1 = rnn_cell.BasicLSTMCell(input_dim*3)
#The LSTM state as a Variable initialized to zeroes
lstm_state1 = tf.Variable(tf.zeros([1, lstm_layer1.state_size]))
#Connect the input layer and initial LSTM state to the LSTM cell
lstm_output1, lstm_state_output1 = lstm_layer1(input_layer, lstm_state1,
scope=&quot;LSTM1&quot;)
#The LSTM state will get updated
lstm_update_op1 = lstm_state1.assign(lstm_state_output1)



We only use 1 LSTM layer. Providing a scope to the LSTM layer call (on line 8) helps in avoiding variable-scope conflicts if you have multiple LSTM layers.

The LSTM layer is followed by a simple linear regression layer, whose output becomes the final output.


##The Regression-Output Layer1
#The Weights and Biases matrices first
output_W1 = tf.Variable(tf.truncated_normal([input_dim*3, input_dim]))
output_b1 = tf.Variable(tf.zeros([input_dim]))
#Compute the output
final_output = tf.matmul(lstm_output1, output_W1) + output_b1



We have finished defining the model itself. But now, we need to initialize the training components. These help fine-tune the parameters/state of the model to make it ready for deployment. We won’t be using these components post training (ideally).

4) First, a Placeholder for the correct output associated with the input:


##Input for correct output (for training)
correct_output = tf.placeholder(tf.float32, [1, input_dim])



Then, the error will be computed using the LSTM output and the correct output as the Sum-of-Squares loss.


##Calculate the Sum-of-Squares Error
error = tf.pow(tf.sub(final_output, correct_output), 2)



Finally, we initialize an Optimizer to adjust the weights for the LSTM layer. I tried Gradient Descent, RMSProp as well as Adam Optimization. Adam works best for this model. Gradient Descent works really bad on LSTMs for some reason (that I can’t grasp right now). If you want to read more about Adam-Optimization, read this paper. I decided on the learning rate of 0.0006 after a lot of trial-and-error, and it seems to work best for the number of iterations I use (100k).


##The Optimizer



5) Finally, we initialize the Session and all required Variables as always.


##Session
sess = tf.Session()
#Initialize all Variables
sess.run(tf.initialize_all_variables())



The Training

Here’s the rudimentary code I used for training the model:


##Training

actual_output1 = []
actual_output2 = []
network_output1 = []
network_output2 = []
x_axis = []

for i in range(80000):
input_v, output_v = get_total_input_output()
_, _, network_output = sess.run([lstm_update_op1,
train_step,
final_output],
feed_dict = {
input_layer: input_v,
correct_output: output_v})

actual_output1.append(output_v[0][0])
actual_output2.append(output_v[0][1])
network_output1.append(network_output[0][0])
network_output2.append(network_output[0][1])
x_axis.append(i)

import matplotlib.pyplot as plt
plt.plot(x_axis, network_output1, 'r-', x_axis, actual_output1, 'b-')
plt.show()
plt.plot(x_axis, network_output2, 'r-', x_axis, actual_output2, 'b-')
plt.show()



Training takes almost a minute on my Intel i5 machine.

Consider the first wave. Initially, the network output is far from the correct one(The red one is the LSTM output):

But by the end, it fits pretty well:

Similar trends are seen for the second wave:

### Testing

In practical scenarios, the state at which you end training would rarely be the state at which you deploy. Therefore, prior to testing, I ‘fastforward’ both the waves first. Then, I flush the contents of the LSTM cell (mind you, the learned matrix parameters for the individual functions don’t change).


##Testing

for i in range(200):
get_total_input_output()

#Flush LSTM state
sess.run(lstm_state1.assign(tf.zeros([1, lstm_layer1.state_size])))



And here’s the rest of the testing code:


actual_output1 = []
actual_output2 = []
network_output1 = []
network_output2 = []
x_axis = []

for i in range(1000):
input_v, output_v = get_total_input_output()
_, network_output = sess.run([lstm_update_op1,
final_output],
feed_dict = {
input_layer: input_v,
correct_output: output_v})

actual_output1.append(output_v[0][0])
actual_output2.append(output_v[0][1])
network_output1.append(network_output[0][0])
network_output2.append(network_output[0][1])
x_axis.append(i)

import matplotlib.pyplot as plt
plt.plot(x_axis, network_output1, 'r-', x_axis, actual_output1, 'b-')
plt.show()
plt.plot(x_axis, network_output2, 'r-', x_axis, actual_output2, 'b-')
plt.show()



Its pretty similar to the training one, except for one small difference: I don’t run the training op anymore. Therefore, those components of the Graph don’t work at all.

Here’s the correct output with the model’s output for the first wave:

And the second wave:

Thats all for now! I am not a deep learning expert, and I still experimenting with RNNs, so do leave comments/suggestions if you have any! Cheers!