How Neural Networks generate Visual Art from inspiration

Since my last blog post on Google Translate, I have been reading earlier articles on Google’s Research Blog. Their work on generative AI particularly caught my eye: attempts to build models that create art and imagery using deep learning.

Announced back in 2015, DeepDream caught a lot of people's attention with its ability to interpret images in unexpected ways, 'dreaming up' complicated visuals where none exist.

[Image: a sample of DeepDream-generated imagery]

Speaking of creating beautiful pictures, apps like Prisma and DeepForger have transformed user-provided photos in the manner of famous art styles to create some stunning work.

[Image: a photo re-rendered in the style of Picasso]

In this post, I attempt to give an intuitive explanation of this paper: A Neural Algorithm of Artistic Style by Gatys, Ecker and Bethge. The aim of this work is pretty similar to what Prisma actually does, i.e. combining the content of one image with the artistic style of another to fabricate a new image. Along the way, we will also get a glimpse into how DeepDream works.

Convolutional Neural Networks

Before we delve into the creation of images, let's get a high-level understanding of how deep learning typically understands them. Convolutional Neural Networks (CNNs) are the state of the art when it comes to image analysis. Assuming you know what a basic Neural Network is, here's a simplified depiction of a Convolutional Network:

[Image: a simplified Convolutional Network, two Convolution + Pooling layers feeding a fully-connected classifier]

Layers 1 & 2 are what make CNNs special; the final ‘classifier’ is just a standard fully-connected network.
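Before unpacking what those layers do, here is a minimal sketch of such an architecture, assuming a recent TensorFlow installation with the Keras API available (this is my own illustration, not part of the original paper, and the layer sizes are purely illustrative):


#A minimal sketch of the architecture above: two Convolution + Pooling
#stages followed by a fully-connected classifier. Sizes are illustrative.
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(64, 64, 3)),
    #Layer 1: Convolution + Pooling
    layers.Conv2D(16, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    #Layer 2: Convolution + Pooling
    layers.Conv2D(32, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    #The final 'classifier': a standard fully-connected network
    layers.Flatten(),
    layers.Dense(64, activation='relu'),
    layers.Dense(10, activation='softmax'),
])
model.summary()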

Both layer 1 and 2 are performing two different operations on the input:

  1. Convolution
  2. Pooling

In the Convolution step, we compute a set of Feature Maps using the previous layer. A Feature Map typically has the same dimensions as the input 'image', but there's a difference in the way its neurons are connected to the preceding layer. Each one is only connected to a small local area around its position (see the image below). What's more, the set of weights that every neuron uses is the same. This set of shared weights is also called a filter.

[Image: each feature-map neuron connects only to a small local region of the previous layer]

Intuitively, you can say that each node in the Feature Map is essentially looking for the same concept, but in a limited area. This gives CNNs a very powerful trait: the ability to detect features irrespective of their position in the actual image. Since every neuron is trained to detect the same entity (shared weights), one or the other will fire if the corresponding object happens to be in the input, irrespective of the exact location. Also worth noting is the fact that neighboring neurons in the Map analyze partially overlapping portions of the previous layer, so we haven't really done any hard 'segmentation'.
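To make the shared-filter idea concrete, here is a toy NumPy sketch (my own illustration, not from the paper) that slides one 3x3 filter over a small grayscale image to produce a Feature Map:


#Toy convolution with a single shared 3x3 filter: every position of the
#feature map uses the *same* weights, applied to a different local patch.
import numpy as np

image = np.random.rand(8, 8)          #a small grayscale 'image'
filt = np.random.rand(3, 3)           #one shared filter (set of weights)

feature_map = np.zeros((6, 6))        #8 - 3 + 1 = 6 (no padding)
for i in range(6):
    for j in range(6):
        #Each feature-map neuron only looks at a local 3x3 patch
        patch = image[i:i+3, j:j+3]
        feature_map[i, j] = np.sum(patch * filt)

print(feature_map.shape)              #(6, 6)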

Within the set of Feature Maps at a particular level, each one looks for its own concept, learned during training. As you go higher up the layers, these sets of Maps look for progressively higher-level objects. The first set (in the lowest layer) might look for lines/circles/curves, the next one might detect shapes of eyes/noses/etc., while the topmost layers will ultimately recognize complete faces (an over-simplification, but you get the idea). Something like this:

[Image: the feature hierarchy, from edges and curves in early layers to object parts and then whole faces in the topmost layers]

Pooling – You can think of Pooling as a sort of compression operation. What we basically do is divide each Feature Map into a set of non-overlapping ‘boxes’ and replace each box with a representative based on the values inside it. This representative could either be the maximum value (called Max-Pooling) or the mean (called Average-Pooling). The intuition behind this step is to reduce noise and retain the most interesting parts of the data (or summarize it) to provide to the next layer. It also allows the future layers to analyze larger portions of the image without having to increase filter size.

[Image: Max-Pooling over non-overlapping 2x2 boxes of a feature map]
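As a rough NumPy sketch of 2x2 Max-Pooling (assuming, for simplicity, that the map's dimensions divide evenly by 2):


#2x2 Max-Pooling on a 6x6 feature map: each non-overlapping
#2x2 box is replaced by its maximum value.
import numpy as np

feature_map = np.random.rand(6, 6)
pooled = feature_map.reshape(3, 2, 3, 2).max(axis=(1, 3))
print(pooled.shape)  #(3, 3)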

Typical CNNs used in deep learning have multiple such Convolution + Pooling layers, each caring less and less about the actual pixel values and more about the general content of the image. In a typical scenario, Feature Maps at Layer N+1 take inputs from all the compressed/pooled maps of Layer N. Moreover, the number of Feature Maps at each layer is not a constant, and is usually decided by trial and error (as are most design decisions in Machine Learning).

Recreating the Content of an Image

Neural networks in general have a very handy property: the ability to work in reverse (well, sort of). Basically: “How do I change the current input so that it yields a certain output?“ Let's see how.

Consider a CNN C, trained to recognize animals in input images. Given a genuine photo of a dog, the CNN might be able to classify it correctly by virtue of its convolutional layers and the final classifier. But now suppose I show it an image of just…clouds. Forget the final classifier, the intermediate layers are more interesting here. Since C was originally trained to look for features of animals, that is exactly what it will try to do here! It might interpret random clouds and shapes as animals/parts of animals – a form of artificial pareidolia (the psychological phenomenon of perceiving patterns where none exist).

You can actually visualize what a particular layer of the CNN interprets from the image. Suppose the original cloud-image was I_c:

[Image: the cloud image I_c]

Say at a certain level l of C, the Feature Maps gave an output  F_l based on I_c.

What we will do now, is provide C with a white-noise image I_n:

[Image: the white-noise image I_n]

This works sort of like a blank slate for C, since it has no real information to interpret (C can still 'see' patterns in it, but only very vaguely). Now, using Gradient Descent, we can make C modify I_n so that it yields an output close to F_l at level l.

What this essentially does is iteratively shift the pixel values of I_n until its output at l is similar to that of I_c. One key point: even at the end of this process, I_n will not really become the same as I_c. Think about it: you have recreated I_c based on the CNN's interpretation of I_c, which involves a lot of intermediate convolutions and pooling. The higher the level l you choose for re-creating the image, the deeper the pareidolia based on the CNN's training, and the more 'abstract' the interpretations.
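Here is a hedged sketch of that optimization loop, assuming TensorFlow 2.x (so this is just the idea, not the actual DeepDream or paper code). A single random convolutional layer stands in for 'level l' of a trained CNN, and Gradient Descent updates the pixels of I_n rather than any network weights:


#Sketch: recreate 'content' by gradient descent on the input image.
#A random Conv2D layer stands in for level l of a trained CNN; with a
#real network you would use its pretrained layers instead.
import tensorflow as tf

level_l = tf.keras.layers.Conv2D(8, 3, activation='relu')

I_c = tf.random.uniform((1, 64, 64, 3))                #stand-in for the cloud image
F_l = level_l(I_c)                                      #target feature-map output

I_n = tf.Variable(tf.random.uniform((1, 64, 64, 3)))   #white-noise image
opt = tf.keras.optimizers.Adam(learning_rate=0.05)

for step in range(200):
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean(tf.square(level_l(I_n) - F_l))
    grads = tape.gradient(loss, [I_n])
    opt.apply_gradients(zip(grads, [I_n]))              #shift I_n's pixels, not the weights

print(float(loss))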

In fact, this is pretty similar to what DeepDream does for understanding what a deep CNN has ‘learnt’ from its training. The cloud image I showed earlier was indeed used with a CNN trained to recognize animals, leading to some pretty weird imagery:

[Images: DeepDream's animal-like interpretations of the cloud image]

Now, the paper we use as reference wants to recreate the content of an image fairly accurately, so how do we avoid such misinterpretation of shapes? The answer lies in using a powerful CNN trained to recognize a wide variety of objects, such as VGGNet, developed by Oxford's Visual Geometry Group (VGG). VGGNet is freely available online, pre-trained and ready to use (TensorFlow example).
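For instance, a pre-trained VGG can be pulled in with a couple of lines; this sketch assumes the Keras applications module is available (the paper itself used the convolutional layers of the 19-layer VGG network):


#Loading a pre-trained VGG network (a sketch, assuming tf.keras is available).
#include_top=False drops the final classifier, keeping only the
#convolutional feature-extraction layers that matter here.
import tensorflow as tf

vgg = tf.keras.applications.VGG19(weights='imagenet', include_top=False)
vgg.trainable = False   #we never train it; we only read its Feature Maps
vgg.summary()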

Recreating the Style of an Image

In the last section, we saw that the output from Feature Maps at a certain level (F_l) could be used as a ‘goal’ to recreate an image with conceptually similar content. But what about style or texture?

Intuitively speaking, the style of an image is not so much about the actual objects in it, but rather about the co-occurrence of features/shapes across the overall visual (Reference). This idea is quantified by the Gram matrix with respect to the Feature Maps: G(F_l).

Suppose we have n different Feature Maps at level l of CNN C. G(F_l) is then an n x n matrix, with the element at position [i, j] being the inner product between Feature Maps i and j. Quoting an answer from this Stack-Exchange question, “the inner product between x and y is indicative of how much of y could be described using x“. Essentially, in this case, it quantifies how similar the trends are between the values in Feature Maps i and j (“do triangles and circles occur together in this image?”).

Thus, G(F_l) is used as the Gradient-Descent ‘goal’ instead of F_l while re-creating the artistic style of a photo/image.
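Computing G(F_l) is straightforward: flatten each Feature Map into a vector and take all pairwise inner products. A small NumPy sketch (the shapes here are arbitrary, chosen only for illustration):


#Gram matrix of a set of feature maps.
#F has shape (n, h, w): n feature maps of size h x w at level l.
import numpy as np

n, h, w = 4, 8, 8
F = np.random.rand(n, h, w)

flat = F.reshape(n, h * w)       #each map flattened into a vector
G = flat @ flat.T                #G[i, j] = inner product of maps i and j
print(G.shape)                   #(n, n)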

The following stack shows style (not content) recreations of the painting Composition VII by Kandinsky. As you go lower, the images are based on progressively higher/deeper layers of the CNN:

[Image: style reconstructions of Composition VII based on progressively deeper layers]

As you will notice, higher layers tend to reproduce more complex and detailed strokes from the original image. This could be attributed to the capture of more high-level details by virtue of feature-extraction and pooling in the Convolutional Network.

Combining Content and Style from two different Images

That brings us to the final part – combining the above two concepts to achieve something like this:

[Image: the content of a photograph combined with the style of a painting]

Gradient Descent always considers a target ‘error function’ to minimize while performing optimization. Given two vectors x and y, let this function be denoted by \Lambda(x, y).

Suppose you want to generate an image that has the content of image I_c in the style of image I_s. The white-noise image you start out with is I_n. Let F^{I} be the output given by a certain set of feature maps based on image I.

Now, if you were only looking to recreate content from I_c, you would be minimizing:

\Lambda(F^{I_n}, F^{I_c})

If you were only interested in the style from I_s, you would minimize:

\Lambda(G(F^{I_n}), G(F^{I_s}))

Combining the two, you get a new function for minimizing:

\alpha*\Lambda(F^{I_n}, F^{I_c}) + \beta*\Lambda(G(F^{I_n}), G(F^{I_s}))

\alpha and \beta are the relative weights you give to the content and the style respectively.
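Putting it together, here is a hedged sketch of the combined objective in TensorFlow 2.x terms. The feature tensors and the default values of alpha and beta below are purely illustrative stand-ins, with mean-squared error playing the role of \Lambda:


#Sketch of the combined content + style objective (assuming TensorFlow 2.x).
import tensorflow as tf

def gram(F):
    #F: (h, w, n) feature maps -> (n, n) Gram matrix
    flat = tf.reshape(F, (-1, F.shape[-1]))
    return tf.matmul(flat, flat, transpose_a=True)

def total_loss(Fn_content, Fc, Fn_style, Fs, alpha=1.0, beta=1000.0):
    #alpha weights the content term, beta the style term
    content_term = tf.reduce_mean(tf.square(Fn_content - Fc))
    style_term = tf.reduce_mean(tf.square(gram(Fn_style) - gram(Fs)))
    return alpha * content_term + beta * style_term

#Tiny demo with random stand-in feature maps
Fc = tf.random.uniform((16, 16, 8))
Fs = tf.random.uniform((16, 16, 8))
Fn = tf.random.uniform((16, 16, 8))
print(float(total_loss(Fn, Fc, Fn, Fs)))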

The tiles shown below depict output from the same convolutional layer, but with higher values of \alpha / \beta as you go to the right:

[Image: output tiles with increasing \alpha / \beta from left to right]

Pretty cool, isn’t it?


Predicting Trigonometric Waves a few steps ahead with LSTMs in TensorFlow

I have recently been revisiting my study of Deep Learning, and I thought of doing some experiments with Wave prediction using LSTMs. This is nothing new, just more of a log of some tinkering done using TensorFlow.

The Problem

The basic input to the model is a 2-D vector, each number corresponding to the current value of one of two waves. Each wave in turn is (a constant + a sine wave + a cosine wave). The waves themselves have different magnitudes, initial phases and frequencies. The goal is to predict the values that will be attained a certain number of steps (I chose 23) ahead on the curve.

So first off, here's the wave-generation code:


##Producing Training/Testing inputs+output
from numpy import array, sin, cos, pi
from random import random

#Random initial angles
angle1 = random()
angle2 = random()

#The total 2*pi cycle would be divided into 'frequency'
#number of steps
frequency1 = 300
frequency2 = 200
#This defines how many steps ahead we are trying to predict
lag = 23

def get_sample():
    """
    Returns a [[sin value, cos value]] input.
    """
    global angle1, angle2
    angle1 += 2*pi/float(frequency1)
    angle2 += 2*pi/float(frequency2)
    angle1 %= 2*pi
    angle2 %= 2*pi
    return array([array([
        5 + 5*sin(angle1) + 10*cos(angle2),
        7 + 7*sin(angle2) + 14*cos(angle1)])])


sliding_window = []

#Pre-fill the window so that, after the append in get_pair, the newest
#sample is exactly 'lag' steps ahead of the oldest one
for i in range(lag):
    sliding_window.append(get_sample())


def get_pair():
    """
    Returns a (current, later) pair, where 'later' is 'lag'
    steps ahead of the 'current' on the wave(s) as defined by the
    frequency.
    """

    global sliding_window
    sliding_window.append(get_sample())
    input_value = sliding_window[0]
    output_value = sliding_window[-1]
    sliding_window = sliding_window[1:]
    return input_value, output_value

Essentially, you just need to call get_pair to get an (input, output) pair, with the output being 23 time steps ahead on the curve. Each has a NumPy shape of [1, 2]. The leading '1' means the batch size is 1: we feed one input at a time while training/testing.
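For a quick sanity check (assuming the snippet above has already been run):


#Shapes of one (input, output) pair from get_pair
inp, out = get_pair()
print(inp.shape, out.shape)   #both (1, 2)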

Now, I don't pass the input directly into the LSTM. I try to improve the LSTM's understanding of the input by also providing its first and second derivatives. So, if the input at time t is x(t), the first derivative is x'(t) = x(t) - x(t-1). Following the analogy, x''(t) = x'(t) - x'(t-1). Here's the code for that:


#Input Params
input_dim = 2

#To maintain state
last_value = array([0 for i in range(input_dim)])
last_derivative = array([0 for i in range(input_dim)])


def get_total_input_output():
    """
    Returns the overall Input and Output as required by the model.
    The input is a concatenation of the wave values, their first and
    second derivatives.
    """
    global last_value, last_derivative
    raw_i, raw_o = get_pair()
    raw_i = raw_i[0]
    l1 = list(raw_i)
    derivative = raw_i - last_value
    l2 = list(derivative)
    last_value = raw_i
    l3 = list(derivative - last_derivative)
    last_derivative = derivative
    return array([l1 + l2 + l3]), raw_o

So the overall input to the model becomes a concatenated version of x(t), x'(t) and x''(t). The obvious question to ask would be: why not do this in the TensorFlow Graph itself? I did try it, and for some reason (which I don't understand yet), some noise seems to seep into the Variables that act as memory units to maintain state.

But anyways, here’s the code for that too:


#Imports
import tensorflow as tf
from tensorflow.models.rnn.rnn import *

#Input Params
input_dim = 2


##The Input Layer as a Placeholder
#Since we will provide data sequentially, the 'batch size'
#is 1.
input_layer = tf.placeholder(tf.float32, [1, input_dim])

##First Order Derivative Layer
#This will store the last recorded value
last_value1 = tf.Variable(tf.zeros([1, input_dim]))
#Subtract last value from current
sub_value1 = tf.sub(input_layer, last_value1)
#Update last recorded value
last_assign_op1 = last_value1.assign(input_layer)

##Second Order Derivative Layer
#This will store the last recorded derivative
last_value2 = tf.Variable(tf.zeros([1, input_dim]))
#Subtract last value from current
sub_value2 = tf.sub(sub_value1, last_value2)
#Update last recorded value
last_assign_op2 = last_value2.assign(sub_value1)

##Overall input to the LSTM
#x and its first and second order derivatives as outputs of
#earlier layers
zero_order = last_assign_op1
first_order = last_assign_op2
second_order = sub_value2
#Concatenated
total_input = tf.concat(1, [zero_order, first_order, second_order])

If you have an idea of what might be going wrong, do leave a comment! In any case, the core model follows.


The Model

So here's the TensorFlow model:

1) The Imports:


#Imports
import tensorflow as tf
from tensorflow.models.rnn.rnn import *


2) Our input layer, as always, will be a Placeholder instance with the appropriate type and dimensions:


#Input Params
input_dim = 2

##The Input Layer as a Placeholder
#Since we will provide data sequentially, the 'batch size'
#is 1.
input_layer = tf.placeholder(tf.float32, [1, input_dim*3])


3) We then define our LSTM layer. If you are new to Recurrent Neural Networks or LSTMs, here are two excellent resources:

  1. This blog post by Christopher Olah
  2. This deeplearning.net post. It defines the math behind the LSTM cell pretty succinctly.

If you'd like to see implementation-level details too, here's the relevant portion of the TensorFlow source.

Now the LSTM layer:


##The LSTM Layer-1
#The LSTM Cell initialization
lstm_layer1 = rnn_cell.BasicLSTMCell(input_dim*3)
#The LSTM state as a Variable initialized to zeroes
lstm_state1 = tf.Variable(tf.zeros([1, lstm_layer1.state_size]))
#Connect the input layer and initial LSTM state to the LSTM cell
lstm_output1, lstm_state_output1 = lstm_layer1(input_layer, lstm_state1,
                                              scope="LSTM1")
#The LSTM state will get updated
lstm_update_op1 = lstm_state1.assign(lstm_state_output1)

We only use one LSTM layer. Providing a scope to the LSTM layer call (the scope="LSTM1" argument) helps in avoiding variable-scope conflicts if you have multiple LSTM layers.

The LSTM layer is followed by a simple linear regression layer, whose output becomes the final output.


##The Regression-Output Layer1
#The Weights and Biases matrices first
output_W1 = tf.Variable(tf.truncated_normal([input_dim*3, input_dim]))
output_b1 = tf.Variable(tf.zeros([input_dim]))
#Compute the output
final_output = tf.matmul(lstm_output1, output_W1) + output_b1


We have finished defining the model itself. But now, we need to initialize the training components. These help fine-tune the parameters/state of the model to make it ready for deployment. We won’t be using these components post training (ideally).


4) First, a Placeholder for the correct output associated with the input:


##Input for correct output (for training)
correct_output = tf.placeholder(tf.float32, [1, input_dim])

Then, the error will be computed using the LSTM output and the correct output as the Sum-of-Squares loss.


##Calculate the Sum-of-Squares Error
error = tf.pow(tf.sub(final_output, correct_output), 2)

Finally, we initialize an Optimizer to adjust the weights of the LSTM layer. I tried Gradient Descent, RMSProp and Adam optimization. Adam works best for this model; Gradient Descent performs really badly on LSTMs for some reason (that I can't grasp right now). If you want to read more about Adam optimization, read this paper. I settled on a learning rate of 0.0006 after a lot of trial and error, and it seems to work best for the number of iterations I use (100k).


##The Optimizer
#Adam works best
train_step = tf.train.AdamOptimizer(0.0006).minimize(error)


5) Finally, we initialize the Session and all required Variables as always.


##Session
sess = tf.Session()
#Initialize all Variables
sess.run(tf.initialize_all_variables())

The Training

Here’s the rudimentary code I used for training the model:


##Training

actual_output1 = []
actual_output2 = []
network_output1 = []
network_output2 = []
x_axis = []


for i in range(80000):
    input_v, output_v = get_total_input_output()
    _, _, network_output = sess.run([lstm_update_op1,
                                     train_step,
                                     final_output],
                                    feed_dict = {
                                        input_layer: input_v,
                                        correct_output: output_v})

    actual_output1.append(output_v[0][0])
    actual_output2.append(output_v[0][1])
    network_output1.append(network_output[0][0])
    network_output2.append(network_output[0][1])
    x_axis.append(i)

import matplotlib.pyplot as plt
plt.plot(x_axis, network_output1, 'r-', x_axis, actual_output1, 'b-')
plt.show()
plt.plot(x_axis, network_output2, 'r-', x_axis, actual_output2, 'b-')
plt.show()

Training takes almost a minute on my Intel i5 machine.

Consider the first wave. Initially, the network output is far from the correct one (the red curve is the LSTM output):

[Plot: wave 1 early in training, LSTM output (red) vs. actual output (blue)]

But by the end, it fits pretty well:

[Plot: wave 1 at the end of training, with the LSTM output closely tracking the actual wave]

Similar trends are seen for the second wave:

[Plot: wave 2, LSTM output vs. actual output over the course of training]

Testing

In practical scenarios, the state at which you end training will rarely be the state at which you deploy. Therefore, prior to testing, I 'fast-forward' both the waves first. Then, I flush the contents of the LSTM cell (mind you, the learned weight matrices inside the LSTM don't change).


##Testing

for i in range(200):
    get_total_input_output()

#Flush LSTM state
sess.run(lstm_state1.assign(tf.zeros([1, lstm_layer1.state_size])))

And here’s the rest of the testing code:


actual_output1 = []
actual_output2 = []
network_output1 = []
network_output2 = []
x_axis = []


for i in range(1000):
    input_v, output_v = get_total_input_output()
    _, network_output = sess.run([lstm_update_op1,
                                  final_output],
                                 feed_dict = {
                                     input_layer: input_v,
                                     correct_output: output_v})

    actual_output1.append(output_v[0][0])
    actual_output2.append(output_v[0][1])
    network_output1.append(network_output[0][0])
    network_output2.append(network_output[0][1])
    x_axis.append(i)

import matplotlib.pyplot as plt
plt.plot(x_axis, network_output1, 'r-', x_axis, actual_output1, 'b-')
plt.show()
plt.plot(x_axis, network_output2, 'r-', x_axis, actual_output2, 'b-')
plt.show()

It's pretty similar to the training code, except for one small difference: I don't run the training op anymore, so those components of the Graph aren't executed at all.

Here’s the correct output with the model’s output for the first wave:

[Plot: wave 1 at test time, LSTM output vs. actual output]

And the second wave:

[Plot: wave 2 at test time, LSTM output vs. actual output]

That's all for now! I am not a deep learning expert, and I am still experimenting with RNNs, so do leave comments/suggestions if you have any! Cheers!