Since my last blog post on Google Translate, I have been reading the earlier articles on Google’s Research Blog. Their work on generative AI particularly caught my eye: they have tried building models that create art and imagery using deep learning.
In this post, I attempt to give an intuitive explanation of this paper: A Neural Algorithm of Artistic Style by Gatys, Ecker and Bethge. The aim of this work is pretty similar to what Prisma actually does, i.e. combining the content of one image with the artistic style of another to fabricate a new image. On the way, we will also get a glimpse into how DeepDream works.
Convolutional Neural Networks
Before we delve into the creation of images, let’s get a high-level understanding of how deep learning typically understands them. Convolutional Neural Networks (CNNs) are state-of-the-art when it comes to image analysis. Assuming you know what a basic Neural Network is, here’s a simplified depiction of a Convolutional Network:
Layers 1 & 2 are what make CNNs special; the final ‘classifier’ is just a standard fully-connected network.
Layers 1 and 2 each perform two different operations on their input: Convolution, followed by Pooling.
In the Convolution step, we compute a set of Feature Maps using the previous layer. A Feature Map typically has the same dimensions as the input ‘image’, but there’s a difference in the way its neurons are connected to the preceding layer. Each one is only connected to a small local area around its position (see image). What’s more, the set of weights that every neuron uses is the same. This set of shared weights is also called a filter.
Intuitively, you can say that each node in the Feature Map is essentially looking for the same concept, but in a limited area. This gives CNNs a very powerful trait: the ability to detect features irrespective of their position in the actual image. Since every neuron is trained to detect the same entity (shared weights), one or the other will fire in case the corresponding object happens to be in the input, irrespective of its exact location. Also worth noting is the fact that neighboring neurons in the Map analyze partially overlapping portions of the previous layer, so we haven’t really done any hard ‘segmentation’.
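To make the shared-weights idea concrete, here is a minimal NumPy sketch of a single Feature Map. The `plus_filter` values and the toy 8x8 images are made up for illustration (a real CNN learns its filters during training), and the helper names are hypothetical.

```python
import numpy as np

def feature_map(image, kernel):
    """Slide one shared filter over the image (stride 1, zero padding).

    Every output neuron reuses the same weights (`kernel`) and only sees
    a small patch of the input centred on its own position.
    """
    kh, kw = kernel.shape
    padded = np.pad(image, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros(image.shape, dtype=float)
    for r in range(image.shape[0]):
        for c in range(image.shape[1]):
            patch = padded[r:r + kh, c:c + kw]
            out[r, c] = np.sum(patch * kernel)
    return out

# A made-up 3x3 filter that responds most strongly to a "plus" shape.
plus_filter = np.array([[-1.0, 1.0, -1.0],
                        [ 1.0, 1.0,  1.0],
                        [-1.0, 1.0, -1.0]])

def image_with_plus(row, col, size=8):
    """Blank image with a single plus sign centred at (row, col)."""
    img = np.zeros((size, size))
    img[row, col - 1:col + 2] = 1.0
    img[row - 1:row + 2, col] = 1.0
    return img

for centre in [(2, 2), (5, 4)]:
    fmap = feature_map(image_with_plus(*centre), plus_filter)
    strongest = np.unravel_index(np.argmax(fmap), fmap.shape)
    print("plus at", centre, "-> strongest response at", strongest)
```

Wherever the plus sign is placed, the neuron sitting on top of it responds most strongly, even though no weight was changed between the two runs.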
In the set of Feature Maps at a particular level, each one looks for its own concept, which it learnt during training. As you go higher up the layers, these sets of Maps start looking for progressively higher-level objects. The first set (in the lowest layer) might look for lines/circles/curves, the next one might detect shapes of eyes/noses/etc, while the topmost layers will ultimately understand complete faces (an over-simplification, but you get the idea). Something like this:
Pooling – You can think of Pooling as a sort of compression operation. What we basically do is divide each Feature Map into a set of non-overlapping ‘boxes’ and replace each box with a representative based on the values inside it. This representative could either be the maximum value (called Max-Pooling) or the mean (called Average-Pooling). The intuition behind this step is to reduce noise and retain the most interesting parts of the data (or summarize it) to provide to the next layer. It also allows later layers to analyze larger portions of the image without having to increase the filter size.
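As a rough illustration of the two variants, the sketch below pools a small hand-written Feature Map with 2x2 boxes. The function name `pool` and the numbers in `fmap` are hypothetical; real frameworks offer their own pooling layers.

```python
import numpy as np

def pool(feature_map, box=2, mode="max"):
    """Replace each non-overlapping box x box tile with one representative."""
    h, w = feature_map.shape
    # Drop any ragged edge so the map divides evenly into boxes.
    trimmed = feature_map[:h - h % box, :w - w % box]
    tiles = trimmed.reshape(h // box, box, w // box, box)
    if mode == "max":
        return tiles.max(axis=(1, 3))
    if mode == "mean":
        return tiles.mean(axis=(1, 3))
    raise ValueError("mode must be 'max' or 'mean'")

fmap = np.array([[1, 3, 0, 2],
                 [5, 7, 4, 6],
                 [9, 8, 1, 1],
                 [0, 4, 3, 5]], dtype=float)

print(pool(fmap, mode="max"))    # [[7. 6.] [9. 5.]]
print(pool(fmap, mode="mean"))   # [[4. 3.] [5.25 2.5]]
```

The 4x4 map shrinks to 2x2, so a 3x3 filter in the next layer effectively covers a larger region of the original image.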
Typical CNNs used in deep learning have multiple such Convolution + Pooling layers, each caring less and less about the actual pixel values and more about the general content of the image. In a typical scenario, the Feature Maps at Layer $L+1$ take inputs from all the compressed/pooled maps from Layer $L$. Moreover, the number of Feature Maps at each layer is not a constant, and is usually decided by trial-and-error (as are most design decisions in Machine Learning).
Recreating the Content of an Image
Neural networks in general have a very handy property: the ability to work in reverse (well, sort of). Basically, “How do I change the current input so that it yields a certain output?”. Let’s see how.
Consider a CNN $C$, trained to recognize animals in input images. Given a genuine photo of a dog, $C$ might be able to classify it correctly by virtue of its convolutional layers and the final classifier. But now suppose I show it an image of just… clouds. Forget the final classifier; the intermediate layers are more interesting here. Since $C$ was originally trained to look for features of animals, that is exactly what it will try to do here! It might interpret random clouds and shapes as animals or parts of animals: a form of artificial pareidolia (the psychological phenomenon of perceiving patterns where none exist).
You can actually visualize what a particular layer of the CNN interprets from the image. Suppose the original cloud image was $A$:
Say at a certain level $L$ of $C$, the Feature Maps gave an output $C(A, L)$ based on $A$.
What we will do now is provide $C$ with a white-noise image $B$:
This sort of works like a blank slate for $C$, since it has no real information to interpret (though $C$ can still ‘see’ patterns, but very, very vaguely). Now, using the process of Gradient Descent, we can modify $B$ so that it yields an output close to $C(A, L)$ at level $L$.
What this essentially does is iteratively shift the pixel values of $B$ until its output at $L$ is similar to that of $A$. One key point: even after the end of this process, $B$ will not really become the same as $A$. Think about it: you have recreated $A$ based on the CNN’s interpretation of $A$, which involves a lot of intermediate convolutions and pooling. The higher the level you choose for re-creating the image, the deeper the pareidolia based on the CNN’s training, or the more ‘abstract’ the interpretations.
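Here is a toy sketch of this "optimize the input, not the weights" idea. It does not use a real CNN: as a loudly stated assumption, the network up to the chosen level is replaced by one fixed random linear layer followed by tanh, and images are flattened to 16 numbers. The variable names `A` (original image) and `B` (white-noise image) mirror the description above; everything else is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for "the CNN up to the chosen level": a fixed, already-trained
# mapping. Here it is just one random linear layer plus tanh (an assumption
# made purely to keep the sketch tiny); the weights W are never updated.
W = rng.normal(size=(6, 16)) / 4.0

def features(image):
    return np.tanh(W @ image)

A = rng.uniform(size=16)          # the "original" image, flattened
target = features(A)              # what the network 'sees' in A
B = rng.uniform(size=16)          # white-noise starting point

def loss(image):
    return 0.5 * np.sum((features(image) - target) ** 2)

initial_loss = loss(B)
learning_rate = 0.3
for _ in range(2000):
    f = features(B)
    # Chain rule through tanh and the linear layer, with respect to B.
    grad = W.T @ ((f - target) * (1.0 - f ** 2))
    B -= learning_rate * grad     # change the input, not the weights

print("loss before:", initial_loss, "after:", loss(B))
print("distance between B and A:", np.linalg.norm(B - A))
```

The features of `B` end up matching those of `A`, yet `B` itself stays visibly different from `A`: the stand-in network squeezes 16 pixel values into 6 feature values, so many different inputs produce the same interpretation.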
In fact, this is pretty similar to what DeepDream does for understanding what a deep CNN has ‘learnt’ from its training. The cloud image I showed earlier was indeed used with a CNN trained to recognize animals, leading to some pretty weird imagery:
Now, the paper we use as reference wants to recreate the content of an image pretty accurately, so how do we avoid such misinterpretation of shapes? The answer lies in the use of a powerful CNN trained to recognize a wide variety of objects, like the one developed by Oxford’s Visual Geometry Group (VGG): VGGNet. VGGNet is freely available online, pre-trained and ready-made (TensorFlow example).
Recreating the Style of an Image
In the last section, we saw that the output from Feature Maps at a certain level ($L$) could be used as a ‘goal’ to recreate an image with conceptually similar content. But what about style or texture?
Intuitively speaking, the style of an image is not so much about the actual objects in it as about the co-occurrence of features/shapes in the overall visual (Reference). This idea is quantified by the Gramian matrix with respect to the Feature Maps: $G$.
Suppose we have $N$ different Feature Maps at level $L$ of CNN $C$. $G$ is a matrix of dimensions $N \times N$, with the element at position $(i, j)$ being the inner product between Feature Maps $i$ and $j$. Quoting an answer from this Stack-Exchange question, “the inner product between $x$ and $y$ is indicative of how much of $x$ could be described using $y$”. Essentially, in this case, it quantifies how similar the trends are between the numbers present in Feature Maps $i$ and $j$ (“do triangles and circles occur together in this image?”).
Thus, $G$ is used as the Gradient-Descent ‘goal’ instead of the Feature Map output itself while re-creating the artistic style of a photo/image.
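A minimal sketch of the Gramian computation, assuming the Feature Maps at one level are stored as an array of shape `(N, height, width)`. The random values stand in for real CNN activations, and `gram_matrix` is a hypothetical helper name.

```python
import numpy as np

def gram_matrix(feature_maps):
    """feature_maps has shape (N, height, width); returns the N x N Gramian."""
    n = feature_maps.shape[0]
    flat = feature_maps.reshape(n, -1)   # one row per Feature Map
    return flat @ flat.T                 # entry (i, j) = <map_i, map_j>

rng = np.random.default_rng(0)
maps = rng.uniform(size=(3, 4, 4))       # 3 made-up Feature Maps of size 4x4
G = gram_matrix(maps)

# Move every spatial position to a new place, identically in all maps.
order = rng.permutation(16)
shuffled = maps.reshape(3, -1)[:, order].reshape(3, 4, 4)

print(G.shape)                                   # (3, 3)
print(np.allclose(G, gram_matrix(shuffled)))     # True
```

Shuffling where things are in the image leaves the Gramian unchanged, which is one way to see why it captures which features occur together rather than where objects sit.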
As you will notice, higher layers tend to reproduce more complex and detailed strokes from the original image. This could be attributed to the capture of more high-level details by virtue of feature-extraction and pooling in the Convolutional Network.
Combining Content and Style from two different Images
That brings us to the final part – combining the above two concepts to achieve something like this:
Gradient Descent always considers a target ‘error function’ to minimize while performing optimization. Given two vectors $u$ and $v$, let this function be denoted by $E(u, v)$.
Suppose you want to generate an image that has the content of image $P$ in the style of image $S$. The white-noise image you start out with is $X$. Let $C(I, L)$ be the output given by the set of Feature Maps at level $L$ based on an image $I$, and let $G(I, L)$ be the Gramian matrix computed from that output.
Now, if you were only looking to recreate content from $P$, you would be minimizing:
$$E\big(C(X, L),\, C(P, L)\big)$$
If you were only interested in the style from $S$, you would minimize:
$$E\big(G(X, L),\, G(S, L)\big)$$
Combining the two, you get a new function for minimizing:
$$\alpha \, E\big(C(X, L),\, C(P, L)\big) + \beta \, E\big(G(X, L),\, G(S, L)\big)$$
$\alpha$ and $\beta$ are basically the weights you give to the content and style respectively.
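The combined objective can be sketched in a few lines. This is only an illustration under stated assumptions: the error function is taken to be the mean squared difference, the Feature Maps are random placeholders rather than outputs of a real CNN, and the names `total_loss`, `alpha` (content weight) and `beta` (style weight) are chosen for this example.

```python
import numpy as np

def E(u, v):
    """Error between two arrays; mean squared difference is assumed here."""
    return float(np.mean((np.asarray(u) - np.asarray(v)) ** 2))

def gram(feature_maps):
    """N x N matrix of inner products between flattened Feature Maps."""
    flat = feature_maps.reshape(feature_maps.shape[0], -1)
    return flat @ flat.T

def total_loss(feats_x, feats_content, feats_style, alpha, beta):
    """alpha * content error + beta * style error, for Feature Maps at one level."""
    content_term = E(feats_x, feats_content)
    style_term = E(gram(feats_x), gram(feats_style))
    return alpha * content_term + beta * style_term

rng = np.random.default_rng(1)
# Hypothetical Feature Maps (3 maps of size 4x4) for the three images.
feats_content = rng.uniform(size=(3, 4, 4))
feats_style = rng.uniform(size=(3, 4, 4))
feats_x = rng.uniform(size=(3, 4, 4))

print(total_loss(feats_x, feats_content, feats_style, alpha=1.0, beta=0.0))
print(total_loss(feats_x, feats_content, feats_style, alpha=0.0, beta=1.0))
# An image whose features already equal the content image's has zero content error.
print(total_loss(feats_content, feats_content, feats_style, alpha=1.0, beta=0.0))
```

In a full implementation, Gradient Descent would repeatedly adjust the pixels of the generated image to reduce this value; the sketch only evaluates the objective for fixed features.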
The tiles shown below depict output from the same convolutional layer, but with higher values of the content-to-style weight ratio $\alpha / \beta$ as you go to the right:
Pretty cool, isn’t it?