Residual Neural Networks as Ensembles

In a previous blog post, I mentioned Residual Connections in the context of Google Neural Machine Translation. I was not completely familiar with the intuition behind Residual Networks then, so here's a short post on what I gathered after reading some literature.

Residual Networks first got attention after this paper – “Deep Residual Learning for Image Recognition” by some folks from Microsoft Research. They used what they called residual connections in Convolutional Neural Networks, obtaining 1st-place results at ILSVRC 2015. However, it turns out that their notion of why the networks achieved such great performance might have been flawed.

The original authors thought that the new connections allowed training of very deep neural networks (their state-of-the-art model had 152 layers) by ‘preserving’ gradient across layers during training. This, according to them, solved the Vanishing Gradient Problem. However, the paper “Residual Networks Behave Like Ensembles of Relatively Shallow Networks” challenges this, attempting to prove that Residual NNs actually behave like ensembles of neural networks!

Let's take this step by step.

First, let's establish some nomenclature. In a classic neural network, suppose the output from the nth layer is written as y_n. Mathematically, you can write:

y_n = f_n(y_{n-1}) …(1)

Here, f_n is a vectorial function that computes y_n from y_{n-1}. In a standard neural network, the mathematical computation performed by f is the same across all layers (like the ReLU). What differentiates any f_n from the other layers is the set of weights assigned to every element of y_{n-1} towards computing elements of y_n. These weights are of course learnt during training.

A typical layer in a Residual NN looks like this (taken from the original paper):

[Figure: a residual block, from the original paper]

Based on our nomenclature, the expression for y_n now becomes:

y_n = f_n(y_{n-1}) + y_{n-1} …(2)

The difference, as you must have noticed, lies in the second term on the Right-Hand-Side. This term is what they call the identity skip (or short-cut) connection. ‘Skip’, because you are skipping the f_n computation, and ‘identity’ because you are not multiplying y_{n-1} by any set of weights before addition.

These identity short-cut connections are what make residual NNs special.
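To make equation (2) concrete, here's a minimal numpy sketch of one residual layer. The weight matrix W and the ReLU here are just stand-ins for whatever f_n actually computes (in the original paper, f_n is a small stack of convolutional layers):

import numpy as np

def relu(v):
    # element-wise ReLU non-linearity
    return np.maximum(v, 0)

def residual_layer(y_prev, W):
    # f_n(y_{n-1}): an ordinary learned transformation
    f_out = relu(W.dot(y_prev))
    # equation (2): add the identity skip-connection
    return f_out + y_prev

# toy usage: a 4-dimensional activation passed through one residual layer
W = np.random.randn(4, 4) * 0.1
y0 = np.random.randn(4)
y1 = residual_layer(y0, W)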

‘Unravelling’ a Residual Neural Network

The reference paper’s biggest insight lies in the way they look at equation (2) – that is, as a recursive definition. They call this unravelling the NN.

Let's consider a simple residual NN with three layers. The input can be denoted as y_0, with the outputs from the three subsequent layers being y_1, y_2 and y_3 respectively. y_3 is the output of the NN as a whole.

[Figure: a simple three-layer residual network]

Expanding equation (2) step-by-step, we would get

y_3 = f_3(y_2) + y_2 …(3)

y_3 = f_3(f_2(y_1) + y_1) + f_2(y_1) + y_1 …(4)

y_3 = f_3(f_2(f_1(y_0) + y_0) + f_1(y_0) + y_0) + f_2(f_1(y_0) + y_0) + f_1(y_0) + y_0 …(5)

Equation (5) essentially defines y_3 as the sum of outputs from every single layer (including y_0) of the neural network.

Contrast this with what we would see if it was a standard network:

y_3 = f_3(f_2(f_1(y_0))) …(6)

(Of course, the weights learnt would be different in each case.)

Consider a situation where y_0 and the weights for f_2 are such that f_2(y_1) turns out to be a vector of very low values.

If this happens in equation (6), you see that there is very little f_3 can do. Since there is only one term on the RHS, any input basically follows one path through the NN.

But now consider equation (5). If f_2(y_1) is close to a zero vector, we still get

y_3 = f_3(f_1(y_0) + y_0) + f_1(y_0) + y_0

It looks like the input signal bypassed the second layer completely on its way to the output! In fact, if we did not have f_2 at all, you would derive the above expression for the network. This flexibility (i.e. the ability to bypass any layer) is provided to the NN by the skip-connections.

And the ability to ‘skip’ is not just restricted to one layer per run – depending on the input and the weights at every level, any combination of them could be on/off for a given case. The name residual can be interpreted better now – residual can mean ‘unused’, which is pretty much what f_2 is in the above example! That is not the case for equation (6), since the network does not have the choice to ignore f_2’s output in the single term on the RHS.

Diagrammatically, you could ‘unravel’ the network like this:

[Figure: the ‘unravelled’ view of the three-layer residual network]

Each ‘path’ in the above figure denotes some combination of the layers working on y_0 as it makes its way to the output. Every node denotes convergence (addition) of a set of paths, before passing on the sum to the next layer.

Ideally, every output would get significant contributions from every one of those paths. But that is not usually the case, as observed by the authors of the paper.

Now let's see what happens to the figure the moment you switch off f_2, cutting off all corresponding paths:

[Figure: the unravelled network with f_2 switched off]

As you can see, even though f_2 makes no contributions, the input still makes its way through the remaining paths to generate output.

Intuitively, you can see why the authors think of this structure as an ensemble. At every single node (including the last one), there are multiple paths, any of which may or may not contribute to the overall output for a given scenario. You can say that every y_0 activates a unique set of paths in a residual NN.

And how many such paths are there? It's equivalent to the total number of cases resulting from every one of the layers being either on or off. Simple combinatorics will tell you that it's 2^N, where N is the total number of layers in the NN.
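A quick way to convince yourself of the 2^N count – and of the 'sum over paths' picture – is to use toy linear layers, for which the unravelled expansion is exact (the coefficients below are arbitrary, not learnt):

from itertools import product

# toy *linear* layers f_i(y) = a_i * y, so the unravelled sum is exact
coeffs = [0.5, -0.2, 0.3]
y0 = 1.0

# run the residual network: y_n = f_n(y_{n-1}) + y_{n-1}
y = y0
for a in coeffs:
    y = a * y + y

# enumerate every path: at each block the signal either takes f_i ('on')
# or the identity skip ('off')
paths = list(product([0, 1], repeat=len(coeffs)))
print(len(paths))                  # 2^3 = 8 paths

path_sum = 0.0
for path in paths:
    contribution = y0
    for a, on in zip(coeffs, path):
        contribution *= a if on else 1.0
    path_sum += contribution

print(y, path_sum)                 # the same number (up to floating-point error):
                                   # the output is the sum over all 8 paths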

Unusual properties of Residual NNs

Thinking of residual NNs as ensembles motivated the authors to conduct some experiments to test their hypothesis. Essentially, what they tried to do is see which properties of ensembles residual NNs satisfy:

1. Resilience to layer deletion: In most Neural Networks (the ones that resemble equation (1)), deleting a layer has disastrous results on the outputs – whatever the size of the overall network may be. And for good reason, since you are disrupting the one term that corresponds to the output.

But that is not the case for Residual Neural Networks. In fact, deleting 1 (or even 2 or 3) layers in large residual NNs increases the error by only around 6-7%! Moreover, deleting more and more layers has a pretty smooth (as in mathematically smooth) effect on the total error:

[Figure: error as a function of the number of deleted layers]

This is pretty close to how an ensemble behaves – deleting models from an ensemble does introduce error, but the increase is never drastic with respect to the number of models removed.

This can be explained easily by looking at equation (5). Even if you delete one layer from a residual NN, that still leaves 2^{N-1} of the paths contributing to the output – and since those paths never went through the deleted layer, they yield exactly the same result as they would have with the layer in.

2. Shortening of effective paths: This is actually contradictory to what people first believed about residual NNs – that they promote deeper networks.

During training, the authors observed that the updates were not happening uniformly across all layers, as they would for normal NNs. Every training point would adjust the weights along a specific set of ‘paths’ as shown in the unravelled diagram. And most of these paths, even with 152-layer deep networks, were only 20-30 levels deep!

Even at prediction time (on-line), every input activated only a specific set of paths, without taking significant contributions from every single layer and path in the network.

This is where the biggest revelation lies: Residual NNs work better not by making the effective paths longer, but by actually keeping them short! What works here is that every input has a chance to take its own unique set of paths to the output, without having to go through every single layer.

This is again similar to how ensembles are trained and run. During training, you won’t observe all smaller models in an ensemble getting significantly adjusted for every training point. On-line, there will always only be a subset of models that give a strong output for any input.

 

That's it for now! Do read the reference paper if you feel interested, or go through the original paper to see their usage in the image-recognition scenario.

How Neural Networks generate Visual Art from inspiration

Since my last blog post on Google Translate, I have been reading the earlier articles on Google’s Research Blog. Their work on generative AI particularly caught my eye, where they have tried building models to create art/imagery using deep learning.

Announced back in 2015, DeepDream has fascinated a lot of people with its ability to interpret images in unexpected ways, 'dreaming up' complicated visuals where none exist.

[Image: a DeepDream-generated visual]

Talking about creating beautiful pictures, we also had apps like Prisma and DeepForger that transformed user-given photos in the manner of famous art styles to create some stunning work.

[Image: a photo restyled in the manner of Picasso]

In this post, I attempt to give an intuitive explanation for this paper: A Neural Algorithm of Artistic Style by Gatys, Ecker and Bethge. The aim of this work is pretty similar to what Prisma actually does, i.e. combining the content from one image with the artistic style of another to fabricate a new image. On the way we will also get some glimpse into how DeepDream works.

Convolutional Neural Networks

Before we delve into the creation of images, let's get a high-level understanding of how deep learning typically understands them. Convolutional Neural Networks (CNNs) are state-of-the-art when it comes to image analysis. Assuming you know what a basic Neural Network is, here's a simplified depiction of a Convolutional Network:

[Figure: a simplified depiction of a Convolutional Neural Network]

Layers 1 & 2 are what make CNNs special; the final ‘classifier’ is just a standard fully-connected network.

Layers 1 and 2 each perform two different operations on their input:

  1. Convolution
  2. Pooling

In the Convolution step, we compute a set of Feature Maps using the previous layer. A Feature Map typically has the same dimensions as the input 'image', but there's a difference in the way its neurons are connected to the preceding layer. Each one is only connected to a small local area around its position (see image). What's more, the set of weights that every neuron uses is the same. This set of shared weights is also called a filter.

[Figure: the local connectivity of neurons in a Feature Map]

Intuitively, you can say that each node in the Feature Map is essentially looking for the same concept, but in a limited area. This gives CNNs a very powerful trait: the ability to detect features irrespective of their position in the actual image. Since every neuron is trained to detect the same entity (shared weights), one or the other will fire in case the corresponding object happens to be in the input – irrespective of the exact location. Also worth noting is the fact that neighboring neurons in the Map analyze partially intersecting portions of the previous layer, so we haven't really done any hard 'segmentation'.

In the set of Feature Maps at a particular level, each one looks for its own concept, learnt during training. As you go higher and higher up the overall layers, these sets of Maps start looking for progressively higher-level objects. The first set (in the lowest layer) might look for lines/circles/curves, the next one might detect shapes of eyes/noses/etc, while the topmost layers will ultimately understand complete faces (an over-simplification, but you get the idea). Something like this:

[Figure: progressively higher-level features learnt across CNN layers]

Pooling – You can think of Pooling as a sort of compression operation. What we basically do is divide each Feature Map into a set of non-overlapping ‘boxes’ and replace each box with a representative based on the values inside it. This representative could either be the maximum value (called Max-Pooling) or the mean (called Average-Pooling). The intuition behind this step is to reduce noise and retain the most interesting parts of the data (or summarize it) to provide to the next layer. It also allows the future layers to analyze larger portions of the image without having to increase filter size.

[Figure: the Pooling operation]
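As a rough sketch of the two operations (ignoring strides, padding, multiple channels and every other practical detail), here's what a single 3x3 filter followed by 2x2 Max-Pooling might look like in plain numpy:

import numpy as np

def convolve2d(image, filt):
    # slide a small filter over the image; every output cell uses the
    # same shared weights, but looks at a different local patch
    fh, fw = filt.shape
    oh, ow = image.shape[0] - fh + 1, image.shape[1] - fw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+fh, j:j+fw] * filt)
    return out

def max_pool(fmap, size=2):
    # replace each non-overlapping size x size box with its maximum value
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = fmap[i*size:(i+1)*size, j*size:(j+1)*size].max()
    return out

image = np.random.rand(9, 9)
filt = np.random.randn(3, 3)           # one 'filter' = one feature map's shared weights
feature_map = convolve2d(image, filt)  # 7 x 7
pooled = max_pool(feature_map)         # 3 x 3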

Typical CNNs used in deep learning have multiple such Convolution + Pooling layers, each caring less and less about the actual pixel values and more about the general content of the image. Feature Maps at Layer N+1 will take inputs from all the compressed/pooled maps from Layer N in a typical scenario. Moreover, the number of Feature Maps at each layer is not a constant, and is usually decided by trial-and-error (as are most design decisions in Machine Learning).

Recreating the Content of an Image

Neural networks in general have a very handy property: the ability to work in reverse (well, sort of). Basically, "How do I change the current input so that it yields a certain output?". Let's see how.

Consider a CNN C, trained to recognize animals in input images. Given a genuine photo of a dog, the CNN might be able to classify it correctly by virtue of its convolutional layers and the final classifier. But now suppose I show it an image of just…clouds. Forget the final classifier, the intermediate layers are more interesting here. Since C was originally trained to look for features of animals, that is exactly what it will try to do here! It might interpret random clouds and shapes as animals/parts of animals – a form of artificial pareidolia (the psychological phenomenon of perceiving patterns where none exist).

You can actually visualize what a particular layer of the CNN interprets from the image. Suppose the original cloud-image was I_c:

[Image: the cloud image I_c]

Say at a certain level l of C, the Feature Maps gave an output  F_l based on I_c.

What we will do now, is provide C with a white-noise image I_n:

[Image: the white-noise image I_n]

This sort-of works like a blank-slate for C, since it has no real information to interpret (though C can still ‘see’ patterns, but very very vaguely). Now, using the process of Gradient Descent, we can make C modify I_n so that it yields an output close to F_l at level l.

What it essentially does is iteratively shift the pixel values of I_n until its output at l is similar to that of I_c. One key point: even after the end of this process, I_n will not really become the same as I_c. Think about it – you have recreated I_c based on the CNN's interpretation of I_c, which involves a lot of intermediate convolutions and pooling. The higher the level l you choose for re-creating the image, the deeper the pareidolia based on the CNN's training – or, the more 'abstract' the interpretations.
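Here's a rough sketch of that optimization loop in PyTorch. The tiny random convolution below is just a stand-in for the pre-trained CNN C, and a plain mean-squared error measures how far I_n's feature-map output is from F_l – only the pixels of I_n get updated, never the network's weights:

import torch
import torch.nn.functional as F

# a tiny stand-in for the convolutional layers of C, up to level l
# (in practice you would use a pre-trained network like VGGNet instead)
torch.manual_seed(0)
conv_weights = torch.randn(8, 3, 3, 3)            # 8 filters over a 3-channel image

def features_at_l(img):
    return F.relu(F.conv2d(img, conv_weights, padding=1))

I_c = torch.rand(1, 3, 64, 64)                    # the 'cloud' image
F_l = features_at_l(I_c).detach()                 # target output at level l

I_n = torch.rand(1, 3, 64, 64, requires_grad=True)  # white-noise starting image
optimizer = torch.optim.Adam([I_n], lr=0.05)

for step in range(200):
    optimizer.zero_grad()
    loss = F.mse_loss(features_at_l(I_n), F_l)    # how far is I_n's output from F_l?
    loss.backward()
    optimizer.step()                              # shift the pixel values of I_n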

In fact, this is pretty similar to what DeepDream does for understanding what a deep CNN has ‘learnt’ from its training. The cloud image I showed earlier was indeed used with a CNN trained to recognize animals, leading to some pretty weird imagery:

[Images: DeepDream outputs generated from the cloud image]

Now, the paper we use as reference wants to recreate the content of an image pretty accurately, so how do we avoid such misinterpretation of shapes? The answer lies in the use of a powerful CNN trained to recognize a wide variety of objects – like the one developed by Oxford’s Visual Geometry Group (VGG) – VGGNet. VGGNet is freely available online, pre-trained and ready-made (Tensorflow example).

Recreating the Style of an Image

In the last section, we saw that the output from Feature Maps at a certain level (F_l) could be used as a ‘goal’ to recreate an image with conceptually similar content. But what about style or texture?

Intuitively speaking, the style of an image is not as much about the actual objects in it, but rather the co-occurrence of features/shape in the overall visual (Reference). This idea is quantified by the Gramian matrix with respect to the Feature Maps: G(F_l).

Suppose we have n different Feature Maps at level l of CNN C. G(F_l) is a matrix of dimensions n x n, with the element at position [i, j] being the inner product between Feature Maps i and j. Quoting an answer from this Stack-Exchange question, "the inner product between x and y is indicative of how much of y could be described using x". Essentially, in this case, it quantifies how similar the trends are between the numbers present in Feature Maps i and j ("do triangles and circles occur together in this image?").
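In code, G(F_l) is just an n x n matrix of inner products between the flattened feature maps – a minimal numpy sketch with made-up dimensions:

import numpy as np

def gram_matrix(feature_maps):
    # feature_maps: array of shape (n, height, width) - the n maps at level l
    n = feature_maps.shape[0]
    flat = feature_maps.reshape(n, -1)       # each map flattened into a vector
    return flat.dot(flat.T)                  # G[i, j] = inner product of maps i and j

F_l = np.random.rand(16, 32, 32)             # pretend output of 16 feature maps
G = gram_matrix(F_l)
print(G.shape)                               # (16, 16)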

Thus, G(F_l) is used as the Gradient-Descent ‘goal’ instead of F_l while re-creating the artistic style of a photo/image.

The following stack shows style (not content) recreations of the Composition VII painting by Kandinsky. As you go lower, the images are based on progressively higher/deeper layers of the CNN:

[Figure: style reconstructions of Composition VII, based on progressively deeper layers]

As you will notice, higher layers tend to reproduce more complex and detailed strokes from the original image. This could be attributed to the capture of more high-level details by virtue of feature-extraction and pooling in the Convolutional Network.

Combining Content and Style from two different Images

That brings us to the final part – combining the above two concepts to achieve something like this:

[Image: the content of one photograph rendered in the style of another]

Gradient Descent always considers a target ‘error function’ to minimize while performing optimization. Given two vectors x and y, let this function be denoted by \Lambda(x, y).

Suppose you want to generate an image that has the content of image I_c in the style of image I_s. The white-noise image you start out with, is I_n. Let F^{I} be the output given by a certain set of feature maps based on image I.

Now, if you were only looking to recreate content from I_c, you would be minimizing:

\Lambda(F^{I_n}, F^{I_c})

If you were only interested in the style from I_s, you would minimize:

\Lambda(G(F^{I_n}), G(F^{I_s}))

Combining the two, you get a new function for minimizing:

\alpha*\Lambda(F^{I_n}, F^{I_c}) + \beta*\Lambda(G(F^{I_n}), G(F^{I_s}))

\alpha and \beta are basically the weightage you give to the content and style respectively.
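Assuming \Lambda is a plain mean-squared error (the paper's actual formulation adds per-layer weights and normalization constants), the combined objective might look like this:

def combined_loss(F_n, F_c, G_n, G_s, alpha=1.0, beta=1000.0):
    # F_n, F_c: feature-map outputs (numpy arrays) for the noise image I_n and content image I_c
    # G_n, G_s: Gram matrices for I_n and the style image I_s
    # alpha, beta: how much you care about content vs style (values here are arbitrary)
    content_term = ((F_n - F_c) ** 2).mean()
    style_term = ((G_n - G_s) ** 2).mean()
    return alpha * content_term + beta * style_term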

The tiles shown below depict output from the same convolutional layer, but with higher values of \alpha / \beta as you go to the right:

[Figure: outputs for increasing values of \alpha / \beta]

Pretty cool, isn’t it?

Understanding the new Google Translate


Google launched a new version of Translate in September 2016. Since then, there have been a few interesting developments in the project, and this post attempts to explain it all in as simple terms as possible.

The earlier version of Translate used Phrase-Based Machine Translation, or PBMT. What PBMT does is break up an input sentence into a set of words/phrases and translate each one individually. This is obviously not an optimal strategy, since it completely misses out on the context of the overall sentence. The new Translate uses what Google calls Google Neural Machine Translation (GNMT), an improvement over traditional NMT. Let's see how GNMT works at a high level:

The Encoder

Before you understand the encoder, you must understand what an LSTM (Long Short-Term Memory) cell is. It is basically a Neural Network with some concept of memory. An LSTM is generally used to 'learn' patterns in time-series/temporal data. At any given point, it accepts the latest input vector and produces the intended output using a combination of (the latest input + some 'context' regarding what it saw before):

[Figure: an LSTM cell]

In the above picture, x_t is the input at time t. h_{t-1} represents the context from t-1. If x_t has a dimensionality of d, h_{t-1} of dimensionality 2d is a concatenation of two vectors:

  1. The intended output by the same LSTM at the last time-step t-1 (the Short Term memory), and
  2. Another d-dimensional vector encoding the Long Term memory – also called the Cell State.

The second part is usually not of use to the next component in the architecture. It is instead used by the same LSTM for the following step. LSTMs are usually trained by providing them with a ton of example input-series with the expected outputs. This enables them to learn what parts of the input to retain/hold, and how to mathematically process x_t and h_{t-1} to come up with h_t. If you wish to understand LSTMs better, I recommend this blog post by Christopher Olah.
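For the curious, here's a bare-bones sketch of a single LSTM step in numpy – biases are omitted and the weights are random rather than learnt, but it shows how x_t, the previous output and the cell state interact:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W):
    # x_t: latest input, h_prev: short-term memory (last output),
    # c_prev: long-term memory (cell state). W holds all learned weights.
    z = np.concatenate([x_t, h_prev])
    f = sigmoid(W['f'].dot(z))            # forget gate: what to drop from c_prev
    i = sigmoid(W['i'].dot(z))            # input gate: what new info to store
    o = sigmoid(W['o'].dot(z))            # output gate: what to expose as h_t
    c_tilde = np.tanh(W['c'].dot(z))      # candidate cell-state update
    c_t = f * c_prev + i * c_tilde
    h_t = o * np.tanh(c_t)
    return h_t, c_t

d = 4
W = {k: np.random.randn(d, 2 * d) * 0.1 for k in 'fioc'}
h, c = np.zeros(d), np.zeros(d)
for x in np.random.randn(10, d):          # a toy time-series of 10 steps
    h, c = lstm_step(x, h, c, W)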

An LSTM can also be ‘unfolded’, as shown below:

[Figure: an LSTM ‘unfolded’ across time-steps]

Don’t worry, they are copies of the same LSTM cell (hence the same training), each feeding its output to the next one in line. What this allows us to do is give in the entire set of input vectors (in essence, the whole time-series) all at once, instead of going step-by-step with a single copy of the LSTM.

GNMT’s encoder network is essentially a series of stacked LSTMs:

[Figure: GNMT’s encoder – a stack of 8 LSTMs]

Each horizontal line of pink/green boxes is an 'unfolded' LSTM on its own. The above figure therefore has 8 stacked LSTMs in a series. The input to the whole architecture is the ordered set of tokens in the sentence, each represented in the form of a vector. Mind you, I said tokens – not words. What GNMT does in pre-processing is break up all words into tokens/pieces, which are then fed as a series to the neural network. This enables the framework to (at least partially) understand unseen complicated words. For example, suppose I say the word 'Pteromerhanophobia'. Even though you may not know exactly what it is, you can tell me that it is some sort of fear based on the token 'phobia'. Google calls this approach Wordpiece modeling. The break-up of words into tokens is done based on statistical learning (which group of tokens makes the most sense?) from a huge vocabulary in the training phase.

When you stack LSTMs, each layer learns a pattern in the time series fed to it by the earlier (lower) layer. As you go higher up the ladder, you see more and more abstract patterns from the data that was fed in to the lowest layer. For example, the lowest layer might see a set of points and deduce a line, the next layer will see a set of lines and deduce a polygon, the next will see a set of polygons and learn an object, and so on… Of course, there is a limit to how many and in what way you should stack LSTMs together – more is not always better, since you will ultimately end up with a model that's too slow and difficult to train.

There are a few interesting things about this architecture shown above, apart from the stacking of LSTMs.

[Figure: the bottom layers of the encoder]

You will see that the second layer from the bottom is green in color. This is because the arrows – the ordering of tokens in the sentence – are reversed for this layer. Which means that the second LSTM sees the entire sentence in reverse order. The reason to do this is simple: when you look at a sentence as a whole, the 'context' for any word is not just contained in the words preceding it, but also in the words following it. The two bottom-most layers both see the raw sentence as input, but in opposite order. The third LSTM gets this bidirectional input from the first two layers – basically, a combination of the forward and backward context for any given word. Each layer from this point on learns higher-level patterns in the contextual meanings of words in the sentence.

You might also have noticed the ‘+’ signs that appear before providing inputs to the fifth layer and above. This is a form of Residual Learning. This is what happens from layer 5 onwards: For every layer N+1, the input is an addition of the output of layers N and N-1. Take a look at my post on Residual Neural Networks to get a better understanding of what this does.

[Figure: residual connections in the encoder]

Lastly, you can see the extra <2es> and </s> characters at the end of the input to the encoder. </s> represents ‘end of input’. <2es>, on the other hand, represents the Target Language – in this case, Spanish. GNMT does this unique thing where they provide the Target Language as input to the framework, to improve performance of Translate. More on this later.

Attention Module and the Decoder

[Figure: the Attention Module and Decoder]

The Encoder produces a set of ordered output-vectors (one for each token in the input). These are then fed into the Attention Module & Decoder framework. To a large extent, the Decoder is similar to the Encoder in design – stacked LSTMs and residual connections. Let's discuss the parts that are different.

I have already mentioned that GNMT considers the entire sentence as input, in every sense. However, it is intuitive to think that for every token that the decoder will produce, it should not give equal weightage to all vectors (tokens) in the input sentence. As you write out one part of the story, your focus should slowly drift to the rest of it. This work is done by the Attention Module. What the Attention Module gets as input is the complete output of the Encoder and the latest vector from the Decoder stack. This lets it 'understand' how much/what has already been translated, and it then directs the Decoder to shift attention to the other parts of the Encoder output.

The Decoder LSTM-stack keeps outputting vectors based on the input from the Encoder and directions from the Attention module. These vectors are given to the Softmax Layer. You can think of the Softmax Layer as a Probability distribution-generator. Based on the incoming vector from the topmost LSTM, the Softmax Layer assigns a probability to every possible output token (remember the target language was already provided to the Encoder, so that information has already been propagated). The token that gets the maximum probability is written out.
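As a sketch, the Softmax layer's job amounts to something like this (with a made-up score vector standing in for the real decoder output):

import numpy as np

def softmax(v):
    # turn the decoder's raw output vector into a probability distribution
    e = np.exp(v - v.max())         # subtract the max for numerical stability
    return e / e.sum()

scores = np.array([2.1, 0.3, -1.0, 4.5])     # one score per candidate token
probs = softmax(scores)
print(probs.argmax())                         # index of the token that gets written out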

The whole process stops once the Decoder/Softmax decides that the current token is </s> (or end-of-sentence). Note that the Decoder does not have to follow a number of steps equal to the output vectors from the Encoder, since it is paying weighted attention to all of those at every step of computation.

Overall, this is how you can visualize the  complete translation process:

[Figure: the complete translation process]

Training & Zero-Shot Translation

The complete framework (Encoder+Attention+Decoder) is trained by providing it a huge collection of (input, translated) pairs of sentences. The architecture ‘knows’ the input language in a sense when it converts tokens from the incoming sentence to the appropriate vector format. The target language is provided as a parameter as well. The brilliance of deep-LSTMs lies in the fact that the neural network learns all of the computational stuff by itself, using a class of algorithms called Backpropagation/Gradient Descent.


Here's another amazing discovery made by the GNMT team: simply by providing the target language as an input to the framework, it is able to perform Zero-Shot Translation! What this basically means is: if during training you provide it examples of English->Japanese & English->Korean translations, GNMT automatically does Japanese->Korean reasonably well! In fact, this is the biggest achievement of GNMT as a project. The intuition: what the Encoder essentially produces is a form of interlingua (or universal language). Whenever I say 'dog' in any language, you end up thinking of a friendly canine – essentially, the concept of 'dog'. This 'concept' is what is produced by the Encoder, and it is irrespective of any language. In fact, some articles went so far as to say that Google's AI had invented a language of its own :-D.

Providing the target language as input allows GNMT to easily use the same neural network for training with any pair of languages, which in turn allows zero-shot translations. As a result, the new Translate gets closer than ever before to the way humans perform translations in their mind.

Here are some references if you want to read further on this subject 🙂 :

  1. First blog post about GNMT on the Google Research Blog. (Corresponding Research Paper)
  2. Second blog post about Zero-Shot Translations. This one made the biggest splash. (Corresponding Research Paper)
  3. A great NYTimes article that tells the story behind this Google Translate.

On Interpretable Models

Artificial Intelligence is everywhere today. And as intelligent systems get more ubiquitous, the need to understand their behavior becomes equally important. Maybe if you are developing an algorithm to recognize a cat-photo for fun, you don’t need to know how it works as long as it delivers the results. But if you have deployed a model to predict whether a person will default on a loan or not, and you use it to make your decisions, you better be sure you are doing the right thing – for practical, ethical AND legal reasons.

From Dictionary.com,

interpretability – to give or provide the meaning of; explain; explicate; elucidate

 

Why do models need to be interpretable?

The primary reason why we need explainability in AI is to develop a sense of understanding and trust. Think about it – the only way you would ever delegate an important task to someone else is if you had a certain level of trust in their thought process. If, for instance, Siri makes an error in understanding your commands, that's fine. But now consider self-driven cars. The reason why most people would not readily sit in a self-driven car within a city is that we cannot guarantee it will do the right thing in every situation. Interpretability is thus crucial for building trust towards models, especially in domains like healthcare, finance and the judicial system.

Interpretability is also important while debugging problems in a model's performance. These problems might be caused by the algorithm itself, or by the data being used to train it. And you may not really observe these issues until you deploy a compound system that uses this model. Let's take the example of Google's Word2Vec. Word2Vec is currently one of the best algorithms for computing word-embeddings given a significant amount of text. It was originally trained on a 3 million-word corpus of Google News articles. In some research conducted by people from Boston University and Microsoft Research, they found a ton of hidden sexism in the word-embeddings generated from that dataset. For example, the framework came up with this particular analogy: "man : computer programmer :: woman : homemaker". Funny, ain't it? This was not a problem with the algorithm itself, but a consequence of the way news articles are usually written. Quoting the source, "Any bias contained in word embeddings like those from Word2vec is automatically passed on in any application that exploits it.".

How do we increase interpretability of models?

There are two ways to promote interpretability when it comes to Machine Learning/AI systems: Transparency, and Post-Hoc Explainability. Algorithmic transparency would mean that you understand the way your model works on an intuitive level, with respect to the dataset you used for training. A Decision Tree, for example, is pretty transparent – in fact, you can use the paths from the root to every leaf node to decompose the Tree into the set of rules used for classification. But a deep Neural Network is not so transparent, for obvious reasons. Though you may understand linear algebra and back-propagation, you will typically not be able to make sense of the weights/biases learned by a deep-NN after training.

Transparency has two aspects: Decomposability, and Simultaneity. Decomposability would mean understanding each individual component of your model. In essence, there should not be a ‘black-box’ component of the system in your eyes. Simultaneity, on the other hand, indicates an understanding of how all these individual components work together as a whole. And the former does not necessarily imply the latter – consider an algorithm as simple as linear regression. You would probably know that if the weight with respect to a predictor is positive after training, it shows a direct proportionality to the target variable. Now, if you train a simple linear regression of Disease-risk vs Vaccination, you would most probably get a negative weight on the Vaccination variable. But if you now take tricky factors such as immunodeficiency or age (old-age or infancy) into the picture, the weight might take on a whole different value. In fact, as the number of predictor variables goes on increasing in regression, it gets more and more difficult to understand how your model will behave as a whole. And thus, the notion that a ‘simple’ model (linear regression) would be far easier to interpret than a ‘complex’ one (deep learning) is misleading.

Post-Hoc means ‘occurring after an event’. In the context of model transparency, post-hoc interpretation would mean an effort to understand its behavior after it has finished training, typically using some test inputs. Some models/algorithms inherently have the ability to ‘explain’ their behavior. For example, take k-NN classifiers. Along with the required output, you can hack (with minimal effort) the model to return the k-nearest neighbors as examples for scrutiny. This way, you get a good idea of the combination of properties that produce similar results by looking at the known training points.
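With scikit-learn, for instance, that 'hack' is a single extra call – a quick sketch on the Iris dataset:

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
clf = KNeighborsClassifier(n_neighbors=5).fit(X, y)

query = X[:1]                          # some test input
print(clf.predict(query))              # the required output (a label)
dist, idx = clf.kneighbors(query)      # ...plus the neighbours that produced it
print(idx[0])                          # indices of the 5 training points you can show as an 'explanation'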

Most algorithms don’t have such easy post-hoc interpretability, though. In such cases, you have to use techniques such as visualization to understand how they behave/work. For instance, you could use a dimensionality reduction technique such as t-SNE to reduce vector points to 2-3 dimensions and visualize class 'regions' in 2D/3D space. Essentially, you are enabling easy visualization of higher-dimensional data by embedding it in a lower-dimensional space. Saliency maps are a technique used to interpret deep neural networks. In Natural Language Processing, textual explanations are also being adopted. Since humans usually understand words better than raw numbers, providing text-based explanations makes sense. For example, in a system like LSI, you could 'understand' a word's embedding by (proportionately) looking at the words that strongly belong to the latent topic(s) it most relates to.

Conclusion and further reading

I did kind-of imply that interpretability is required so that we end up trusting automated systems as much as we trust humans. But as it turns out, it's not like human actions are perfectly explainable. There is a ton of research in psychology that clearly indicates that the motivations for our actions are not as clear as we ourselves tend to believe. The Illusion of Conscious Will by Daniel Wegner talks about how our decisions tend to be influenced by subconscious processes without us realizing it. Moreover, it seems contradictory to the ultimate aim of AI to avoid building models that we cannot 'understand'. If there will be machine intelligence smarter than us, the likelihood of us understanding it completely is pretty slim (Terminator, anyone?).

Here are a couple of links for you to look at, if you want to read more:

  1. EU’s General Data Protection Regulation (GDPR) makes interpretability a top priority.
  2. David Gunning’s summarization of the work on Explainable AI at the International Joint Conference on Artificial Intelligence.
  3. “Is Artificial Intelligence Permanently Inscrutable?” by Aaron Bornstein (Has some good discussion in the Comments section)

The basics of Likelihood

Anyone who has done some course in statistics or data science must have come across the term 'likelihood'. In non-technical language, likelihood is synonymous with probability. But ask any mathematician, and their interpretation of the two concepts is significantly different. I went digging into likelihood for a bit this week, so I thought of putting down the basics of what I revisited.

Whenever we talk about understanding data, we talk about models. In statistics, a model is usually some sort of parameterized function – like a probability density function (pdf) or a regression model. But the effectiveness of the model’s outputs will only be as good as its fit to the data. If the characteristics of the data are very different from the assumed model, the bias turns out to be pretty high. So from a probability perspective, how do we quantify this fit?

Defining the Likelihood Function

The two entities of interest are – the data X, and the parameters \theta. Now consider a function F(X, \theta), that returns a number proportional to the degree of ‘fit’ between the two – essentially quantifying their relationship with each other.

There are two practical ways you could deal with this function. If you kept \theta constant and varied the data X being analyzed, you would be getting a function F_1(X) whose only argument is the data. The output would basically be a measure of how well your input data satisfies the assumptions made by the model.

But in real life, you rarely know your model with certainty. What you do have, is a bunch of observed data. So shouldn’t we also think of the other way round? Suppose you kept X constant, but tried varying \theta instead. Now, what you have got is a function F_2(\theta), that computes how well different sets of parameters describe your data (which you have for real).

Mind you, in both cases, the underlying mathematical definition is the same. The input ‘variable’ is what has changed. This is how probability and likelihood are related. The function F_1 is what we call the probability function (or pdf for the continuous case), and F_2 is called the likelihood function. While F_1 assumes you know your model and tries to analyze data according to it, F_2 keeps the data in perspective while figuring out how well different sets of parameters describe it.

The above definition might make you think that the likelihood is nothing but a rewording of probability. But keeping the data constant, and varying the parameters has huge consequences on the way you interpret the resultant function.

Let's take a simple example. Consider you have a set C_n of n different coin tosses, where r of them were Heads, while the others were Tails. Let's say that the coin used for tossing was biased, and the probability of a Heads coming up on it is p. In this case,

F(C_n, p) = {n \choose r} p^r (1-p)^{(n - r)}

Now suppose you made the coin yourself, so you know p = 0.7. In that case,

F_1(C_n) = {n \choose r} 0.7^r 0.3^{(n - r)}

On the other hand, let's say you don't know much about the coin, but you do have a bunch of toss-outcomes from it. You made 10 different tosses, out of which 5 were Heads. From this data, you want to measure how likely it is that your guess of p is correct. Then,

F_2(p) = 252 p^5 (1-p)^5
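In code (a quick sketch using Python 3.8's math.comb):

from math import comb

def likelihood(p, n=10, r=5):
    # F_2(p) for the observed data: n tosses, r of them Heads
    return comb(n, r) * p**r * (1 - p)**(n - r)

for p in (0.3, 0.5, 0.7):
    print(p, likelihood(p))
# p = 0.5 gives the largest value, so it is the most likely of these three guesses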

There is a very, very important distinction between probability and likelihood functions – the value of the probability function sums (or integrates, for continuous data) to 1 over all possible values of the input data. However, the value of the likelihood function does not integrate to 1 over all possible combinations of the parameters.

The above statement leads to the second important thing to note: DO NOT interpret the value of a likelihood function as the probability of the model parameters. If your probability function gave the value of 0.7 (say) for a discrete data point, you could be pretty sure that there would be no other option as likely as it. This is because the sum of the probabilities of all the other points would be equal to 0.3. However, if you got 0.99 as the output of your likelihood function, it wouldn't necessarily mean that the parameters you gave in are the most likely ones. Some other set of parameters might give 0.999, 0.9999 or something higher.

The only thing you can be sure of is this: if F_2(\theta_1) > F_2(\theta_2), then it is more likely that \theta_1 denotes the parameters of the underlying model.

Log Likelihoods for Maximum Likelihood Estimation

The likelihood function is usually denoted as L(\theta | x) (likelihood L of the parameters \theta given the data point x), so we will stick with it from now on. The most common use of likelihood, is to figure out that set of parameters which yields the highest value for it (and thus describes your dataset the best). This method is called Maximum Likelihood Estimation. You maximize the value of the likelihood function in a bid to find the optimal parameters for your model. This trick is applied in many areas of data science, such as logistic regression.

Maximum Likelihood Estimation usually involves computing the partial derivative of the likelihood function with respect to the parameters. We generally deal with the log-likelihood (basically the logarithm of the likelihood function) rather than likelihood itself. Since log is a monotonically increasing function, the optimum value of the likelihood function can be calculated using derivatives of log-likelihood as well. The reason we use logarithms, is to make the process of dealing with derivatives easy. Consider the coin-toss example I gave above:

Your likelihood function for the probability of Heads, given n and r, was:

L(p | r) = {n \choose r} p^r (1-p)^{(n - r)}

Computing log, we get

log(L(p | r)) = log ({n \choose r}) + r log(p) + (n - r) log(1 - p)

To maximise, we will compute the partial derivative with respect to p, and equate to zero.

Using \frac{d(log x)}{dx} = \frac{1}{x} we get,

\frac{r}{p} = \frac{n-r}{1-p}

Solving, we get the intuitive result:

p = \frac{r}{n}
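You can sanity-check this closed-form answer numerically with a crude grid search over p:

import numpy as np
from math import comb, log

n, r = 10, 5
p_grid = np.linspace(0.01, 0.99, 99)
log_L = log(comb(n, r)) + r * np.log(p_grid) + (n - r) * np.log(1 - p_grid)
print(p_grid[np.argmax(log_L)])    # ~0.5, i.e. r/n, matching the derivation above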

In most cases, when you compute likelihood, you would be dealing with a bunch of independent data points x_1, x_2, ..., x_n, rather than a single one. The likelihood of \theta with respect to the data-set X then gets defined as follows:

L(\theta | X) = L(\theta | x_1) L(\theta | x_2) L(\theta | x_3) ... L(\theta | x_n)

Using log, the overall likelihood becomes:

log(L(\theta | X)) = \sum_x{log(L(\theta | x))}
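For a concrete (made-up) set of independent tosses, the per-point log-likelihoods simply add up, and you can compare candidate values of p directly:

import numpy as np

def bernoulli_log_likelihood(p, tosses):
    # tosses: independent 0/1 outcomes; return the sum of per-point log-likelihoods
    tosses = np.asarray(tosses)
    return np.sum(tosses * np.log(p) + (1 - tosses) * np.log(1 - p))

data = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]          # 6 Heads out of 10 (made-up data)
print(bernoulli_log_likelihood(0.6, data))     # higher than...
print(bernoulli_log_likelihood(0.5, data))     # ...this, so p = 0.6 describes the data better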

A small and easy introduction to Transductive Learning

Consider the following problem statement:

Input: a) A set U of labelled examples (x_i, y_i) where every x_i is the input vector, and y_i is the corresponding output label. b) A set V of unlabelled instances x'_i.

Output: The set of expected labels y'_i for all instances in V.

There are two ways (or rather, two philosophies) you could use to solve this problem. Let's look at the first one, which most people in ML are acquainted with:

The Inductive way

Induction, in the context of learning, is the attempted discovery of rules/generalizations based on analysis of collected data. ‘Attempted discovery’ is the key term here – the generalizations are not facts, but approximations based on evidence you have gathered.

The main characteristic of inductive learning is the building of a model – those rules/properties you induce from the data to answer your questions, together make up the model. The learning can happen in a supervised or semi-supervised (or even unsupervised) fashion. What you are basically trying to do, is make generalizations that can help you understand/label unseen instances in the future. Statistical inference (which deals with building parametric models) is one of the techniques used in inductive learning.

Concretely speaking, here's how inductive learning would work for the problem defined above:

Input: U for supervised learning (and V too if we go for semi-supervised learning).

Output: The set of expected labels y'_i for all instances in V.

Approach: Compute a function f, using information in U, such that f(x_i) is as close to y_i as possible over all instances in U. Using this function, compute each y'_i as y'_i = f(x'_i).

 

The other way to go about this would be…

The Transductive way

Read the problem statement at the beginning of this post once again. The output which the problem definition asks for is the labels of instances in V only. What we tried to do in inductive learning is build a model to label any unseen instances in the future. If you think about it, we basically solved a problem that's more general than the one we needed to solve. Instead of building a universal model – which we don't need anyway – could we leverage the information contained in the instances of V (with respect to those in U) to make better predictions for V specifically? That is precisely what Transductive Learning tries to do.

Think about it this way. Suppose you were shown a pack of dogs, where each dog’s breed can be either A or B. Some of the dogs have been tagged with their correct breed, but many are not. You are asked to label the unlabelled dogs with their respective breeds. What would you do? Apart from the characteristics of dogs from A and B, wouldn’t it make sense to observe the unlabelled dogs and their interactions and similarities to those from A and B, to make good guesses for their breeds? That is the philosophy of Transduction.

Transduction, in the context of learning, refers to reasoning from specific observed (training) instances to specific observed (unlabelled) instances. It was introduced by Vladimir Vapnik, with the core thought behind it being this:

“When solving a problem of interest, do not solve a more general problem as an intermediate step. Try to get the answer that you really need but not a more general one.”

To avoid reinventing the wheel, please refer to the Wikipedia example problem to get an intuitive understanding of how many Transductive learning algorithms work. For those of you who are acquainted with Support Vector Machines (SVMs), there's a transductive version of them around too. If you don't want to go into the mathematical details, here's a summary (basically, you incorporate the unlabelled samples into determining what the 'maximum margin' actually is):

[Figure: a Transductive SVM, with unlabelled samples influencing the maximum margin]

It is important to understand that not all Semi-Supervised Learning methods are Transductive in nature. The inclusion of unlabelled samples in the training is not the primary characteristic of transduction – the avoidance of building a general ‘model’ is. We are using the information implicit in the instances whose output is required, to understand them better.

The major disadvantage of Transductive learning is pretty obvious: The information you ‘learn’ cannot be used to label new instances (which you did not have during training) – since you are not building a model. Essentially, every time you want to classify a new set of instances, you have to re-do the whole training again. Therefore, it makes most sense only when your goals (in terms of the instances you want to understand) are specific.

Another thing to note, is that the effectiveness of Transduction can suffer if you have some really noisy samples in your data. Think about the dogs example I gave. If some dogs from the unlabelled ones were from some random breed C (or cross-breeds of A and B), your internal understanding of the entire pack would suffer. You would see some dogs behaving randomly with others, some showing characteristics of both – it might make you question the labelled dogs as well! Since the distribution(s) of unlabelled data are used along with the training data, the integrity of the whole dataset (or at least most of it) is a prerequisite in Transductive learning.

Non-Mathematical Feature Engineering techniques for Data Science

“Apply Machine Learning like the great engineer you are, not like the great Machine Learning expert you aren’t.”

This is the first sentence in a Google-internal document I read about how to apply ML. And rightly so. In my limited experience working as a server/analytics guy, data (and how to store/process it) has always been the source of most consideration and impact on the overall pipeline. Ask any Kaggle winner, and they will always say that the biggest gains usually come from being smart about representing data, rather than using some sort of complex algorithm. Even the CRISP data mining process has not one, but two stages dedicated solely to data understanding and preparation.

So what is Feature Engineering?

Simply put, it is the art/science of representing data in the best way possible.

Why do I say art/science? Because good Feature Engineering involves an elegant blend of domain knowledge, intuition, and basic mathematical abilities. Heck, the most effective data representation ‘hacks’ barely involve any mathematical computation at all! (As I will explain in a short while).

What do I mean by ‘best’? In essence, the way you present your data to your algorithm should denote the pertinent structures/properties of the underlying information in the most effective way possible. When you do feature engineering, you are essentially converting your data attributes into data features.

Attributes are basically all the dimensions present in your data. But do all of them, in the raw format, represent the underlying trends you want to learn in the best way possible? Maybe not. So what you do in feature engineering, is pre-process your data so that your model/learning algorithm has to spend minimum effort on wading through noise. What I mean by ‘noise’ here, is any information that is not relevant to learning/predicting your ultimate goal. In fact, using good features can even let you use considerably simpler models since you are doing a part of the thinking yourself.

But as with any technique in Machine Learning, always use validation to make sure that the new features you introduce really do improve your predictions, instead of adding unnecessary complexity to your pipeline.


As mentioned before, good feature engineering involves intuition, domain knowledge (human experience) and basic math skills. So here are a few extremely simple techniques for you to (maybe) apply in your next data science solution:

1. Representing timestamps

Time-stamp attributes are usually denoted by the EPOCH time or split up into multiple dimensions such as (Year, Month, Date, Hours, Minutes, Seconds). But in many applications, a lot of that information is unnecessary. Consider for example a supervised system that tries to predict traffic levels in a city as a function of Location+Time. In this case, trying to learn trends that vary by seconds would mostly be misleading. The year wouldn't add much value to the model either. Hours, day and month are probably the only dimensions you need. So when representing the time, try to make sure that your model actually requires all the numbers you are providing it.

And not to forget Time Zones. If your data sources come from different geographical sources, do remember to normalize by time-zones if needed.

2. Decomposing Categorical Attributes

Some attributes come as categories instead of numbers. A simple example would be a ‘color’ attribute that is (say) one of {Red, Green, Blue}. The most common way to go about representing this, is to convert each category into a binary attribute that takes one value out of {0, 1}. So you basically end up with a number of added attributes equal to the number of categories possible. And for each instance in your dataset, only one of them is 1 (with the others being 0). This is a form of one-hot encoding.

If you are new to this idea, you may think of decomposition as an unnecessary hassle (we are essentially bloating up the dimensionality of the dataset). Instead, you might be tempted to convert the categorical attribute into a scalar value. For example, the color feature might take one value from {1, 2, 3}, representing {Red, Green, Blue} respectively. There are two problems with this. First, for a mathematical model, this would mean that Red is somehow ‘more similar’ to Green than Blue (since |1-3| > |1-2|). Unless your categories do have a natural ordering (such as stations on a railway line), this might mislead your model. Secondly, it would make statistical metrics (such as mean) meaningless – or worse, misleading yet again. Consider the color example again. If your dataset contains equal numbers of Red and Blue instances but no Green ones, the ‘average’ value of color might still come out to be ~2 – essentially meaning Green!

The safest place to convert a categorical attribute into a scalar, is when you have only two categories possible. So you have {0, 1} corresponding to {Category 1, Category 2}. In this case, an ‘ordering’ isn’t really required, and you can interpret the value of the attribute as the probability of belonging to Category 2 vs Category 1.
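With pandas, for example, the decomposition is a one-liner:

import pandas as pd

df = pd.DataFrame({'color': ['Red', 'Green', 'Blue', 'Red']})
print(pd.get_dummies(df, columns=['color']))
# columns color_Blue, color_Green, color_Red: exactly one of them is 1 per row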

3. Binning/Bucketing

Sometimes, it makes more sense to represent a numerical attribute as a categorical one. The idea is to reduce the noise endured by the learning algorithm, by assigning certain ranges of a numerical attribute to distinct ‘buckets’. Consider the problem of predicting whether a person owns a certain item of clothing or not. Age might definitely be a factor here. What is actually more pertinent, is the Age Group. So what you could do, is have ranges such as 1-10, 11-18, 19-25, 26-40, etc. Moreover, instead of decomposing these categories as in point 2, you could just use scalar values, since age groups that lie ‘closer by’ do represent similar properties.

Bucketing makes sense when the domain of your attribute can be divided into neat ranges, where all numbers falling in a range imply a common characteristic. It reduces overfitting in certain applications, where you don’t want your model to try and distinguish between values that are too close by – for example, you could club together all latitude values that fall in a city, if your property of interest is a function of the city as a whole. Binning also reduces the effect of tiny errors, by ’rounding off’ a given value to the nearest representative. Binning does not make sense if the number of your ranges is comparable to the total possible values, or if precision is very important to you.
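Using pandas again, here's a rough sketch of the age-group example (the bucket boundaries and labels are arbitrary):

import pandas as pd

ages = pd.Series([4, 15, 22, 31, 64])
groups = pd.cut(ages, bins=[0, 10, 18, 25, 40, 100],
                labels=['child', 'teen', 'young', 'adult', 'senior'])
print(groups)      # each age replaced by its bucket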

4. Feature Crosses

This is perhaps the most important/useful one of these. Feature crosses are a unique way to combine two or more categorical attributes into a single one. This is an extremely useful technique when certain features together denote a property better than they do individually. Mathematically speaking, you are doing a cross product between all possible values of the categorical features.

Consider a feature A, with two possible values {A1, A2}. Let B be a feature with possibilities {B1, B2}. Then, a feature-cross between A & B (lets call it AB) would take one of the following values: {(A1, B1), (A1, B2), (A2, B1), (A2, B2)}. You can basically give these ‘combinations’ any names you like. Just remember that every combination denotes a synergy between the information contained by the corresponding values of A and B.

For example, take the diagram shown below:

[Figure: two classes of points spread across the four quadrants of the X-Y plane]

All the blue points belong to one class, and the red ones belong to another. Let's put the actual model aside. First off, you would benefit from binning the X, Y values into {x < 0, x >= 0} & {y < 0, y >= 0} respectively. Let's call them {Xn, Xp} and {Yn, Yp}. It is pretty obvious that Quadrants I & III correspond to class Red, and Quadrants II & IV contain class Blue. So if you could now cross features X and Y into a single feature 'Quadrant', you would basically have {I, II, III, IV} being equivalent to {(Xp, Yp), (Xn, Yp), (Xn, Yn), (Xp, Yn)} respectively.

A more concrete/relatable example of a (possibly) good feature cross is something like (Latitude, Longitude). A common Latitude corresponds to so many places around the globe. Same goes for the Longitude. But once you combine Lat & Long buckets into discrete 'blocks', they denote 'regions' in a geography, with possibly similar properties for each one.
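A crude way to build such a cross (with made-up latitude/longitude buckets) is to simply concatenate the category values, and then decompose the result as in point 2:

import pandas as pd

df = pd.DataFrame({'lat_bucket': ['N1', 'N1', 'N2'],
                   'lon_bucket': ['E5', 'E6', 'E5']})
# cross the two bucketed attributes into a single 'block' feature
df['lat_x_lon'] = df['lat_bucket'] + '_' + df['lon_bucket']
print(df['lat_x_lon'].tolist())     # ['N1_E5', 'N1_E6', 'N2_E5']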

Sometimes, attributes can also be ‘combined’ into a single feature with simple mathematical hacks. In the above example, suppose you define modified features X_{sign} and Y_{sign} as follows:

X_{sign} = \frac{x}{|x|}

Y_{sign} = \frac{y}{|y|}

Now, you could just define a new feature Quadrant_{odd} as follows:

Quadrant_{odd} = X_{sign}Y_{sign}

That's all! If Quadrant_{odd} = 1, the class is Red. Else, Blue!

 

For sake of completeness, I will also mention some mathematically intensive feature engineering techniques, with links for you to read more about them:

5. Feature Selection : Using certain algorithms to automatically select a subset of your original features, for your final model. Here, you are not creating/modifying your current features, but rather pruning them to reduce noise/redundancy.

6. Feature Scaling : Sometimes, you may notice that certain attributes have a higher 'magnitude' than others. An example might be a person's income – as compared to his age. In such cases, for certain models (such as Ridge Regression), it is in fact necessary that you scale all your attributes to comparable/equivalent ranges. This prevents your model from giving greater weightage to certain attributes as compared to others.

7. Feature Extraction : Feature extraction involves a host of algorithms that automatically generate a new set of features from your raw attributes. Dimensionality reduction falls under this category.
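As promised, here is a minimal sketch of the scaling mentioned in point 6 – standardizing a column to zero mean and unit variance with numpy (the income values are made up; scikit-learn's StandardScaler does essentially this for you):

import numpy as np

# Hypothetical income values - on a much larger scale than, say, age in years
incomes = np.array([30000.0, 45000.0, 120000.0, 60000.0])

# Standardization: subtract the mean, divide by the standard deviation
scaled = (incomes - incomes.mean()) / incomes.std()
print(scaled)  # the values now have mean ~0 and standard deviation 1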

The magic behind Attribute Access in Python

Most people know just one thing when it comes to attribute access – the dot ‘.’ (as in x.some_attribute). In simple terms, attribute access is the way you retrieve an object linked to the one you already have. To someone who uses Python without delving too much into the details, it may seem pretty straightforward. However, under the hood, there's a lot that goes on for this seemingly trivial task.

Lets look at each of the components one by one.

The __dict__ attribute

Every object in Python has an attribute denoted by __dict__. This dictionary/dictionary-like (I will explain this shortly) object contains all the attributes defined for the object itself. It maps the attribute name to its value.

Heres an example:


>>> class C(object):
	x = 4

>>> c = C()
>>> c.y = 5
>>> c.__dict__
{'y': 5}

Notice how 'x' is not in c.__dict__. The reason for this is simple enough. While y was defined for the object c, x was defined for its class (C). Therefore, it will appear in the __dict__ of C. In fact, C‘s __dict__ contains a lot of other keys too (including '__dict__'):


>>> c.__class__.__dict__['x']
4
>>> c.__class__.__dict__
dict_proxy({'__dict__': <attribute '__dict__' of 'C' objects>, 'x': 4, 
'__module__': '__main__', '__weakref__': <attribute '__weakref__' of 'C' objects>, 
'__doc__': None})

We will look at what dictproxy means soon.

The __dict__ of an object is simple enough to understand. It behaves like a Python dict, and is one too.


>>> c.__dict__
{'y': 5}
>>> c.__dict__.__class__
<type 'dict'>
>>> c.__dict__ = {}
>>> c.y

Traceback (most recent call last):
  File "<pyshell#81>", line 1, in <module>
    c.y
AttributeError: 'C' object has no attribute 'y'
>>> c.__dict__['y'] = 5
>>> c.y
5

The __dict__ of a class, however, is not that straightforward. It's actually an object of a class called dictproxy. dictproxy is a special class whose objects behave like normal dicts, but differ in some key behaviours.


>>> C.__dict__
dict_proxy({'__dict__': <attribute '__dict__' of 'C' objects>, 'x': 4, '__module__': '__main__', '__weakref__': <attribute '__weakref__' of 'C' objects>, '__doc__': None})
>>> C.__dict__.__class__
<type 'dictproxy'>
>>> C.__dict__['x']
4
>>> C.__dict__['x'] = 6

Traceback (most recent call last):
  File "<pyshell#87>", line 1, in <module>
    C.__dict__['x'] = 6
TypeError: 'dictproxy' object does not support item assignment
>>> C.x = 6
>>> C.__dict__ = {}

Traceback (most recent call last):
  File "<pyshell#89>", line 1, in <module>
    C.__dict__ = {}
AttributeError: attribute '__dict__' of 'type' objects is not writable

Notice how you cannot set a key in a dictproxy directly (C.__dict__['x'] = 6 does not work). You can accomplish the same using C.x = 6 however, since the internal behaviour is different in that case. Also notice how you cannot set the __dict__ attribute itself either (C.__dict__ = {} does not work).

There's a reason behind this seemingly weird implementation. If you don’t want to get into the details, just know that it’s there to keep the Python interpreter working properly, and to enable some internal optimizations. If you want a more detailed explanation, have a look at Scott H’s answer to this StackOverflow question.

Descriptors

A descriptor is an object whose class defines at least one of the following magic methods: __get__, __set__ or __delete__ (remember, methods are ultimately objects in Python). Mind you, it is the descriptor object itself that does the describing – but Python looks these magic methods up on its class, not on the individual object.

Descriptors can help you define the behaviour of an object’s attribute in Python. With each of the magic methods just mentioned, you implement how the attribute (‘described’ by the descriptor) will be retrieved, set and deleted in the object respectively. There are two types of descriptors – Data Descriptors, and Non-Data Descriptors.

Non-Data Descriptors only have __get__ defined; all others are Data Descriptors. You might naturally wonder why the two types are named this way. The answer is intuitive: it is usually data-related attributes that we tend to ‘set’ or ‘delete’ on an object, while other attributes – like methods – we usually don’t. So the latter's descriptors are called Non-Data Descriptors. As with a lot of other things in Python, this is not a hard-and-fast rule, but a convention. You could just as well describe a method with a Data Descriptor – but then its __get__ should return a function.

Here's an example of two classes whose instances will act as data and non-data descriptors respectively:


class DataDesc(object):
    def __init__(self, name):
        self._name = name

    def __get__(self, obj, objclass):
        try:
            print("Retrieving attr " + self._name + " from " +
                  str(obj) + "...")
            return objclass.x + " + " + obj.y
        except:
            raise AttributeError("Attr " + self._name + " could not be " +
                                 "retrieved from " + str(obj))
    
    def __set__(self, obj, value):
        raise AttributeError("Attr " + self._name + " cannot be " +
                             "set in " + str(obj))

    def __delete__(self, obj):
        raise AttributeError("Attr " + self._name + " cannot be " +
                             "deleted in " + str(obj))

class NonDataDesc(object):
    def __init__(self, name):
        self._name = name

    def __get__(self, obj, objclass):
        try:
            print("Retrieving attr " + self._name + " from " +
                  str(obj) + "...")
            return objclass.x + " + " + obj.y
        except:
            raise AttributeError("Attr " + self._name + " could not be " +
                                 "retrieved from " + str(obj))

Notice how the __get__ method takes in an object obj and (its) class objclass. Similarly, setting the value requires obj and some candidate value, while deletion just needs obj. Taking these parameters in (along with the initializer __init__) helps you differentiate between objects of the same descriptor class. Mind you, it is the individual objects that are intended to be the descriptors.
(P.S. If a descriptor does not define __get__, the descriptor object itself gets returned on attribute access.)

Lets use these classes in some code.


class ParentClass(object):
    x = "x1"
    y = "y1"
    data_attr_parent = DataDesc("desc1")
    data_attr_child = DataDesc("desc2")

class ChildClass(ParentClass):
    x = "x2"
    y = "y2"
    data_attr_child = DataDesc("desc3")
    non_data_attr_child = NonDataDesc("desc4")

some_object = ChildClass()

That's it! You can access the ‘described’ attributes as usual in Python.


>>> some_object.data_attr_parent
Retrieving attr desc1 from <__main__.ChildClass object at 0x1062c5790>...
'x2 + y2'

Descriptors are used for a lot of attribute and method related functionality in Python, including static methods, class methods and properties. Using descriptors, you can gain better control over how attributes and methods of a class/its objects are accessed – including defining some ‘behind the scenes’ functionality like logging.
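As a hint of that ‘behind the scenes’ functionality, here is a minimal, hypothetical sketch of a data descriptor that logs every read and write of the attribute it describes (the class and attribute names are mine, not a standard pattern from the docs):

class LoggedAttribute(object):
    # A data descriptor that logs gets and sets of the attribute it describes.
    def __init__(self, name, default=None):
        self._name = name
        self._default = default
        self._values = {}  # naive per-instance storage, keyed by id(obj)

    def __get__(self, obj, objclass):
        if obj is None:  # accessed on the class itself
            return self
        print("Reading " + self._name)
        return self._values.get(id(obj), self._default)

    def __set__(self, obj, value):
        print("Setting " + self._name + " to " + repr(value))
        self._values[id(obj)] = value

class Account(object):
    balance = LoggedAttribute("balance", default=0)

acc = Account()
acc.balance = 100   # prints: Setting balance to 100
print(acc.balance)  # prints: Reading balance, then 100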

Now lets look at the high-level rules governing attribute access in Python.

The Rules

Quoting Shalabh Chaturvedi’s book verbatim, the workflow is as follows:

  1. If attrname is a special (i.e. Python-provided) attribute for objectname, return it.
  2. Check objectname.__class__.__dict__ for attrname. If it exists and is a data-descriptor, return the descriptor result. Search all bases of objectname.__class__ for the same case.
  3. Check objectname.__dict__ for attrname, and return if found. If objectname is a class, search its bases too. If it is a class and a descriptor exists in it or its bases, return the descriptor result.
  4. Check objectname.__class__.__dict__ for attrname. If it exists and is a non-data descriptor, return the descriptor result. If it exists, and is not a descriptor, just return it. If it exists and is a data descriptor, we shouldn’t be here because we would have returned at point 2. Search all bases of objectname.__class__ for the same case.
  5. Raise AttributeError

 

To make things clearer, heres some tinkering using the code we wrote in the Descriptors section (Have a look at it again just to be clear about things):

data_attr_child is a Data descriptor in some_object‘s class, so you can't write over it through the instance. Also, the version in ChildClass (‘desc3’) is used, not the one in ParentClass.


>>> some_object.data_attr_child
Retrieving attr desc3 from <__main__.ChildClass object at 0x1110c9790>...
'x2 + y2'
>>> some_object.data_attr_child = 'xyz'

Traceback (most recent call last):
  File "<pyshell#112>", line 1, in <module>
    some_object.data_attr_child = 'xyz'
  File "/Users/srjoglekar/metaclasses.py", line 16, in __set__
    "set in " + str(obj))
AttributeError: Attr desc3 cannot be set in <__main__.ChildClass object at 0x10883f790>

In fact, even if you make an appropriate entry in some_object‘s dict, it still won’t matter (as per Rule 2 – the data descriptor in the class takes precedence over the instance’s __dict__).


>>> some_object.__dict__['data_attr_child'] = 'xyz'
>>> some_object.data_attr_child
Retrieving attr desc3 from <__main__.ChildClass object at 0x10883f790>...
'x2 + y2'

The Non-Data Descriptor attribute, on the other hand, can be easily overwritten.


>>> some_object.non_data_attr_child
Retrieving attr desc4 from <__main__.ChildClass object at 0x10883f790>...
'x2 + y2'
>>> some_object.non_data_attr_child = 'xyz'
>>> some_object.non_data_attr_child
'xyz'
>>> some_object.__dict__
{'data_attr_child': 'xyz', 'non_data_attr_child': 'xyz'}

You can, however, change the behaviour of data_attr_child if you replace it on some_object‘s class itself (which is what updates the entry in its dictproxy).


>>> some_object.__class__.data_attr_child = 'abc'
>>> some_object.data_attr_child
'xyz'

Notice how, the moment you replace the Data Descriptor in the class with a non-data descriptor (or a plain object like a string, in this case), the entry that we initially made in some_object‘s __dict__ comes into play. Therefore, some_object.data_attr_child returns 'xyz', not 'abc'.

The data_attr_parent attribute behaves similarly to data_attr_child.


>>> some_object.data_attr_parent
Retrieving attr desc1 from <__main__.ChildClass object at 0x10883f790>...
'x2 + y2'
>>> some_object.data_attr_parent = 'xyz'

Traceback (most recent call last):
  File "<pyshell#127>", line 1, in <module>
    some_object.data_attr_parent = 'xyz'
  File "/Users/srjoglekar/metaclasses.py", line 16, in __set__
    "set in " + str(obj))
AttributeError: Attr desc1 cannot be set in <__main__.ChildClass object at 0x10883f790>
>>> some_object.__class__.data_attr_parent = 'xyz'
>>> some_object.__class__.data_attr_parent
'xyz'

Notice how you can't ‘write over’ data_attr_parent through the instance – the data descriptor’s __set__ intercepts that. Setting it on ChildClass itself works, though, and shadows the descriptor defined in ParentClass. Once you do that, attribute access goes through Rules 1-2-3 and stops at 4, to give the result 'xyz'.

Rules for Setting Attributes

Way simpler than the rules for ‘getting them’. Quoting Shalabh’s book again,

  1. Check objectname.__class__.__dict__ for attrname. If it exists and is a data-descriptor, use the descriptor to set the value. Search all bases of objectname.__class__ for the same case.
  2. Insert something into objectname.__dict__ for key "attrname".

Thats it! :-).

__slots__

To put it concisely, __slots__ is a way to disallow objects from having their own __dict__ in Python. This means that if you define __slots__ in a class, then you cannot set arbitrary attributes (apart from the ones mentioned in the ‘slots’) on its objects.

Heres an example of such a class:


class SomeClass(object):
    __slots__ = ['x', 'y']

obj = SomeClass()

Now see how this behaves:


>>> obj.x = 4
>>> obj.y = 5
>>> obj.x
4
>>> obj.y
5
>>> obj.z = 6

Traceback (most recent call last):
  File "<pyshell#135>", line 1, in <module>
    obj.z = 6
AttributeError: 'SomeClass' object has no attribute 'z'

You can of course do this:


>>> obj.__class__.z = 6
>>> obj.z
6

But then, remember you have now defined z in SomeClass‘s __dict__, not in obj‘s.

As Guido van Rossum himself mentions in his blog post, __slots__ was implemented in Python to improve efficiency, not to enforce ‘stricter’ attribute-setting. The basic intuition is this: suppose you have a class whose objects you intend to construct in large numbers. You don’t really need the flexibility of ‘dynamic’ attributes on the objects themselves, but you do want efficiency. Since __slots__ essentially eliminates the __dict__ attribute on each of the objects, you get a lot of memory savings this way.

Interestingly, slots are implemented using descriptors in Python.
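You can see this for yourself – each slot name shows up in the class's __dict__ as a ‘member descriptor’ object, which implements __get__ and __set__ for that slot (output below is from Python 2, matching the transcripts above; Python 3 prints <class 'member_descriptor'> instead):

>>> SomeClass.__dict__['x']
<member 'x' of 'SomeClass' objects>
>>> type(SomeClass.__dict__['x'])
<type 'member_descriptor'>
>>> SomeClass.__dict__['x'].__get__(obj, SomeClass)
4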

 

Further Reading

Have a look at this book I have already quoted in the post. It goes into a lot of detail regarding attribute access in Python, including method resolutions.

Thats all for now. Cheers!

Joined Google :-D

Hello people! I haven’t really blogged in quite some time, and I kind-of feel guilty about it :-). Truth is, I have been busy starting a new job/life at Google, as a Software Engineer at the Hyderabad office. I joined the Google Apps for Work team, and I work on the analytics part of things. I was pretty swamped with setting things up, getting the formalities done, finding a place to live, blah blah – Basically learning to be an adult! Things have finally settled down now, and I have (I think) found my groove when it comes to my overall routine – so I will probably start blogging as usual in the coming weeks.

To answer the obvious question, life at Google is …well…pretty awesome. Let me write out some points about my experience at the company thats supposed to be one of the best employers in the world. And as per my usual style, they will be bullet points, since putting down a coherent/easy-flowing train of thoughts is just beyond my abilities as a writer.

I. The cool work culture!! I don’t know how many other companies do this (since this is my first real job) but its definitely not all of them. Google doesn’t just talk about an open workspace, it does follow it. My team lead engineer sits on a desk thats exactly like the one I use (except for a lot of Swag he has that I don’t), and I can just go drop by if I have any doubts/issues regarding pretty much anything (even non work related). If my mentor needs to talk to me, he will just come by and sit on a bean-bag lying around my desk and discuss things like a friend at college. You literally won’t be able to tell Tech-levels of people at Google, and thats something I feel is really nice about the office culture.

II. Getting intimidated by the people and the technology. This is a big one. I don’t think I have ever felt so…small…in front of people around me, anytime before in my life. To put it honestly, I felt like an idiot around my team for the first week or two. Not because they acted in any such way – in fact I had to stalk them online to know more about them – but because all of them are basically smartasses. And the other aspect is obviously that you get overwhelmed by the internal infrastructure at a company like Google. Its just so vast and there are so many parts and bits and pieces working together, that its difficult to wrap your head around all of them at first. Its all humbling.

III. The perks!! I don’t really think I need to explain this one (since its covered extensively in articles online). We at Hyderabad don’t have the perks that Mountain View does, but its still pretty darn amazing. Free transport, Bunker rooms, Free food (Oh the damn awesome food :-D), Microkitchens, Games rooms on all floors, Amazing gym, Massage centre, Free internet at home, Matching your charity (if you do any), Techstop for chargers and stuff…the list goes on. Who knows, once we have the campus, the list might expand even further!

Trust me, I know I sound like a freaking fanboy throughout the post. But well… I am barely one year out of my college term, and this is pretty much paradise for a luxury-loving dork like me. I hope to have a great life here at Google, and justify the whole process by making as much of an impact as I can. Cheers!

A practical introduction to Functional Programming for Python coders

This post acquaints the reader with the fundamentals of Functional Programming in the context of Python. Most programmers rarely touch languages with a primarily functional focus – such as Lisp or Haskell – except maybe as part of an academic course. Since Python is a widely-used language that supports most functional programming constructs, this post tries to demonstrate their usage and advantages. Functional Programming may not be the best/most Pythonic way of doing everything in this language, but it has its advantages in some applications, and that is what this post is all about.

1. What is Functional Programming?

Functional programming is a programming paradigm that revolves around pure functions. If you have ever done coding in your life, you probably associate functions with subroutines. That is the ‘CS’ perspective. Here, we lean more towards the mathematical definition:

Function_machine2.svg

A pure function is essentially any function that can be represented as a mathematical expression only. That means no side-effects: no I/O operations, no global state changes, no database interactions (you can’t really represent these in a mathematical expression, can you?). The output of a pure function depends ONLY on its inputs. So if you were to call a pure function with the same inputs a million times, you would get the same result every single time. Of course, practical functional programming languages such as Haskell do provide ways to do I/O operations etc., but the focus is still on pure functions.
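For instance (a toy example of my own), the first function below is pure, while the second is not – its result depends on, and modifies, state outside its inputs:

# Pure: the output depends only on the input
def square(x):
    return x * x

# Impure: the output depends on (and changes) external state
total = 0
def add_to_total(x):
    global total
    total += x
    return total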

2. The lambda construct

The easiest way to define a small pure function in Python is by using the lambda keyword:


>>> square_func = lambda x: x**2
>>> square_func(2)
4
>>> some_list = [(1, 3), (5, 2), (6, 1), (4, 6)]
>>> some_list.sort(key=lambda x: x[0]**2 - x[1]*3)
>>> some_list
[(1, 3), (4, 6), (5, 2), (6, 1)]

 

The lambda keyword helps you define functions in a one-line fashion, which is pretty convenient if the body is just a mathematical expression anyway. In fact, lambda is pretty prominent in functional programming (and not just Python), and has its roots in Lambda Calculus – one of the ‘ancestors’ of functional programming.

Functions initialized with lambda are also called anonymous functions. In the sort call above, you are passing a lambda-initialized function as the key argument. You aren’t really giving it a name, just defining it on-the-go and passing it as an argument – hence the term ‘anonymous’. Of course, you can always assign an anonymous function to a variable (as with square_func) and call it like a usual function.

3. Functions as First-Class citizens

In Functional Programming, functions are first-class citizens. Which basically means that you can treat them as any other objects – you can assign them to variables, you can pass them as arguments, or even get them returned from other functions. You already saw some of this happening in the code before. Here’s some more:


>>> square_func = lambda x: x**2
>>> function_product = lambda F, m: lambda x: F(x)*m
>>> square_func(2)
4
>>> function_product(square_func, 3)(2)
12

 

square_func is a function by itself. function_product, on the other hand, is a higher-order function that takes two inputs – a function F(x) and a multiplier m. It returns a function F'(x) which is equal to m*F(x). Therefore, in the last call above, function_product(square_func, 3) is a function that is being called with the argument 2, returning 2^2 * 3 = 12.

4. Immutability of Data and Data Flows

Immutability in the context of objects means that you never modify the value of data once initialized. In Functional Programming, whenever you call a function on some data, you always get new instances as a result – you never ‘update’ the value of any arguments. Programmatically, this implies that once you initialize a variable like x = 3, the variable x will never appear on the LHS of a statement again.

As a result, any functional code can be thought of as a feed-forward data flow. You never ‘come back’ to change the value of any variable, hence data always moves forward from the inputs to the ultimate output(s) – from one function, to another.
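A small illustration of this style – instead of appending to a list in place, you leave the input untouched and build a new value:

# Imperative style: mutates the list that was passed in
def add_item_mutating(lst, item):
    lst.append(item)
    return lst

# Functional style: the input is left untouched, a new tuple is returned
def add_item_functional(tpl, item):
    return tpl + (item,)

original = (1, 2, 3)
updated = add_item_functional(original, 4)
print(original)  # (1, 2, 3) - unchanged
print(updated)   # (1, 2, 3, 4)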


This immutability of data leads to another property called Referential Transparency. It means that the value of an expression is the same anywhere it might occur in the program – as long as the required variables are defined. Since you never update the value of any variables/objects (including functions), they mean the same in any context once defined. For this reason, functional code is extremely easy to analyze and debug. You never need to track the value of state variables or remember any updates.

This enables memoization – you basically ‘remember’ the outputs of expensive functions for commonly-used arguments in a sort of lookup table. This reduces computational complexity at the expense of memory.
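In Python 3.2+, the standard library already provides this lookup table via functools.lru_cache (in Python 2 you could hand-roll the same thing with a dict); a minimal sketch:

from functools import lru_cache

@lru_cache(maxsize=None)
def slow_square(x):
    print("Computing...")  # printed only on a cache miss
    return x * x

print(slow_square(4))  # Computing... then 16
print(slow_square(4))  # 16, returned straight from the cache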

5. Recursion

Functional programming doesn’t really provide for iteration via while or for statements, nor does it allow state updates. As a result, recursion is a strong theme in functional programming. It is worthwhile to remember that any iterative code can be converted to recursive code.

Here’s a functional version of the n-th Fibonacci number calculator:


>>> fibonacci = (lambda x, x_1=1, x_2=0:
	     x_2 if x == 0
	     else fibonacci(x - 1, x_1 + x_2, x_1))
>>> fibonacci(1)
1
>>> fibonacci(5)
5
>>> fibonacci(6)
8

Line 2 defines the base case of the computation, while line 3 makes the recursive call. If you think about it, x, x_1 and x_2 are essentially state variables whose updated versions are being made and passed into every recursive call. This is usually how state is handled in functional code.

It is always better to write tail-recursive code when programming functionally, especially in functional languages such as Scheme (whose implementations are required to optimize tail calls). There’s a good reason for this: tail-recursive code is easily optimized into iterative code by the underlying compiler (though this doesn’t apply to CPython), making the compiled code more efficient.
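To make the distinction concrete, here is a factorial written both ways (remember that CPython will not actually optimize the tail call):

# Not tail-recursive: the multiplication happens after the recursive call returns
def factorial(n):
    if n == 0:
        return 1
    return n * factorial(n - 1)

# Tail-recursive: the recursive call is the last thing the function does,
# with the running product carried along in an 'accumulator' argument
def factorial_tail(n, acc=1):
    if n == 0:
        return acc
    return factorial_tail(n - 1, acc * n)

print(factorial(5))       # 120
print(factorial_tail(5))  # 120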

6. Lazy evaluation

This is an aspect of Functional Programming that Python does not adopt. In many pure functional languages such as Haskell, objects that don’t necessarily need evaluation are not evaluated. Evaluation means computing the value of a function expression. Consider the following line:


length([3 + 4, 5, 1/0])

In a language like Python, the presence of 1/0 would cause an exception immediately. However, under lazy evaluation, the value 3 would be returned – there are 3 elements in the list, and their values don’t need to be computed for the count. This causes a sort of graph reduction in the data flow, leading to fewer function calls (at the risk of ignoring errors).

Python 3.x does support a limited form of laziness: calls such as some_dict.keys() return lightweight view objects rather than full lists, and generators produce values only on demand. This prevents entire sequences from being materialized in memory, and is efficient from a programming perspective.
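Generators are the closest Python gets to lazy evaluation – nothing below is computed until a value is actually asked for:

import itertools

# A generator expression over an infinite sequence: no squares are computed yet
squares = (x * x for x in itertools.count())

# Values are produced one at a time, only on demand
print(next(squares))  # 0
print(next(squares))  # 1

# Similarly, dict.keys() in Python 3 returns a lightweight view, not a full list
d = {'a': 1, 'b': 2}
print(d.keys())  # dict_keys(['a', 'b'])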

7. No iterators as sequences

This is a small point, but since the value of the next element of an iterator depends on its state (which violates Referential Transparency), iterators aren’t present in pure-functional code. Instead, when writing pure-functional code, we deal only with explicit immutable tuples – which you can generate from an iterator in Python using tuple().

8. map, reduce and filter

map, reduce and filter are three higher-order functions that appear in all pure functional languages – and in Python, too. Their prevalence suggests how often they are used in functional code to make it more elegant.

map calls a given function on every element of a list/array – a pattern that lends itself naturally to parallelism. Here’s an example:


>>> map(lambda x: x**2, [1, 2, 3])
[1, 4, 9]

Notice how map can be parallelized – since you are calling the same function independently on every element of the array, the calls can be made in any order. (In Python 3, map returns an iterator, so wrap it in list() to see the values as above.)

filter is another parallelizable operation over a sequence. It takes as input a boolean-returning function and a sequence, and retains only those values from the sequence for which the function returns True. Example:


>>> filter(lambda x: x % 2 == 0, [1,2,3,4,5,6])
[2, 4, 6]

The above line filters out all the odd numbers from the list.

reduce is a construct that performs a serial iteration over a sequence. Its first argument is a function F(a, x), which takes in two arguments – An accumulator a and the current input x. F computes and returns the new value of a. The second argument to reduce is the sequence itself, [x1, x2, ..., x_n], and the third argument is the initial accumulator a. Internally, what it does is:

i.   a_1 = F(a, x1)

ii.  a_2 = F(a_1, x2)

iii. a_3 = F(a_2, x3)

and so on… What you get returned at the end, is a_n. An example:


>>> reduce(lambda x, y: x + y, [1, 2, 3], 0)
6

The accumulator doesn’t have to be of the same type as the elements in the list! It can be a sequence as well. Here’s an example of reduce being used to reverse a sequence:


>>> reduce(lambda L, element: [element] + L, [1, 2, 3], [])
[3, 2, 1]

Notice how no list is being modified in the above line of code. (As a side note, reduce needs to be imported from the functools module in Python 3.x.)

Also, understand that since every call of F that reduce makes on an element depends on the elements that come before it, reduce cannot be parallelized.

Notice the usage of the words ‘map’ and ‘reduce’? These operators in Functional programming provide a good context to understand MapReduce, and are even touched upon in the old Google Lectures on the subject.

End Note 1: This blog post provides very succinct examples and code for you to better understand how functional programming should be done in Python using the aforementioned keywords/operators. It also introduces some advanced techniques such as pipelines. Do have a look!

End Note 2: This article provides some perspective on why functional programming is the preferred approach only in specialized operations, and comments on the (often low) human-readability of functional code.