Weekly Review: 12/23/2017

Happy Holidays, people! If you live in the Bay Area, then the next week is probably your time off, so I hope you have fun and enjoy the holiday season! As for Robotics, I just finished Week 2 of Perception, and will probably kick off Week 3 in 2018. I am excited for the last ‘real’ course (Estimation & Learning), and then building my own robot as part of the ‘Capstone’ project after that :-D.

This week’s articles:

XGBoost

I only recently came across XGBoost (eXtreme Gradient Boosting), an improvement over standard Gradient Boosting – which is a bit of a shame, considering how popular the method is in Data Science. If you are rusty on ensemble learning, take a look at this article on bagging/Random Forests, and my own intro to Boosting.

XGBoost is one of the most efficient implementations of Gradient Boosting, and apparently works really well on structured/tabular data. It also provides features such as sparsity-awareness (the ability to handle missing values) and the ability to update models with ‘continued training’. Its effectiveness on tabular data has made it very popular with Kaggle winners, one of whom famously said: “When in doubt, use xgboost”!
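
As a quick illustration, here is a minimal sketch of training an XGBoost classifier on tabular data through its scikit-learn-style API. It assumes the xgboost and scikit-learn packages are installed; the dataset and hyperparameters are just placeholders.

```python
import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Placeholder tabular dataset; any (X, y) pair works the same way.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Illustrative (untuned) hyperparameters.
model = xgb.XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(X_train, y_train)   # missing values (NaNs) in X are handled natively
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```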

Take a look at the original paper to dig deeper.

Quantum Computing + Machine Learning

Several companies, such as Google and Microsoft, have recently shown interest in the domain of Quantum Computing. Rigetti is a startup that aims to rival these juggernauts with its cloud Quantum Computing platform (called Forest). They even have their own Python integration!

The article in question details their efforts to prototype simple clustering with quantum computing. It is still pretty crude, and is by no means a replacement for traditional systems – for now. One of the major criticisms is that “applying Quantum Computing to Machine Learning will only make a black-box system more difficult to understand”. This is in fact true, but the author suggests that ML could in turn help us understand the behavior of Quantum Computers by modelling them!

Breaking a CAPTCHA with ML

A simple, easy-to-read, fun article on how you could break the simplest CAPTCHA algorithms with CV+Deep Learning.

Learning Indexing Structures with ML

Indexing structures are essentially data structures meant for efficient data access. For example, a B-Tree Index is used for efficient range-queries, a Hash-table is used for fast key-based access, etc. However, all of these data structures are pretty rigid in their behavior – they do not fine-tune/change their parameters based on the structure of the data.

This paper (which includes the Google legend Jeff Dean as an author) explores the possibility of using Neural Networks (in fact, a hierarchy of them) as indexing structures. Basically, you would use a Neural Network to compute the function f: data -> hash/position (a toy sketch follows the key takeaways below).

Some key takeaways from the paper:

  1. Range Index models essentially ‘learn’ a cumulative distribution function.
  2. The overall ‘learned index’ by this paper is a hierarchy of models (but not a tree, since two models at a certain layer can point to the same model in the next layer)
    1. As you go down the layers, the models deal with smaller and smaller subsets of the data.
  3. Unlike a B-Tree, there is no ‘search’ involved, since each model directly predicts the next model to use for hash/position generation.
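
Here is the toy sketch mentioned above: a deliberately simple ‘learned index’ that fits a model mapping keys to positions in a sorted array, then searches only within the model’s known worst-case error. A real learned index would use a hierarchy of models rather than a single linear regression; this only illustrates the idea.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# 10,000 sorted keys; the model learns key -> position, i.e. a scaled CDF.
keys = np.sort(np.random.lognormal(size=10_000))
positions = np.arange(len(keys))

model = LinearRegression().fit(keys.reshape(-1, 1), positions)
pred = model.predict(keys.reshape(-1, 1))
max_err = int(np.ceil(np.max(np.abs(pred - positions))))  # known after training

query = keys[4321]
guess = int(np.round(model.predict([[query]])[0]))
# Search only a window of ~2 * max_err around the guess,
# instead of binary-searching the whole array.
lo = max(0, guess - max_err - 1)
hi = min(len(keys), guess + max_err + 2)
found = lo + np.searchsorted(keys[lo:hi], query)
assert found == 4321
```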

Tacotron 2

This post on the Google Research blog details the development of a WaveNet-like framework to generate Human Speech from text.

 

Weekly Review: 11/18/2017

I finished the Motion Planning course from Robotics this week. This was expected, since the material was largely in line with the data structures and algorithms I studied during my undergrad. The next one, Mobility, seems to be a notch tougher than Aerial Robotics, mainly because of its focus on calculus and physics (neither of which I have touched heavily in years).

Here are the articles for this week:

Neural Networks: Software 2.0

In this article on Medium, the Director of AI at Tesla gives a fresh perspective on NNs. He refers to the set of weights in a Neural Network as a program that is learned, as opposed to coded in by a human. This line of thought is justified by the fact that many decisions in Robotics, Search, etc. are made by parametric ML systems. He also compares it to traditional ‘Software 1.0’, and points out the benefits of each.

Baselines in Machine Learning

In this article, a senior Research Scientist from Salesforce points out that we need to pay greater attention to baselines in Machine Learning. A baseline is any meaningful ‘benchmark’ algorithm that you would compare your algorithm against. The actual reference point would depend on your task – random/stratified systems for classification, state-of-the-art CNNs for image processing, etc. Read Neal’s answer to this Quora question for a deeper understanding.

The article ends with a couple of helpful tips, such as:

  1. Use meaningful baselines, instead of using very crude code. The better your baseline, the more meaningful your results.
  2. Start off by optimizing the baseline itself. Tune the weights, etc. if you have to – this gives you a good base to build your work on.

TensorFlow Lite

TensorFlow Lite is now in Developer Preview. It is a light-weight platform for inference (not training) with ML models on mobile/embedded devices. Google calls it an ‘evolution of TensorFlow Mobile’. While the latter is still the system you should use in production, TensorFlow Lite appears to perform better on many benchmarks (differences here). Some of the major plus-points of this new platform are smaller binaries, and support for custom ML-focused hardware accelerators via the Android Neural Networks API.

Flatbuffers

Reading up on TensorFlow Lite also brought me to FlatBuffers, a ‘lighter’ alternative to Protobufs. FlatBuffers is a data serialization library for performance-critical applications. It provides the benefits of a smaller memory footprint and less generated code, mainly because it skips the parsing/unpacking step. Here’s the GitHub repo.

Adversarial Attacks

This YCombinator article gives a nice overview of adversarial attacks on ML models – attacks that provide ‘noisy’ data inputs to intelligent systems in order to get a ‘wrong’ output. The author points out how gradient descent can be used to reverse-engineer spurious noise that gets data ‘misclassified’ by a neural network. The article also shows examples of such faulty inputs, and they are surprisingly indistinguishable from the original data!
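
For a flavor of how such gradient-based attacks work, here is a rough FGSM-style sketch of my own (not code from the article). It assumes `model` is a trained tf.keras classifier, and `image`/`label` are a correctly classified input and its true class index.

```python
import tensorflow as tf

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()

def adversarial_example(model, image, label, epsilon=0.01):
    x = tf.convert_to_tensor(image[None, ...], dtype=tf.float32)  # add batch dim
    with tf.GradientTape() as tape:
        tape.watch(x)
        loss = loss_fn(tf.constant([label]), model(x))
    grad = tape.gradient(loss, x)        # gradient of the loss w.r.t. the *input*
    return x + epsilon * tf.sign(grad)   # tiny, near-invisible step that raises the loss
```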

 

Weekly Review: 11/04/2017

A busy week. I finished my Aerial Robotics course! The next in the Specialization is Computational Motion Planning, which I am more excited about – mainly because the curriculum leans more towards my areas of expertise. Aerial Robotics was challenging primarily because I was doing a lot of physics/calculus that I had not attempted in a long time.

Onto the articles for this week:

Colab is now public!

Google made Colaboratory, a previously internal tool, public. ‘Colab’ is a document-collaboration tool, with the added benefit of being able to run script-sized pieces of code. This is especially useful if you want to prototype small proofs-of-concept, which can then be shared with documentation and demo-able output. I had previously used it within Google to tinker with TensorFlow, and to write small scripts for database queries.

Visual Guide to Evolution Strategies

The above link is a great introduction to Evolution Strategies such as GAs and CMA-ES. It shows a visual representation of how each of these algorithms converges on the optima, from the first iteration to the last, on simple problems. It’s pretty interesting to see how each algorithm ‘broadens’ or ‘focuses’ the domain of its candidate solutions as the iterations go by.

Baidu’s Deep Voice

In a 2-part series (Part 1 & Part 2), the author discusses the architecture of Baidu’s Text-to-Speech system (Deep Voice). Take a look if you have never read about/worked on such systems and want to have a general idea of how they are trained and deployed.

Capsule Networks

Geoff Hinton and his team at Google recently discussed the idea of Capsule Networks, which try to remedy the rigidity in usual CNNs – by defining groups of specialized neurons called ‘capsules’ whose contribution to higher-level neurons is decided by the similarity of their outputs. Here’s a small intro on Capsule Networks, or the original paper if you want to delve deeper.

Nexar Challenge Results

Nexar released the results of its Deep-Learning challenge on Image segmentation – the problem of ‘boxing’ and ‘tagging’ objects in pictures with multiple entities present. This is especially useful in their own AI-dashboard apps, which need to be quite accurate to prevent possible collisions in deployment.

As further reading, you could also check out this article on the history of CNNs in Image Segmentation, another one on Region-of-Interest Pooling in CNNs, and Deformable Neural Networks. (All of these concepts are mentioned in the main Nexar article)

An introduction to Bayesian Belief Networks

A Bayesian Belief Network (BBN), or simply Bayesian Network, is a statistical model used to describe the conditional dependencies between different random variables.

BBNs are chiefly used in areas like computational biology and medicine for risk analysis and decision support (basically, to understand what caused a certain problem, or the probabilities of different effects given an action).

Structure of a Bayesian Network

A typical BBN looks something like this:

[Figure: the Burglary-Alarm Bayesian network, with its conditional probability tables]

The example shown, ‘Burglary-Alarm’, is one of the most quoted in texts on Bayesian theory. Let’s look at the structural characteristics one by one. We will delve into the numbers/tables later.

Directed Acyclic Graph (DAG)

We obviously have one node per random variable.

Directed: The connections/edges denote cause->effect relationships between pairs of nodes. For example Burglary->Alarm in the above network indicates that the occurrence of a burglary directly affects the probability of the Alarm going off (and not the other way round). Here, Burglary is the parent, while Alarm is the child node.

Acyclic: There cannot be a cycle in a BBN. In simple English, a variable A cannot depend on its own value – directly, or indirectly. If this was allowed, it would lead to a sort of infinite recursion which we are not prepared to deal with. However, if you do realize that an event happening affects its probability later on, then you could express the two occurrences as separate nodes in the BBN (or use a Dynamic BBN).

Parents of a Node

One of the biggest considerations while building a BBN is to decide which parents to assign to a particular node. Intuitively, they should be those variables which most directly affect the value of the current node.

Formally, this can be stated as follows: “The parents of a variable X (parents(X)) are the minimal set of ancestors of X, such that all other ancestors of X are conditionally independent of X given parents(X)“.

Let’s take this step by step. First off, there has to be some sort of a cause-effect relationship between Y and X for Y to be one of the ancestors of X. In the shown example, the ancestors of Mary Calls are Burglary, Earthquake and Alarm.

Now consider the two ancestors Alarm and Earthquake. The only way an Earthquake would affect Mary Calls is if the Earthquake causes the Alarm to go off, leading to Mary Calls. Suppose someone told you that the Alarm has in fact gone off. In this case, it does not matter what led to the Alarm ringing – since Mary will react based on the stimulus of the Alarm itself. In other words, Earthquake and Mary Calls become conditionally independent if you know the exact value of Alarm.

Mathematically speaking, P(Mary Calls|Alarm,Earthquake) == P(Mary Calls|Alarm).

Thus, parents(X) are those ancestors which do not become conditionally independent of X given the value of some other ancestor. If they do, then the resultant connection would actually be redundant.

Disconnected Nodes are Conditionally Independent

Based on the directed connections in a BBN, if there is no way to go from a variable X to Y (or vice versa), then X and Y are conditionally independent. In the example BBN, pairs of variables that are conditionally independent are {Mary Calls, John Calls} and {Burglary, Earthquake}.

It is important to remember that ‘conditionally independent’ does not mean ‘totally independent’. Consider {Mary Calls, John Calls}. Given the value of Alarm (that is, whether the Alarm went off or not), Mary and John each have their own independent probabilities of calling. However, if you did not know about any of the other nodes, but just that John did call, then your expectation of Mary calling would correspondingly increase.

Mathematics behind Bayesian Networks

BBNs provide a mathematically correct way of assessing the effects of different events (or nodes, in this context) on each other. And these assessments can be made in either direction – not only can you compute the most likely effects given the values of certain causes, but also determine the most likely causes of observed events.

The numerical data provided with the BBN (by an expert or some statistical study) that allows us to do this is:

  1. The prior probabilities of variables with no parents (Earthquake and Burglary in our example).
  2. The conditional probabilities of every other node given every value-combination of its parent(s). For example, the table next to Alarm defines the probability that the Alarm will go off given whether an Earthquake and/or Burglary has occurred.

In case of continuous variables, we would need a conditional probability distribution.

The biggest use of Bayesian Networks is in computing revised probabilities. A revised probability defines the probability of a node given the values of one or more other nodes as a fact. Let’s take an example from the Burglary-Alarm BBN.

Suppose we want to calculate the probability that an earthquake occurred, given that the alarm went off, but there was no burglary. Essentially, we want P(Earthquake|Alarm,\sim Burglary). Simplifying the nomenclature a bit, P(E|A,\sim B).

Here, you can say that the Alarm going off (A) is evidence, the knowledge that the Burglary did not happen (\sim B) is context, and the Earthquake occurring (E) is the hypothesis. Traditionally, if you knew nothing else, P(E) = 0.002, from the diagram. However, with the context and evidence in mind, this probability gets changed/revised. Hence, it’s called ‘computing revised probabilities’.

A version of Bayes Theorem states that

P(X|YZ) = \frac{P(X|Z)P(Y|XZ)}{P(Y|Z)} …(1)

where X is the hypothesis, Y is the evidence, and Z is the context. The numerator on the RHS denotes the probability that X and Y both occur given Z, which is a subset of the probability that at least Y occurs given Z, irrespective of X.

Using (1), we get

P(E|A, \sim B) = \frac{P(E|\sim B)P(A|\sim B, E)}{P(A|\sim B)} …(2)

Since E and B are independent phenomena without knowledge of A,

P(E|\sim B) = P(E) = 0.002 …(3)

From the table for A,

P(A|\sim B, E) = 0.29 …(4)

Finally, using the Total Probability Theorem,

P(A| \sim B) = P(E) P(A| E, \sim B) + P(\sim E) P(A| \sim E, \sim B) …(5)

This is nothing but the average of P(A| E, \sim B) and P(A| \sim E, \sim B), weighted by P(E) and P(\sim E) respectively.

Substituting values in (5),

P(A| \sim B) = 0.002 * 0.29 + 0.998 * 0.001 = 0.001578  …(6)

From (2), (3), (4), & (6), we get

P(E|A, \sim B) = 0.367
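
The same calculation can be reproduced in a few lines of Python, using only the numbers quoted above:

```python
p_E = 0.002                   # prior P(Earthquake)
p_A_E_notB = 0.29             # P(Alarm | Earthquake, no Burglary)
p_A_notE_notB = 0.001         # P(Alarm | no Earthquake, no Burglary)

# Total Probability Theorem, Equation (5): P(A | ~B)
p_A_notB = p_E * p_A_E_notB + (1 - p_E) * p_A_notE_notB

# Bayes, Equation (2): P(E | A, ~B) = P(E | ~B) * P(A | E, ~B) / P(A | ~B)
p_E_A_notB = p_E * p_A_E_notB / p_A_notB
print(round(p_E_A_notB, 3))   # ~0.37
```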

As you can see, the probability of the Earthquake actually increases if you know that the Alarm went off but a Burglary was not the cause of it. This should make sense intuitively as well. Which brings us to the final part –

The ‘Explain Away’ Effect

The Explain Away effect, commonly associated with BBNs, is a result of computing revised probabilities. It refers to the phenomenon where knowing that one cause has occurred, reduces (but does not eliminate) the probability that the other cause(s) took place.

Suppose that instead of knowing there has been no burglary, like in our example, you in fact knew that one had taken place, and that it led to the Alarm going off. With this information in mind, your tendency to check out the ‘earthquake’ hypothesis reduces drastically. In other words, the burglary has explained away the alarm.

It is important to note that the probability of the other causes just gets reduced, but does NOT go down to zero. In a stroke of bad luck, it could have happened that both a burglary and an earthquake took place, and either of the two stimuli could have led to the alarm ringing. The extent to which you can ‘explain away’ a piece of evidence depends on the conditional probability distributions.

Understanding the new Google Translate


Google launched a new version of Translate in September 2016. Since then, there have been a few interesting developments in the project, and this post attempts to explain them all in as simple terms as possible.

The earlier version of Translate used Phrase-Based Machine Translation, or PBMT. What PBMT does is break up an input sentence into a set of words/phrases and translate each one individually. This is obviously not an optimal strategy, since it completely misses out on the context of the overall sentence. The new Translate uses what Google calls Google Neural Machine Translation (GNMT), an improvement over traditional NMT. Let’s see how GNMT works at a high level:

The Encoder

Before you understand the encoder, you must understand what an LSTM (Long Short-Term Memory) cell is. It is basically a Neural Network with some concept of memory. An LSTM is generally used to ‘learn’ patterns in time-series/temporal data. At any given point, it accepts the latest input vector and produces the intended output using a combination of the latest input and some ‘context’ regarding what it saw before:

[Figure: an LSTM cell taking input x_t and context h_{t-1}]

In the above picture, x_t is the input at time t. h_{t-1} represents the context from t-1. If x_t has a dimensionality of d, h_{t-1} of dimensionality 2d is a concatenation of two vectors:

  1. The intended output by the same LSTM at the last time-step t-1 (the Short Term memory), and
  2. Another d-dimensional vector encoding the Long Term memory – also called the Cell State.

The second part is usually not of use for the next component in the architecture. It is instead used by the same LSTM for the following step. LSTMs are usually trained by providing them with a ton of example input-series with the expected outputs. This enables them to learn what parts of the input to retain/hold, and how to mathematically process x_t and h_{t-1} to come up with h_t. If you wish to understand LSTMs better, I recommend this blog post by Christopher Olah.

An LSTM can also be ‘unfolded’, as shown below:

[Figure: an LSTM ‘unfolded’ across time steps]

Don’t worry, they are copies of the same LSTM cell (hence the same training), each feeding its output to the next one in line. What this allows us to do is give in the entire set of input vectors (in essence, the whole time-series) all at once, instead of going step-by-step with a single copy of the LSTM.

GNMT’s encoder network is essentially a series of stacked LSTMs:

[Figure: GNMT’s encoder – a stack of 8 LSTM layers]

Each horizontal line of pink/green boxes is an ‘unfolded’ LSTM on its own. The above figure therefore has 8 stacked LSTMs in a series. The input to the whole architecture is the ordered set of tokens in the sentence, each represented in the form of a vector. Mind you, I said tokens – not words. What GNMT does in pre-processing is break up all words into tokens/pieces, which are then fed as a series to the neural network. This enables the framework to (at least partially) understand unseen complicated words. For example, suppose I say the word ‘Pteromerhanophobia’. Even though you may not know exactly what it is, you can tell me that it is some sort of fear based on the token ‘phobia’. Google calls this approach Wordpiece modeling. The break-up of words into tokens is done based on statistical learning (which groups of tokens make the most sense?) from a huge vocabulary in the training phase.

When you stack LSTMs, each layer learns a pattern in the time series fed to it by the earlier (lower) layer. As you go higher up the ladder, you see more and more abstract patterns derived from the data that was fed to the lowest layer. For example, the lowest layer might see a set of points and deduce a line, the next layer will see a set of lines and deduce a polygon, the next will see a set of polygons and learn an object, and so on. Of course, there is a limit to how many LSTMs you should stack together, and in what way – more is not always better, since you will ultimately end up with a model that’s too slow and difficult to train.

There are a few interesting things about this architecture shown above, apart from the stacking of LSTMs.

[Figure: the bottom layers of the encoder – the second (green) layer processes the sentence in reverse]

You will see that the second layer from the bottom is green in color. This is because the arrows – the ordering of tokens in the sentence – are reversed for this layer, which means that the second LSTM sees the entire sentence in reverse order. The reason for doing this is simple: when you look at a sentence as a whole, the ‘context’ for any word is not just contained in the words preceding it, but also in the words following it. The two bottom-most layers both see the raw sentence as input, but in opposite order. The third LSTM gets this bidirectional input from the first two layers – basically, a combination of the forward and backward context for any given word. Each layer from this point on learns higher-level patterns in the contextual meanings of words in the sentence.

You might also have noticed the ‘+’ signs that appear before providing inputs to the fifth layer and above. This is a form of Residual Learning. This is what happens from layer 5 onwards: For every layer N+1, the input is an addition of the output of layers N and N-1. Take a look at my post on Residual Neural Networks to get a better understanding of what this does.
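
To make the encoder’s structure concrete, here is a rough Keras-style sketch in the spirit of the description above: a bidirectional bottom pair of layers, plain stacked LSTMs, and residual additions from layer 5 onwards. All names and sizes are illustrative assumptions, not GNMT’s actual implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

vocab_size, embed_dim, units, max_len = 32000, 256, 512, 40   # illustrative sizes

tokens = tf.keras.Input(shape=(max_len,), dtype="int32")
x = layers.Embedding(vocab_size, embed_dim)(tokens)

# Layers 1-2: the sentence is read in forward and reverse order
x = layers.Bidirectional(layers.LSTM(units, return_sequences=True))(x)

x = layers.LSTM(units, return_sequences=True)(x)   # layer 3
x = layers.LSTM(units, return_sequences=True)(x)   # layer 4
for _ in range(4):                                 # layers 5-8, with residual additions
    out = layers.LSTM(units, return_sequences=True)(x)
    x = layers.Add()([out, x])

encoder = tf.keras.Model(tokens, x)   # one output vector per input token
```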

[Figure: the encoder input, ending with the <2es> target-language token and the </s> end-of-input marker]

Lastly, you can see the extra <2es> and </s> tokens at the end of the input to the encoder. </s> represents ‘end of input’. <2es>, on the other hand, represents the target language – in this case, Spanish. GNMT does this unique thing where it provides the target language as input to the framework, to improve the performance of Translate. More on this later.

Attention Module and the Decoder

[Figure: the Attention Module sitting between the Encoder and Decoder stacks]

The Encoder produces a set of ordered output vectors (one for each token in the input). These are then fed into the Attention Module & Decoder framework. To a large extent, the Decoder is similar to the Encoder in design – stacked LSTMs and residual connections. Let’s discuss the parts that are different.

I have already mentioned that GNMT considers the entire sentence as input, in every sense. However, it is intuitive to think that for every token the decoder produces, it should not give equal weightage to all vectors (tokens) in the input sentence. As you write out one part of the story, your focus should slowly drift to the rest of it. This work is done by the Attention Module. What the Attention Module gets as input is the complete output of the Encoder and the latest vector from the Decoder stack. This lets it ‘understand’ how much/what has already been translated, and it then directs the Decoder to shift attention to the other parts of the Encoder output.

The Decoder LSTM-stack keeps outputting vectors based on the input from the Encoder and directions from the Attention module. These vectors are given to the Softmax Layer. You can think of the Softmax Layer as a Probability distribution-generator. Based on the incoming vector from the topmost LSTM, the Softmax Layer assigns a probability to every possible output token (remember the target language was already provided to the Encoder, so that information has already been propagated). The token that gets the maximum probability is written out.

The whole process stops once the Decoder/Softmax decides that the current token is </s> (or end-of-sentence). Note that the Decoder does not have to follow a number of steps equal to the output vectors from the Encoder, since it is paying weighted attention to all of those at every step of computation.

Overall, this is how you can visualize the complete translation process:

[Animation: the complete GNMT translation process]

Training & Zero-Shot Translation

The complete framework (Encoder+Attention+Decoder) is trained by providing it a huge collection of (input, translated) pairs of sentences. The architecture ‘knows’ the input language in a sense when it converts tokens from the incoming sentence to the appropriate vector format. The target language is provided as a parameter as well. The brilliance of deep-LSTMs lies in the fact that the neural network learns all of the computational stuff by itself, using a class of algorithms called Backpropagation/Gradient Descent.


Here’s another amazing discovery made by the GNMT team: simply by providing the target language as an input to the framework, it is able to perform Zero-Shot Translation! What this basically means is: if during training you provide it examples of English->Japanese & English->Korean translations, GNMT automatically does Japanese->Korean reasonably well! In fact, this is the biggest achievement of GNMT as a project. The intuition: what the Encoder essentially produces is a form of interlingua (or universal language). Whenever I say ‘dog’ in any language, you end up thinking of a friendly canine – essentially, the concept of ‘dog’. This ‘concept’ is what is produced by the Encoder, and it is irrespective of any language. In fact, some articles went so far as to say that Google’s AI had invented a language of its own :-D.

Providing the target language as input allows GNMT to easily use the same neural network for training with any pair of languages, which in turn allows zero-shot translations. As a result, the new Translate gets closer than ever before to the way humans perform translations in their mind.

Here are some references if you want to read further on this subject 🙂 :

  1. First blog post about GNMT on the Google Research Blog. (Corresponding Research Paper)
  2. Second blog post about Zero-Shot Translations. This one made the biggest splash. (Corresponding Research Paper)
  3. A great NYTimes article that tells the story behind this Google Translate.

On Interpretable Models

Artificial Intelligence is everywhere today. And as intelligent systems get more ubiquitous, the need to understand their behavior becomes equally important. Maybe if you are developing an algorithm to recognize a cat-photo for fun, you don’t need to know how it works as long as it delivers the results. But if you have deployed a model to predict whether a person will default on a loan or not, and you use it to make your decisions, you better be sure you are doing the right thing – for practical, ethical AND legal reasons.

From Dictionary.com,

interpretability: to give or provide the meaning of; explain; explicate; elucidate

 

Why do models need to be interpretable?

The primary reason we need explainability in AI is to develop a sense of understanding and trust. Think about it – the only way you would ever delegate an important task to someone else is if you had a certain level of trust in their thought process. If, for instance, Siri makes an error in understanding your commands, that’s fine. But now consider self-driving cars. The reason most people would not readily sit in a self-driven car within a city is that we cannot guarantee it will do the right thing in every situation. Interpretability is thus crucial for building trust towards models, especially in domains like healthcare, finance and the judicial system.

Interpretability is also important while debugging problems in a model’s performance. These problems might be caused by the algorithm itself, or by the data used to train it. And you may not really observe these issues until you deploy a compound system that uses the model. Let’s take the example of Google’s Word2Vec. Word2Vec is currently one of the best algorithms for computing word-embeddings given a significant amount of text. It was originally trained on a 3-million-word corpus of Google News articles. In research conducted by people from Boston University and Microsoft Research, they found a ton of hidden sexism in the word-embeddings generated from that dataset. For example, the framework came up with this particular analogy: “man : computer programmer :: woman : homemaker”. Funny, ain’t it? This was not a problem with the algorithm itself, but with the way news articles are usually written. Quoting the source, “Any bias contained in word embeddings like those from Word2vec is automatically passed on in any application that exploits it.”

How do we increase interpretability of models?

There are two ways to promote interpretability when it comes to Machine Learning/AI systems: Transparency, and Post-Hoc Explainability. Algorithmic transparency would mean that you understand the way your model works on an intuitive level, with respect to the dataset you used for training. A Decision Tree, for example, is pretty transparent – in fact, you can use the paths from the root to every leaf node to decompose the Tree into the set of rules used for classification. But a deep Neural Network is not so transparent, for obvious reasons. Though you may understand linear algebra and back-propagation, you will typically not be able to make sense of the weights/biases learned by a deep-NN after training.

Transparency has two aspects: Decomposability, and Simultaneity. Decomposability would mean understanding each individual component of your model. In essence, there should not be a ‘black-box’ component of the system in your eyes. Simultaneity, on the other hand, indicates an understanding of how all these individual components work together as a whole. And the former does not necessarily imply the latter – consider an algorithm as simple as linear regression. You would probably know that if the weight with respect to a predictor is positive after training, it shows a direct proportionality to the target variable. Now, if you train a simple linear regression of Disease-risk vs Vaccination, you would most probably get a negative weight on the Vaccination variable. But if you now take tricky factors such as immunodeficiency or age (old-age or infancy) into the picture, the weight might take on a whole different value. In fact, as the number of predictor variables goes on increasing in regression, it gets more and more difficult to understand how your model will behave as a whole. And thus, the notion that a ‘simple’ model (linear regression) would be far easier to interpret than a ‘complex’ one (deep learning) is misleading.

Post-Hoc means ‘occurring after an event’. In the context of model transparency, post-hoc interpretation would mean an effort to understand its behavior after it has finished training, typically using some test inputs. Some models/algorithms inherently have the ability to ‘explain’ their behavior. For example, take k-NN classifiers. Along with the required output, you can hack (with minimal effort) the model to return the k-nearest neighbors as examples for scrutiny. This way, you get a good idea of the combination of properties that produce similar results by looking at the known training points.

Most algorithms don’t have such easy post-hoc interpretability, though. In such cases, you have to use techniques such as visualization to understand how they behave/work. For instance, you could use a dimensionality reduction technique such as t-SNE to reduce vector points to 2-3 dimensions and visualize class ‘regions’ in 2D/3D space. Essentially, you are enabling easy visualization of higher-dimensional data by embedding it in a lower-dimensional space. Saliency maps are a technique used to interpret deep neural networks. In Natural Language Processing, textual explanations are also being adopted. Since humans usually understand words better than raw numbers, providing text-based explanations makes sense. For example, in a system like LSI, you could ‘understand’ a word’s embedding by (proportionately) looking at the words that strongly belong to the latent topic(s) it most relates to.
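
As a quick illustration of the t-SNE idea, here is a minimal sketch (assuming scikit-learn and matplotlib are installed) that embeds the 64-dimensional digits dataset into 2D and colors the points by class:

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

X, y = load_digits(return_X_y=True)                 # 64-dimensional digit images
X_2d = TSNE(n_components=2, random_state=0).fit_transform(X)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap="tab10", s=5)
plt.title("Class 'regions' of the digits dataset under t-SNE")
plt.show()
```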

Conclusion and further reading

I did kind-of imply that interpretability is required so that we end up trusting automated systems as much as we trust humans. But as it turns out, it’s not as if human actions are perfectly explainable. There is a ton of research in psychology that clearly indicates that the motivations for our actions are not as clear as we ourselves tend to believe. The Illusion of Conscious Will by Daniel Wegner talks about how our decisions tend to be influenced by subconscious processes without us realizing it. Moreover, it seems contradictory to the ultimate aim of AI to avoid building models that we cannot ‘understand’. If there is ever machine intelligence smarter than us, the likelihood of us understanding it completely is pretty slim (Terminator, anyone?).

Here are a couple of links for you to look at, if you want to read more:

  1. EU’s General Data Protection Regulation (GDPR) makes interpretability a top priority.
  2. David Gunning’s summarization of the work on Explainable AI at the International Joint Conference on Artificial Intelligence.
  3. “Is Artificial Intelligence Permanently Inscrutable?” by Aaron Bornstein (Has some good discussion in the Comments section)

Non-Mathematical Feature Engineering techniques for Data Science

“Apply Machine Learning like the great engineer you are, not like the great Machine Learning expert you aren’t.”

This is the first sentence in a Google-internal document I read about how to apply ML. And rightly so. In my limited experience working as a server/analytics guy, data (and how to store/process it) has always been the source of most consideration and impact on the overall pipeline. Ask any Kaggle winner, and they will always say that the biggest gains usually come from being smart about representing data, rather than using some sort of complex algorithm. Even the CRISP data mining process has not one, but two stages dedicated solely to data understanding and preparation.

So what is Feature Engineering?

Simply put, it is the art/science of representing data in the best way possible.

Why do I say art/science? Because good Feature Engineering involves an elegant blend of domain knowledge, intuition, and basic mathematical abilities. Heck, the most effective data representation ‘hacks’ barely involve any mathematical computation at all! (As I will explain in a short while).

What do I mean by ‘best’? In essence, the way you present your data to your algorithm should denote the pertinent structures/properties of the underlying information in the most effective way possible. When you do feature engineering, you are essentially converting your data attributes into data features.

Attributes are basically all the dimensions present in your data. But do all of them, in the raw format, represent the underlying trends you want to learn in the best way possible? Maybe not. So what you do in feature engineering, is pre-process your data so that your model/learning algorithm has to spend minimum effort on wading through noise. What I mean by ‘noise’ here, is any information that is not relevant to learning/predicting your ultimate goal. In fact, using good features can even let you use considerably simpler models since you are doing a part of the thinking yourself.

But as with any technique in Machine Learning, always use validation to make sure that the new features you introduce really do improve your predictions, instead of adding unnecessary complexity to your pipeline.


As mentioned before, good feature engineering involves intuition, domain knowledge (human experience) and basic math skills. So here are a few extremely simple techniques for you to (maybe) apply in your next data science solution:

1. Representing timestamps

Time-stamp attributes are usually denoted by the EPOCH time or split up into multiple dimensions such as (Year, Month, Date, Hours, Minutes, Seconds). But in many applications, a lot of that information is unnecessary. Consider for example a supervised system that tries to predict traffic levels in a city as a function of Location+Time. In this case, trying to learn trends that vary by seconds would mostly be misleading. The year wouldn’t add much value to the model either. Hours, day and month are probably the only dimensions you need. So when representing the time, try to ensure that your model actually requires all the numbers you are providing it.

And don’t forget Time Zones. If your data comes from different geographical sources, do remember to normalize by time zones if needed.
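
Here is a small pandas sketch of the idea, with made-up timestamps and column names: normalize everything to UTC, then keep only the hour, day-of-week and month.

```python
import pandas as pd

df = pd.DataFrame({"ts": pd.to_datetime(
    ["2017-11-04 08:30:00+05:30", "2017-11-04 17:45:00+00:00"], utc=True)})

df["hour"] = df["ts"].dt.hour            # keep only the dimensions that matter
df["dayofweek"] = df["ts"].dt.dayofweek
df["month"] = df["ts"].dt.month
```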

2. Decomposing Categorical Attributes

Some attributes come as categories instead of numbers. A simple example would be a ‘color’ attribute that is (say) one of {Red, Green, Blue}. The most common way to go about representing this, is to convert each category into a binary attribute that takes one value out of {0, 1}. So you basically end up with a number of added attributes equal to the number of categories possible. And for each instance in your dataset, only one of them is 1 (with the others being 0). This is a form of one-hot encoding.

If you are new to this idea, you may think of decomposition as an unnecessary hassle (we are essentially bloating up the dimensionality of the dataset). Instead, you might be tempted to convert the categorical attribute into a scalar value. For example, the color feature might take one value from {1, 2, 3}, representing {Red, Green, Blue} respectively. There are two problems with this. First, for a mathematical model, this would mean that Red is somehow ‘more similar’ to Green than Blue (since |1-3| > |1-2|). Unless your categories do have a natural ordering (such as stations on a railway line), this might mislead your model. Secondly, it would make statistical metrics (such as mean) meaningless – or worse, misleading yet again. Consider the color example again. If your dataset contains equal numbers of Red and Blue instances but no Green ones, the ‘average’ value of color might still come out to be ~2 – essentially meaning Green!

The safest place to convert a categorical attribute into a scalar, is when you have only two categories possible. So you have {0, 1} corresponding to {Category 1, Category 2}. In this case, an ‘ordering’ isn’t really required, and you can interpret the value of the attribute as the probability of belonging to Category 2 vs Category 1.
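
A minimal sketch of the one-hot decomposition described above, using pandas (the ‘color’ column is just an example):

```python
import pandas as pd

df = pd.DataFrame({"color": ["Red", "Green", "Blue", "Red"]})
one_hot = pd.get_dummies(df["color"], prefix="color")
df = pd.concat([df.drop(columns="color"), one_hot], axis=1)
# -> columns color_Blue, color_Green, color_Red; exactly one of them is 1 per row
```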

3. Binning/Bucketing

Sometimes, it makes more sense to represent a numerical attribute as a categorical one. The idea is to reduce the noise endured by the learning algorithm, by assigning certain ranges of a numerical attribute to distinct ‘buckets’. Consider the problem of predicting whether a person owns a certain item of clothing or not. Age might definitely be a factor here. What is actually more pertinent, is the Age Group. So what you could do, is have ranges such as 1-10, 11-18, 19-25, 26-40, etc. Moreover, instead of decomposing these categories as in point 2, you could just use scalar values, since age groups that lie ‘closer by’ do represent similar properties.

Bucketing makes sense when the domain of your attribute can be divided into neat ranges, where all numbers falling in a range imply a common characteristic. It reduces overfitting in certain applications, where you don’t want your model to try and distinguish between values that are too close by – for example, you could club together all latitude values that fall in a city, if your property of interest is a function of the city as a whole. Binning also reduces the effect of tiny errors, by ‘rounding off’ a given value to the nearest representative. Binning does not make sense if the number of your ranges is comparable to the total possible values, or if precision is very important to you.
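
A small sketch of bucketing an ‘age’ attribute into the ranges mentioned above, using pandas (the bin edges and labels are illustrative):

```python
import pandas as pd

ages = pd.Series([4, 15, 22, 33, 61])
age_group = pd.cut(ages, bins=[0, 10, 18, 25, 40, 120],
                   labels=["1-10", "11-18", "19-25", "26-40", "40+"])
# The result is an ordered categorical; age_group.cat.codes gives the
# scalar encoding 0..4, which keeps 'nearby' groups numerically close.
```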

4. Feature Crosses

This is perhaps the most important/useful one of these. Feature crosses are a unique way to combine two or more categorical attributes into a single one. This is an extremely useful technique when certain features together denote a property better than they do individually. Mathematically speaking, you are taking a cross product of all possible values of the categorical features.

Consider a feature A, with two possible values {A1, A2}. Let B be a feature with possibilities {B1, B2}. Then, a feature-cross between A & B (let’s call it AB) would take one of the following values: {(A1, B1), (A1, B2), (A2, B1), (A2, B2)}. You can basically give these ‘combinations’ any names you like. Just remember that every combination denotes a synergy between the information contained by the corresponding values of A and B.

For example, take the diagram shown below:

[Figure: two classes of 2D points, separated by quadrant]

All the blue points belong to one class, and the red ones belong to another. Let’s put the actual model aside. First off, you would benefit from binning the X, Y values into {x < 0, x >= 0} & {y < 0, y >= 0} respectively. Let’s call them {Xn, Xp} and {Yn, Yp}. It is pretty obvious that Quadrants I & III correspond to class Red, and Quadrants II & IV contain class Blue. So if you could now cross features X and Y into a single feature ‘Quadrant’, you would basically have {I, II, III, IV} being equivalent to {(Xp, Yp), (Xn, Yp), (Xn, Yn), (Xp, Yn)} respectively.

A more concrete/relatable example of a (possibly) good feature cross is something like (Latitude, Longitude). A common Latitude corresponds to many places around the globe; the same goes for the Longitude. But once you combine Lat & Long buckets into discrete ‘blocks’, they denote ‘regions’ in a geography, with possibly similar properties for each one.

Sometimes, attributes can also be ‘combined’ into a single feature with simple mathematical hacks. In the above example, suppose you define modified features X_{sign} and Y_{sign} as follows:

X_{sign} = \frac{x}{|x|}

Y_{sign} = \frac{y}{|y|}

Now, you could just define a new feature Quadrant_{odd} as follows:

Quadrant_{odd} = X_{sign}Y_{sign}

That’s all! If Quadrant_{odd} = 1, the class is Red. Else, Blue!
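
Here is the quadrant example as a few lines of NumPy, with made-up points:

```python
import numpy as np

x = np.array([ 2.0, -1.5, -3.0,  0.5])
y = np.array([ 1.0,  2.5, -2.0, -4.0])

x_sign = np.sign(x)                 # X_sign = x / |x|
y_sign = np.sign(y)                 # Y_sign = y / |y|
quadrant_odd = x_sign * y_sign      # +1 -> Quadrants I/III (Red), -1 -> II/IV (Blue)
```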

 

For sake of completeness, I will also mention some mathematically intensive feature engineering techniques, with links for you to read more about them:

5. Feature Selection : Using certain algorithms to automatically select a subset of your original features, for your final model. Here, you are not creating/modifying your current features, but rather pruning them to reduce noise/redundancy.

6. Feature Scaling : Sometimes, you may notice that certain attributes have a higher ‘magnitude’ than others. An example might be a person’s income, as compared to his age. In such cases, for certain models (such as Ridge Regression), it is in fact necessary that you scale all your attributes to comparable/equivalent ranges. This prevents your model from giving greater weightage to certain attributes as compared to others.

7. Feature Extraction : Feature extraction involves a host of algorithms that automatically generate a new set of features from your raw attributes. Dimensionality reduction falls under this category.

Backpropagation for dummies

Backpropagation is a method of training an Artificial Neural Network. If you are reading this post, you already have an idea of what an ANN is. However, let’s take a look at the fundamental component of an ANN – the artificial neuron.

[Figure: an artificial neuron, with its inputs, weights and activation function]

The figure shows the working of the ith neuron (let’s call it N_i) in an ANN. Its output (also called its activation) is a_i. The jth neuron N_j provides one of the inputs to N_i.

How does N_i produce its own activation?

1. N_i has stored in itself a weightage assigned to each of its inputs. Let’s say that the weightage assigned by N_i to N_j is w_{i, j}. As the first step, N_i computes a weighted sum of all its inputs. Let’s call it in_i. Therefore,

in_i = \sum_{k \in Inputs(N_i)} w_{i, k}a_k … (1)

where Inputs(N_i) corresponds to all neurons that provide an input to N_i.

2. This sum, called the input function, is then passed on to N_i‘s activation function. It is called so, because it computes the activation that will be provided by N_i to the units it in turn acts as an input to. Therefore,

a_i = g(in_i) … (2)

Usually, the activation functions of all neurons in an ANN are the same, so we just write g instead of g_i.

There are many options for g, such as linear functions, rectified linear functions, hyperbolic tan, etc. For the sake of this tutorial, we will use the sigmoid function (I’ll explain why later). So we have

g(x) = \frac{1}{1 + e^{-x}} … (3)
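
Putting Equations 1-3 together, a single neuron’s activation can be computed in a few lines (the input activations and weights below are made-up values):

```python
import numpy as np

def g(x):                                # sigmoid activation, Equation 3
    return 1.0 / (1.0 + np.exp(-x))

a_inputs = np.array([0.2, 0.9, 0.4])     # activations a_k of the input neurons
w = np.array([0.5, -0.3, 0.8])           # weightages w_{i,k} stored by N_i

in_i = np.dot(w, a_inputs)               # Equation 1: weighted sum
a_i = g(in_i)                            # Equation 2: activation of N_i
```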

 

Thus, every neuron in an ANN – except the ones in the lowest-level input layer, and bias neurons – takes in the activation of all its inputs, and provides its own activation as an output.

Once a neural network has been designed, it is the job of the training process to ensure that the individual weightages w_{i, j} given by each neuron to each of its inputs is set just right, so that the whole network gives the right output. Backpropagation is one of the ways to optimize those weights.

Whenever we talk of optimization, it is necessary to properly define what we are optimizing and how we are doing it.

What are we trying to optimize?

The error function. This function defines how much the output of our model (in this case an ANN) differs from the required output. For the sake of this tutorial, we will use the Mean Squared Error (MSE) function. For example, let N_o be one of the output neurons of our ANN. For a given training sample, let the expected output be t_o. However, N_o provides an output equal to its activation value, that is a_o. Therefore, the error as defined by MSE would be

E = \frac{1}{2} (t_o - a_o)^2 … (4)

Since a_o by its definition depends on the weightages given by N_o to its inputs, the ultimate value of E depends on those weightages too.

It is important to remember that every training sample will result in a different error value. It is backpropagation’s goal to minimize (bring as close to zero as possible) the value of E for all training samples on an average. The way this is done, as mentioned before, is by fine-tuning the values of the weights w_{i, j}.

But how does backpropagation fine tune these weightage values? By using a technique called Gradient Descent.

What is Gradient Descent?

Consider a function as follows:

f(x) = 3x + 4

The gradient of the function f with respect to the variable x defines how much the value of f will change with a unit increase/decrease in the value of x. For the above example, the answer to that is 3. Basically, you are differentiating f with respect to x. Generally speaking, a function may be dependent on a variety of parameters, so you take a partial derivative when computing the gradient.

Consider you are on a hill like the one shown in the image below:

[Figure: gradient descent on a hilly terrain]

The red regions correspond to places of higher function value, while the blue regions correspond to low values. You want to reach the lowest point possible. If you are standing on the red hill, and you can only see a small part of the terrain around you, what will you do? You will naturally take steps in the direction in which the terrain slopes down. With each step you take, you will scan your immediate neighbourhood (which is visible to you), and go in the direction that shows the steepest descent.

In mathematical terms, this ‘direction’ is nothing but the gradient of your height (the target function you are trying to minimize), with respect to your location (x, y) (which are the parameters you are fine tuning). The overall direction in which you move, will be dependent on the gradients with respect to both dimensions, i.e. \frac{\partial height}{\partial x} and \frac{\partial height}{\partial y} . That’s pretty much what gradient descent does.

It’s called so because you are descending from a higher value of the target function to a lower value, by following the direction that corresponds to the lowest gradient (and consequently the steepest decrease in function value).

One thing to note is that you would always want to move in the direction shown by the negation of the gradient. Consider the above example of f(x) = 3x + 4 . The gradient here is 3, which is positive. So, with every unit increase in x, f(x) would increase by 3. Therefore, to reduce f(x), you would decrease the value of x . The opposite is true in scenarios where you want to maximize the target function. In those cases, you move in the direction of the gradient itself.
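
Before returning to ANNs, here is a tiny numeric sketch of gradient descent on an illustrative function that actually has a minimum, f(x) = (x - 3)^2 (not one of the examples above):

```python
def grad(x):
    return 2 * (x - 3)             # derivative of f(x) = (x - 3)^2

x, learning_rate = 10.0, 0.1
for _ in range(100):
    x -= learning_rate * grad(x)   # always step against the gradient
print(x)                           # converges towards the minimum at x = 3
```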

Coming back to the context of ANN training, backpropagation reduces the error function E by taking help of the gradients with respect to individual weights, \frac{\partial E}{\partial w_{i, j}} .

How does backpropagation employ gradient descent?

First off, it is necessary to keep in mind that backpropagation tries to optimize the inter-neuron weights, and not the neurons themselves. Their functionality is pretty much decided when you decide the activation function.

Secondly, the error function value E depends on the current values of the weights and the training sample in question, apart from the definition of the activation function. So to compute the error with respect to a training sample, you first feed it as an input to the network. You will then receive the appropriate output(s) from the output neuron(s). This part is known as ‘forward propagation’, since you are going from the input layer to the output layer (through the hidden layers, if any). If you have multiple output neurons, the error values corresponding to each will affect the weights appropriately.

Then, you use the network output(s) along with the expected value(s) to calculate the errors using Equation 4 given above. Once the error values are known, the weights are updated using the gradients. This updating occurs first for the outermost weights, and then for the lower layers. The intuition behind this is the fact that the errors with respect to the outermost neurons are known first, and the errors of the inner neurons are derived from them. Each layer’s errors are calculated from the errors of the neurons it provides input to. That is why the second phase is called ‘backward propagation’, or back-propagation.

So as a first step, let’s try to understand how we will calculate the gradient \frac{\partial E}{\partial w_{i, j}} for a given training sample. Remember that w_{i, j} is the weightage given by N_i to N_j.

Using the chain rule in Leibniz’s form,

\frac{\partial E}{\partial w_{i, j}} = \frac{\partial E}{\partial in_i} \frac{\partial in_i}{\partial w_{i, j}} … (5)

From Equations 1 and 5, and using basic differentiation,

\frac{\partial E}{\partial w_{i, j}} = a_j \frac{\partial E}{\partial in_i} … (6)

From the basic definition of the activation function (and the appropriate current weight values), you already know the value of a_j. Now consider the second term in the above product. Using the chain rule again on it, we get

\frac{\partial E}{\partial in_i} = \frac{\partial E}{\partial a_i} \frac{\partial a_i}{\partial in_i} … (7)

This is where the whole advantage of using the sigmoid activation function comes into play. The derivative of a sigmoid function with respect to its input, corresponds to a nice expression. Given equation 3, you can easily verify that

\frac{dg}{dx} = g(x)(1 - g(x))  … (8)
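
As an aside, Equation 8 is easy to verify numerically with a central-difference check:

```python
import numpy as np

def g(x):
    return 1.0 / (1.0 + np.exp(-x))

x, h = 0.7, 1e-6
analytic = g(x) * (1 - g(x))              # Equation 8
numeric = (g(x + h) - g(x - h)) / (2 * h)
print(analytic, numeric)                  # the two agree closely
```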

From Equations 7 and 8,

\frac{\partial E}{\partial in_i} = a_i (1 - a_i) \frac{\partial E}{\partial a_i} … (9)

Putting the information in Equations 6 and 9 together,

\frac{\partial E}{\partial w_{i, j}} = a_j a_i (1 - a_i) \frac{\partial E}{\partial a_i} … (10)

Look at the last term. There are two possible cases for N_i (N_i cannot be an input neuron, since it is receiving input from another neuron):

1. N_i is an output neuron. In this case, the second term on the RHS is pretty simple. From Equation 4,

\frac{\partial E}{\partial a_i} = - (t_i - a_i) … (11)

where t_i is the expected output from N_i (for the current training sample).

2. N_i is an inner neuron. Consider all neurons N_k for which N_i acts as an input. Since we are propagating backward, you can assume that we already know the values \frac{\partial E}{\partial a_k} for all of them. In fact, since the output of N_i affects the outputs of all these N_k s, the expression for \frac{\partial E}{\partial a_i} will involve all of them.

Using simple chain rule,

\frac{\partial E}{\partial a_i} = \sum_{k} \frac{\partial in_k}{\partial a_i} \frac{\partial E}{\partial in_k} … (12)

From Equations 12, 1 and 9,

\frac{\partial E}{\partial a_i} = \sum_{k} w_{k, i} a_k (1 - a_k) \frac{\partial E}{\partial a_k} … (13)

Equations 10, 11 and 13 pretty much define the gradient values \frac{\partial E}{\partial w_{i, j}} for all weights in a given ANN. So how are the weights actually updated?

Using the following rule:

The change caused by a given training sample, at time instance ‘t’, for the weight w_{i, j} is given by

\Delta w_{i, j}^{t} = - \epsilon \frac{\partial E}{\partial w_{i, j}} + \alpha \Delta w_{i, j}^{t - 1}

Remember me mentioning that we always need to move in the direction of the gradient’s negation? That’s why the minus sign on the first term. The value \epsilon is called the learning rate. We should never be too hasty in changing a weight’s value too drastically based on the gradient with respect to one training sample. Therefore, the gradient value is multiplied by the learning rate (usually well below 0.5) before causing the required change in the weightage value. If the learning rate is too low, the algorithm takes too long to train. On the other hand, if it’s too high, the algorithm keeps ‘bouncing around’ in the search space. Imagine you are running towards point B from point A, but you run so fast that you can’t stop in time at point B and end up far ahead of it. Then you run in the opposite direction towards B, once again crossing it and stopping ahead of it because of your high speed. That’s pretty much what happens if your learning rate is too large.

A variant of backpropagation, called resilient propagation (RPROP), improves on this classic method by having custom deltas for each weightage in a network. It only uses the gradients as references to determine which direction to move in. Therefore, you don’t have to provide the learning rate as a parameter to resilient propagation.

Now look at the second term in the RHS above. The value \alpha is called the momentum. Sometimes, gradient descent tends to get stuck in local optima in the search space. A local optimum is a point where the value is low compared to the region around it, but not the lowest over the whole domain. The momentum term causes the current weight update to be dependent on the last weight update (and therefore not be drastically different), thus preventing getting stuck in local optima in most cases. If the current point is indeed a global optimum, the search will eventually come back to it as the momentum goes on reducing. Nesterov Momentum is another way of optimizing the use of momentum in gradient descent.
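
Putting the update rule together, here is a toy sketch for a single weight, with made-up gradient values standing in for the per-sample \frac{\partial E}{\partial w_{i, j}}:

```python
epsilon, alpha = 0.1, 0.9          # learning rate and momentum (illustrative values)
w, prev_delta = 0.5, 0.0
gradients_per_sample = [0.30, -0.10, 0.05, 0.20]   # stand-ins for dE/dw_ij

for grad in gradients_per_sample:
    delta = -epsilon * grad + alpha * prev_delta   # the update rule above
    w, prev_delta = w + delta, delta
```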

How frequently are the weights updated?

There are 3 answers to this:

1. Online Mode

In this mode, the weights are updated after every training sample. In other words, the forward propagation as well as the backward propagation phases are run for each training sample, every single time. This mode is appropriate if you are receiving your training data sequentially. However, if your training set is large, this mode of training can be pretty time-consuming. Moreover, it is not appropriate to update weights after every sample (in most cases), due to the possible presence of outliers.

2. Batch Mode

In this mode, the weights are updated after accumulating gradients with respect to all samples in the training set. In other words, the forward propagation is run with respect to all samples, and the backward phase is run just once with respect to the accumulated results of the entire batch. This is done for multiple iterations (obviously). The batch mode of training is thus more resistant to variance in the training dataset.

3. Stochastic Mode

The stochastic mode adds randomization to the mix. In the mini-batch method of stochastic learning, small sets of samples are chosen randomly from the dataset, and training is done in batch mode over these mini-batches. Stochastic mode with mini-batches provides a nice trade-off between the pros and cons of online and batch modes of training. Moreover, since you are randomly choosing the mini-batches (instead of organizing your dataset into pre-defined subsets), you are less likely to get stuck in local optima. This is mainly because every iteration provides weight updates with respect to a potentially different subset of training samples.
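Here is the promised sketch contrasting the three modes. It assumes X and y are NumPy arrays, weights is a NumPy array of parameters, and compute_gradients(X, y, weights) is a hypothetical helper (not a real library call) that returns the gradient summed over whatever samples it receives:


import numpy as np

def train(X, y, weights, compute_gradients, epochs=10, lr=0.1,
          mode='mini-batch', batch_size=32):
    """Sketch of online, batch and stochastic (mini-batch) weight updates."""
    n = len(X)
    for _ in range(epochs):
        if mode == 'online':
            #Update after every single training sample
            for i in range(n):
                weights -= lr * compute_gradients(X[i:i+1], y[i:i+1], weights)
        elif mode == 'batch':
            #Accumulate gradients over the whole training set, update once
            weights -= lr * compute_gradients(X, y, weights)
        else:
            #Mini-batch: update over small, randomly chosen subsets
            order = np.random.permutation(n)
            for start in range(0, n, batch_size):
                idx = order[start:start + batch_size]
                weights -= lr * compute_gradients(X[idx], y[idx], weights)
    return weights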

EDIT 14/01/2016

Xavier’s initialization

Usually, the weights in a Neural Network are initialized to random values prior to training. This is okay if your network is shallow. But for deep learning applications, where the number of hidden layers may be 3 or more, a wrong selection of initial weights can cause convergence to sub-optimal models.

For example, assume that you are using a sigmoid activation function and your initial weights are pretty small (in the 0-1 range). If you manually calculate the activation values in a neural network across 1-2 layers, you will realize that almost all neurons in the topmost layer give activations around 0.5 (the mean value for the sigmoid function). This phenomenon is called ‘saturation’. It’s described well in this paper by Glorot and Bengio.

The solution to this problem is Xavier’s initialization. This technique tries to optimize the generation of initial weights by applying simple statistics over the number of input and output connections corresponding to a neuron. A good intuitive explanation of the method is given on Andy’s blog. As a note to my future self, here’s the TensorFlow implementation of Xavier’s algorithm.
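As a rough sketch of the idea (not the exact TensorFlow implementation), Xavier/Glorot initialization draws the initial weights from a distribution whose spread depends on the number of input and output connections of a layer. The NumPy version below assumes the common uniform-limit form \sqrt{6/(n_{in} + n_{out})}:


import numpy as np

def xavier_init(n_in, n_out):
    """Glorot/Xavier uniform initialization for a weight matrix of shape
    (n_in, n_out) -- a sketch of the idea, not any library's exact code."""
    limit = np.sqrt(6.0 / (n_in + n_out))
    return np.random.uniform(-limit, limit, size=(n_in, n_out))

#Example: weights for a layer with 784 inputs and 256 hidden neurons
W = xavier_init(784, 256)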

Generating rudimentary Mind-Maps from Word2Vec models

Mind Maps are well known as a very powerful organizational tool for a variety of tasks, such as brainstorming, planning and problem solving. Visual (or rather, graphical) arrangement of ideas aids the thought process, and in fact mimics the way we explore our mental knowledge base while ‘thinking’. There are a lot of online tools available for drawing mind maps, but none that can generate one. By generate, I mean coming up with the verbal content.

For the longest time (almost 8 months now), I have been tinkering with ways to combine text mining and graph theory into a framework that generates a Mind-Map from a given text document. Of course, the first argument would be that there cannot be a single possible Mind-Map for any block of text. And it’s true! However, having such an automated Map as a reference while building your own might give you more insights (especially while brainstorming), or help you remember links that you might otherwise miss (for studying). Let’s see what a Mind-Map looks like:

[Image: an example Mind-Map (laws_mindmap)]

Two points:

i. A Mind-Map is NOT a tree that divides the overall topic into its subtopics recursively. It’s in fact more like a graph that links terms which are semantically related.

ii. Just as ‘Night’ might make the word ‘Day’ pop up in your mind, a Mind-Map may have links between concepts with opposite meanings, like Thicker-Thinner in the above example.

There are of course other points, like using images to enhance concepts, and so on. But that’s not the point of this post (and I suck at designer-style creativity anyway). Here’s an article to help you get acquainted with the process of building and using your own Mind-Maps, just in case.

In my last post, I described a method to generate a Word2Vec model from a text document (where I used Wikipedia articles as an example). I will now describe the methodology I followed to generate a rudimentary Mind-Map from a Wikipedia article’s Word2Vec model.

Step 1: Figuring out the top n terms from the document

(As I mentioned in my previous post, I only use stemmed unigrams. You can of course use higher-order n-grams, which makes things a little trickier, but more accurate if your algorithms for n-gram generation are solid.)

Here, n denotes the number of ‘nodes’ in my Mind-Map. In my trial-and-error, 50 is usually a good enough number. Too few means too little information, and too many would mean noisy mind-maps. You can obviously play around with different choices of n. I use the co-occurrence based technique described in this paper to list out the top n words in a document. Here’s the Python code for it:


def _get_param_matrices(vocabulary, sentence_terms):
    """
    Returns
    =======
    1. Top 300 (or fewer, if the vocab is short) most frequent terms (list)
    2. Co-occurrence matrix wrt the most frequent terms (dict)
    3. Dict containing Pg of the most frequent terms (dict)
    4. nw (number of terms affected) for each term (dict)
    """

    #Figure out top n terms with respect to mere occurrences
    n = min(300, len(vocabulary))
    topterms = list(vocabulary.keys())
    topterms.sort(key = lambda x: vocabulary[x], reverse = True)
    topterms = topterms[:n]

    #nw maps term to the number of terms it 'affects'
    #(sum of number of terms in all sentences it
    #appears in)
    nw = {}
    #Co-occurrence values are wrt top terms only
    co_occur = {}
    #Initially, the co-occurrence matrix is empty
    for x in vocabulary:
        co_occur[x] = [0 for i in range(len(topterms))]

    #Iterate over the vocabulary dictionaries of all sentences
    #to build the co-occurrence matrix
    for sentence in sentence_terms:
        total_terms = sum(list(sentence.values()))
        #This list contains the indices of all terms from topterms,
        #that are present in this sentence
        top_indices = []
        #Populate top_indices
        top_indices = [topterms.index(x) for x in sentence
                       if x in topterms]
        #Update the nw dict and the co-occurrence matrix
        for term in sentence:
            nw[term] = nw.get(term, 0) + total_terms
            for index in top_indices:
                co_occur[term][index] += (sentence[term] *
                                          sentence[topterms[index]])

    #Pg is just nw[term]/total vocabulary of text
    Pg = {}
    N = sum(list(vocabulary.values()))
    for x in topterms:
        Pg[x] = float(nw[x])/N

    return topterms, co_occur, Pg, nw


def get_top_n_terms(vocabulary, sentence_terms, n=50):
    """
    Returns the top 'n' terms from a block of text, in the form of a list,
    from most important to least.

    'vocabulary' should be a dict mapping each term to the number
    of its occurrences in the entire text.
    'sentence_terms' should be an iterable of dicts, each denoting the
    vocabulary of the corresponding sentence.
    """

    #First compute the matrices
    topterms, co_occur, Pg, nw = _get_param_matrices(vocabulary,
                                                     sentence_terms)

    #This dict will map each term to its weightage with respect to the
    #document
    result = {}

    #Iterate over all terms in the vocabulary
    for term in co_occur:
        result[term] = 0
        for x in Pg:
            #expected_cooccur is the expected co-occurrence of 'term' with
            #'x', based on the nw value of 'term' and the Pg value of 'x'
            expected_cooccur = nw[term] * Pg[x]
            #Accumulate the (chi-square-like) deviation of the actual
            #co-occurrence from the expected co-occurrence
            result[term] += ((co_occur[term][topterms.index(x)] -
                              expected_cooccur)**2 / float(expected_cooccur))

    terms = list(result.keys())
    terms.sort(key=lambda x: result[x],
               reverse=True)

    return terms[:n]

The get_top_n_terms function does the job, and I guess the docstrings and in-line comments (combined with the paper, of course) explain how. If you have the patience and time, you can in fact just go through the entire vocabulary of your Word2Vec model and pick out the terms you want to see in your Mind-Map. This is likely to give you the best results (but with a lot of effort).
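For reference, here is one hypothetical way of building the vocabulary and sentence_terms inputs and calling the function. The stemmed_sentences list below is a made-up toy example; in practice it would be the same iterable of stemmed sentences you later feed to Word2Vec:


from collections import Counter

#'stemmed_sentences' is assumed to be a list of lists of stemmed terms
stemmed_sentences = [['machin', 'learn', 'studi', 'algorithm'],
                     ['algorithm', 'learn', 'from', 'data']]

#Per-sentence vocabularies (term -> count within that sentence)
sentence_terms = [dict(Counter(sentence)) for sentence in stemmed_sentences]

#Overall vocabulary (term -> count across the whole text)
vocabulary = dict(Counter(term for sentence in stemmed_sentences
                          for term in sentence))

nodes = get_top_n_terms(vocabulary, sentence_terms, n=50)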

Step 2: Deciding the Root

The Root would be that term among your nodes which denotes the central idea behind your entire Mind-Map. Since the number of nodes is pretty small compared to the vocabulary, it’s best to pick this one out manually. Or, you could use the term that has the highest occurrence count in the vocabulary among the selected nodes. This step may require some trial and error (but then what part of data science doesn’t?).
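If you go with the frequency-based option, a one-liner along these lines does the trick (assuming the vocabulary dict and nodes list from the previous step, with those hypothetical names):


#Pick the node with the highest raw occurrence count as the root
root = max(nodes, key=lambda term: vocabulary.get(term, 0))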

Step 3: Generating the graph (Mind-Map)

This is of course the most crucial step, and the one I spent the most time on. First off, let me define what I call the contextual vector of a term.

Say the root of the Mind-Map is ‘computer’. It is linked to the term ‘hardware’, which is in turn linked to ‘keyboard’. The Word2Vec vector of ‘keyboard’ would be obtained as model['keyboard'] in the Python/Gensim environment. Let’s denote it by v_{keyboard}.

Now consider yourself in the position of someone building a Mind-Map. When you think of ‘keyboard’, given the structure of what you have come up with so far, you will be thinking of it in the context of ‘computer’ and ‘hardware’. That’s why you probably won’t link ‘keyboard’ to ‘music’ (at least not directly). This basically shows that the contextual vector for ‘keyboard’ (let’s call it v'_{keyboard}) must be biased in its direction (since we use cosine similarity with Word2Vec models, only directions matter) towards v_{computer} and v_{hardware}. Moreover, intuitively speaking, the influence of v_{hardware} on v'_{keyboard} should be greater than that of v_{computer} – in essence, the influence of a parent’s context reduces as you go further and further away from it. To take this into account, I use what I call the contextual decay factor \alpha. Expressing it mathematically,

v'_{computer} = v_{computer}

v'_{hardware} = (1 - \alpha) v_{hardware} + \alpha v'_{computer}

v'_{keyboard} = (1 - \alpha) v_{keyboard} + \alpha v'_{hardware}

And so on…
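In code, the recurrence is just a running blend of each node’s Word2Vec vector with its parent’s contextual vector. A minimal sketch, assuming model is a trained Word2Vec instance that actually contains these (unstemmed, for readability) terms:


alpha = 0.2   #contextual decay factor

#Contextual vectors along the path computer -> hardware -> keyboard
v_computer = model['computer']
v_hardware = (1 - alpha) * model['hardware'] + alpha * v_computer
v_keyboard = (1 - alpha) * model['keyboard'] + alpha * v_hardware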

Finally, to generate the actual Mind-Map, here’s the algorithm I use (I hope the inline comments are enough to let you know what I have done):


from scipy.spatial.distance import cosine
from networkx import Graph

def build_mind_map(model, stemmer, root, nodes, alpha=0.2):
    """
    Returns the Mind-Map in the form of a NetworkX Graph instance.

    'model' should be an instance of gensim.models.Word2Vec
    'nodes' should be a list of terms, included in the vocabulary of
    'model'.
    'root' should be the node that is to be used as the root of the Mind
    Map graph.
    'stemmer' should be an instance of StemmingHelper.
    """

    #This will be the Mind-Map
    g = Graph()

    #Ensure that every node is in the vocabulary of the Word2Vec
    #model, and that the root itself is included in the given nodes
    for node in nodes:
        if node not in model.vocab:
            raise ValueError(node + " not in model's vocabulary")
    if root not in nodes:
        raise ValueError("root not in nodes")

    ##Containers for algorithm run
    #Initially, all nodes are unvisited
    unvisited_nodes = set(nodes)
    #Initially, no nodes are visited
    visited_nodes = set([])
    #The following will map visited node to its contextual vector
    visited_node_vectors = {}
    #The following will map each unvisited node to (closest_distance, parent)
    #parent will obviously be a visited node
    node_distances = {}

    #Initialization with respect to root
    current_node = root
    visited_node_vectors[root] = model[root]
    unvisited_nodes.remove(root)
    visited_nodes.add(root)

    #Build the Mind-Map in n-1 iterations
    for i in range(1, len(nodes)):
        #For every unvisited node 'x'
        for x in unvisited_nodes:
            #Compute contextual distance between current node and x
            dist_from_current = cosine(visited_node_vectors[current_node],
                                       model[x])
            #Get the least contextual distance to x found until now
            distance = node_distances.get(x, (100, ''))
            #If current node provides a shorter path to x, update x's
            #distance and parent information
            if distance[0] > dist_from_current:
                node_distances[x] = (dist_from_current, current_node)

        #Choose next 'current' as that unvisited node, which has the
        #lowest contextual distance from any of the visited nodes
        next_node = min(unvisited_nodes,
                        key=lambda x: node_distances[x][0])

        ##Update all containers
        parent = node_distances[next_node][1]
        del node_distances[next_node]
        next_node_vect = ((1 - alpha)*model[next_node] +
                          alpha*visited_node_vectors[parent])
        visited_node_vectors[next_node] = next_node_vect
        unvisited_nodes.remove(next_node)
        visited_nodes.add(next_node)

        #Add the link between newly selected node and its parent(from the
        #visited nodes) to the NetworkX Graph instance
        g.add_edge(stemmer.original_form(parent).capitalize(),
                   stemmer.original_form(next_node).capitalize())

        #The new node becomes the current node for the next iteration
        current_node = next_node

    return g

Notes: I use NetworkX’s simple Graph-building infrastructure to do the core job of maintaining the Mind-Map (it also makes visualization easier later). To compute the cosine distance, I use SciPy. Moreover, when adding an edge to the graph, I use the StemmingHelper class from my last post so that the stemmed words appear in their original form in the actual Mind-Map (instead of their stemmed versions). You can pass the StemmingHelper class directly as the stemmer parameter. On the other hand, if you aren’t using stemming at all, just drop the stemmer parameter and the original_form calls from the code.

If you look at the algorithm, you will realize that it’s essentially Prim’s algorithm for growing a minimum spanning tree (closely related in structure to Dijkstra’s single-source shortest-path algorithm), just applied to contextual distances instead of edge weights.
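Putting the pieces together, a hypothetical end-to-end call would look something like this (assuming a trained model, the StemmingHelper from my previous post, and the nodes/root values from the earlier sketches; the visualization part needs pygraphviz and Graphviz installed):


from networkx.drawing.nx_agraph import to_agraph

#Build the Mind-Map graph
mind_map = build_mind_map(model, StemmingHelper, root=root, nodes=nodes,
                          alpha=0.2)

#Quick-and-dirty visualization via PyGraphViz
agraph = to_agraph(mind_map)
agraph.layout(prog='dot')
agraph.draw('mind_map.png')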

Example Outputs

Now for the results. (I used PyGraphViz for simple, quick-and-dirty visualization)

Here’s the Mind-Map that was generated for the Wikipedia article on Machine Learning:

[Image: Mind-Map for the Machine Learning article]

One on Artificial Intelligence:

[Image: Mind-Map for the Artificial Intelligence article]

And finally, one on Psychology:

[Image: Mind-Map for the Psychology article]

The results are similar on all the other topics I tried out. A few things I noticed:

i. I should try and involve bi-grams and trigrams too. I am pretty sure it will make the Word2Vec model itself way stronger, and thus improve the interpretation of terms with respect to the document.

ii. There are some unnecessary terms in the Mind-Maps, but given the short length of the texts (compared to most text mining tasks), the keyword extraction algorithm from the paper I mentioned before seems really good.

iii. Maybe one could use this for brainstorming: you start out with a term (node) of your choice, then the framework suggests terms to connect it to. Once you select one of them, you get further recommendations based on the context, and so on – something like a Mind-Map helper.

Anyways, this was a long post, and thanks a lot if you stuck to the end :-). Cheers!

Generating a Word2Vec model from a block of Text using Gensim (Python)


Word2Vec is a semantic learning framework that uses a shallow neural network to learn the representations of words/phrases in a particular text. Simply put, it’s an algorithm that takes in all the terms (with repetitions) in a particular document, divided into sentences, and outputs a vectorial form of each. The ‘advantage’ Word2Vec offers is in its utilization of a neural model to understand the semantic meaning behind those terms. For example, a document may employ the words ‘dog’ and ‘canine’ to mean the same thing, but never use them together in a sentence. Ideally, Word2Vec would be able to learn the context and place them together in its semantic space. Most applications of Word2Vec use cosine similarity to quantify closeness. This Quora question (or rather its answers) does a good job of explaining the intuition behind it.

You would need to take the following steps to develop a Word2Vec model from a block of text (Usually, documents that are extensive and yet stick to the topic of interest with minimum ambiguity do well):

[I use Gensim’s Word2Vec API in Python to form Word2Vec models of Wikipedia articles.]

1. Obtain the text (obviously)

To obtain the Wikipedia articles, I use the Python wikipedia library. Once it’s installed from the link, here’s how you could use it to obtain all the text from an article:


#'title' denotes the exact title of the article to be fetched
title = "Machine learning"
from wikipedia import page
wikipage = page(title)

You could then use wikipage.content to access the entire textual content in the form of a String. Now, in case you don’t have the exact title and want to do a search, you would do:


from wikipedia import search, page
titles = search('machine learning')
wikipage = page(titles[0])

[Tip: Store the content into a file and access it from there. This would provide you a reference later, if needed.]
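For instance, something as simple as this saves a local copy you can re-read later (the file name is an arbitrary choice of mine):


import io

#Save the article text for later reuse
with io.open('machine_learning_wiki.txt', 'w', encoding='utf-8') as f:
    f.write(wikipage.content)

#...and read it back whenever needed
with io.open('machine_learning_wiki.txt', encoding='utf-8') as f:
    text = f.read()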

2. Preprocess the text

In the context of Python, you would require an iterable that yields one iterable for each sentence in the text. The inner iterable would contain the terms in that particular sentence. A ‘term’ could be an individual word like ‘machine’, or a phrase (n-gram) like ‘machine learning’, or a combination of both. Coming up with appropriate bigrams/trigrams is a tricky task on its own, so I just stick to unigrams.

First of all, I remove all special characters and short lines from the article, to eliminate noise. Then, I use Porter Stemming on my unigrams, using a ‘wrapper’ around Gensim’s stemming API.


from gensim.parsing import PorterStemmer
global_stemmer = PorterStemmer()

class StemmingHelper(object):
    """
    Class to aid the stemming process - from word to stemmed form,
    and vice versa.
    The 'original' form of a stemmed word will be returned as the
    form in which it's been used the most number of times in the text.
    """

    #This reverse lookup will remember the original forms of the stemmed
    #words
    word_lookup = {}

    @classmethod
    def stem(cls, word):
        """
        Stems a word and updates the reverse lookup.
        """

        #Stem the word
        stemmed = global_stemmer.stem(word)

        #Update the word lookup
        if stemmed not in cls.word_lookup:
            cls.word_lookup[stemmed] = {}
        cls.word_lookup[stemmed][word] = (
            cls.word_lookup[stemmed].get(word, 0) + 1)

        return stemmed

    @classmethod
    def original_form(cls, word):
        """
        Returns original form of a word given the stemmed version,
        as stored in the word lookup.
        """

        if word in cls.word_lookup:
            return max(cls.word_lookup[word].keys(),
                       key=lambda x: cls.word_lookup[word][x])
        else:
            return word

Refer to the code and docstrings to understand how it works (it’s pretty simple anyway). It can be used as follows:


>>> StemmingHelper.stem('learning')
'learn'
>>> StemmingHelper.original_form('learn')
'learning'

Pre-stemming, you could also use a list of stopwords to eliminate terms that occur frequently in the English language, but don’t carry much semantic meaning.

After your pre-processing, let’s assume you come up with an iterable called sentences from your source of text.
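To make that concrete, here is one rough way (among many) to go from the raw article text to such a sentences iterable, using the StemmingHelper above. It assumes text holds the article content from step 1; the regex-based cleaning and the minimum line length are just illustrative choices of mine:


import re

def preprocess(text, min_line_length=20):
    """Converts raw article text into a list of lists of stemmed unigrams,
    suitable as the 'sentences' input to gensim's Word2Vec."""
    sentences = []
    for line in text.split('\n'):
        #Skip short/noisy lines
        if len(line) < min_line_length:
            continue
        #Crude sentence splitting on full stops
        for raw_sentence in line.split('.'):
            #Keep only letters and whitespace, lowercase everything
            cleaned = re.sub(r'[^a-zA-Z\s]', ' ', raw_sentence).lower()
            tokens = [StemmingHelper.stem(w) for w in cleaned.split()
                      if len(w) > 1]
            if tokens:
                sentences.append(tokens)
    return sentences

sentences = preprocess(text)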

3. Figure out the values for your numerical parameters

Gensim’s Word2Vec API requires some parameters for initialization. Of course they have default values, but you may want to define some on your own:

i. size – Denotes the number of dimensions in the vectorial forms. If you have read the document and have an idea of how many ‘topics’ it covers, you can use that number. For sizeable blocks of text, people use 100-200. I use around 50 for the Wikipedia articles. Usually, you would want to repeat the initialization for different numbers of topics in a certain range, and pick the one that yields the best results for your application (I will be using the vectors to build Mind-Maps, and I usually have to try values from 20-100). A good heuristic that’s frequently used is the square root of the length of the vocabulary, after pre-processing.

ii. min_count – Terms that occur less than min_count number of times are ignored in the calculations. This reduces noise in the semantic space. I use 2 for Wikipedia. Usually, the bigger and more extensive your text, the higher this number can be.

iii. window – Only terms that occur within a window-neighbourhood of a term, in a sentence, are associated with it during training. The usual value is 4. Unless your text contains big sentences, leave it at that.

iv. sg – This defines the algorithm. If equal to 1, the skip-gram technique is used; else, the CBoW method is employed (look at the aforementioned Quora answers). I usually use the default (1).

4. Initialize the model and use it

The model can be generated using Gensim’s API, as follows:


from gensim.models import Word2Vec
min_count = 2
size = 50
window = 4

model = Word2Vec(sentences, min_count=min_count, size=size, window=window)

Now that you have the model initialized, you can access all the terms in its vocabulary using something like list(model.vocab.keys()). To get the vectorial representation of a particular term, use model[term]. If you have used my stemming wrapper, you can find the appropriate original form of a stemmed term using StemmingHelper.original_form(term). Here’s an example, from the Wiki article on Machine learning:


>>> vocab = list(model.vocab.keys())
>>> vocab[:10]
[u'represent', u'concept', u'founder', u'focus', u'invent', u'signific', u'abil', u'implement', u'benevol', u'hierarch']
>>> 'learn' in model.vocab
True
>>> model['learn']
array([  1.23792759e-03,   5.49776992e-03,   2.18261080e-03,
         8.37465748e-03,  -6.10323064e-03,  -6.94877980e-03,
         6.29429379e-03,  -7.06598908e-03,  -7.16267806e-03,
        -2.78065586e-03,   7.40372669e-03,   9.68673080e-03,
        -4.75220988e-03,  -8.34807567e-03,   5.25208283e-03,
         8.43616109e-03,  -1.07231298e-02,  -3.88528360e-03,
        -9.20894090e-03,   4.17305576e-03,   1.90116244e-03,
        -1.92442467e-03,   2.74807960e-03,  -1.01113841e-02,
        -3.71694425e-03,  -6.60350174e-03,  -5.90716442e-03,
         3.90679482e-03,  -5.32188127e-03,   5.63300075e-03,
        -5.52612450e-03,  -5.57334488e-03,  -8.51202477e-03,
        -8.78736563e-03,   6.41061319e-03,   6.64879987e-03,
        -3.55080629e-05,   4.81080823e-03,  -7.11903954e-03,
         9.83678619e-04,   1.60697231e-03,   7.42980337e-04,
        -2.12235347e-04,  -8.05167668e-03,   4.08948492e-03,
        -5.48054813e-04,   8.55423324e-03,  -7.08682090e-03,
         1.57684216e-03,   6.79725129e-03], dtype=float32)
>>> StemmingHelper.original_form('learn')
u'learning'
>>> StemmingHelper.original_form('hierarch')
u'hierarchical'

As you might have guessed, the vectors are NumPy arrays, and support all their functionality. Now, to compute the cosine similarity between two terms, use the similarity method. Cosine similarity is generally bounded by [-1, 1]. The corresponding ‘distance’ can be measured as 1-similarity. To figure out the terms most similar to a particular one, you can use the most_similar method.


>>> model.most_similar(StemmingHelper.stem('classification'))
[(u'spam', 0.25190210342407227), (u'metric', 0.22569453716278076), (u'supervis', 0.19861873984336853), (u'decis', 0.18607790768146515), (u'inform', 0.17607420682907104), (u'artifici', 0.16593246161937714), (u'previous', 0.16366994380950928), (u'train', 0.15940310060977936), (u'network', 0.14765430986881256), (u'term', 0.14321796596050262)]
>>> model.similarity(StemmingHelper.stem('classification'), 'supervis')
0.19861870268896875
>>> model.similarity('unsupervis', 'supervis')
-0.11546791800661522

There’s a ton of other functionality that’s supported by the class, so you should have a look at the API I gave a link to. Happy topic modelling 🙂