# Weekly Review: 12/23/2017

Happy Holidays people! If you live in the Bay Area then the next week is probably your time off, so I hope you have fun and enjoy the holiday season! As for Robotics, I just finished Week 2 of Perception, and will probably kick off Week 3 in 2018. I am excited for the last ‘real’ course (Estimation & Learning), and then building my own robot as part of the ‘Capstone’ project after that :-D.

This week’s articles:

XGBoost

I only recently came across XGBoost (eXtreme Gradient Boosting), an improvement over standard Gradient Boosting – which is actually a shame, considering how popular this method is in Data Science. If you are rusty on ensemble learning, take a look at this article on bagging/Random Forests, and my own intro to Boosting.

XGBoost is one of the most efficient implementations of Gradient Boosting, and apparently works really well on structured/tabular data. It also provides features such as sparsity-awareness (being able to handle missing values), and the ability to update models with ‘continued training’. Its effectiveness on tabular data has made it very popular with Kaggle winners, with one of them saying: “When in doubt, use xgboost”!

Take a look at the original paper to dig deeper.

Quantum Computing + Machine Learning

A lot of companies, such as Google and Microsoft, have recently shown interest in the domain of Quantum Computing. Rigetti is a startup that aims to rival these juggernauts with its cloud-based Quantum Computing platform (called Forest). They even have their own Python integration!

The article in question details their efforts to prototype simple clustering with quantum computing. It is still pretty crude, and is by no means a replacement for traditional systems – for now. One of the major criticisms is that “Applying Quantum Computing to Machine Learning will only make a black-box system more difficult to understand”. This is in fact true, but the author suggests that ML could actually help us understand the behavior of Quantum Computers by modelling them!

Breaking a CAPTCHA with ML

A simple, easy-to-read, fun article on how you could break the simplest CAPTCHA algorithms with CV+Deep Learning.

Learning Indexing Structures with ML

Indexing structures are essentially data structures meant for efficient data access. For example, a B-Tree Index is used for efficient range-queries, a Hash-table is used for fast key-based access, etc. However, all of these data structures are pretty rigid in their behavior – they do not fine-tune/change their parameters based on the structure of the data.

This paper (that includes the Google legend Jeff Dean as an author) explores the possibility of using Neural Networks (in fact, a hierarchy of them) as indexing structures. Basically, you would use a Neural Network to compute the function – f: data -> hash/position.

Some key takeaways from the paper:

1. Range Index models essentially ‘learn’ a cumulative distribution function.
2. The overall ‘learned index’ proposed by this paper is a hierarchy of models (but not a tree, since two models at a certain layer can point to the same model in the next layer).
    1. As you go down the layers, the models deal with smaller and smaller subsets of the data.
3. Unlike a B-Tree, there is no ‘search’ involved, since each model predicts the next model for hash generation.
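To make the first takeaway concrete, here is a minimal sketch of a range index that ‘learns’ the cumulative distribution of its keys. It is not the paper's method: a single least-squares line stands in for the hierarchy of neural networks, the names `fit_linear` and `LearnedIndex` are made up for illustration, and the small bounded search around the predicted position is an assumption added here so that lookups stay exact even when the model's guess is slightly off.

```python
import bisect


def fit_linear(keys):
    """Least-squares fit of position ~ slope * key + intercept over sorted keys."""
    n = len(keys)
    mean_key = sum(keys) / n
    mean_pos = (n - 1) / 2
    var = sum((k - mean_key) ** 2 for k in keys)
    cov = sum((k - mean_key) * (i - mean_pos) for i, k in enumerate(keys))
    slope = cov / var if var else 0.0
    return slope, mean_pos - slope * mean_key


class LearnedIndex:
    """Toy range index: a model maps key -> approximate position in a sorted array."""

    def __init__(self, keys):
        self.keys = sorted(keys)
        # The fitted line approximates (scaled) CDF(key) = position / n.
        self.slope, self.intercept = fit_linear(self.keys)
        # Worst prediction error over the stored keys, used to bound the local search.
        self.max_err = max(abs(self._predict(k) - i) for i, k in enumerate(self.keys))

    def _predict(self, key):
        pos = int(round(self.slope * key + self.intercept))
        return min(max(pos, 0), len(self.keys) - 1)

    def lookup(self, key):
        """Return the position of key, or None if it is not stored."""
        guess = self._predict(key)
        lo = max(guess - self.max_err, 0)
        hi = min(guess + self.max_err + 1, len(self.keys))
        i = bisect.bisect_left(self.keys, key, lo, hi)
        return i if i < hi and self.keys[i] == key else None
```

The better the model fits the key distribution, the smaller `max_err` becomes, and the closer a lookup gets to a single arithmetic evaluation.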

Tacotron 2

This post on the Google Research blog details the development of a WaveNet-like framework to generate Human Speech from text.

# Weekly Review: 11/18/2017

I finished the Motion Planning course from Robotics this week. It went as expected, since the material was quite in line with the data structures and algorithms that I studied during my undergrad. The next one, Mobility, seems to be a notch tougher than Aerial Robotics, mainly because of the focus on calculus and physics (neither of which I have touched heavily in years).

Here are the articles for this week:

Neural Networks: Software 2.0

In this article from Medium, the Director of AI at Tesla gives a fresh perspective on NNs. He refers to the set of weights in a Neural Network as a program which is learnt, as opposed to coded in by a human. This line of thought is justified by the fact that many decisions in Robotics, Search, etc. are taken by parametric ML systems. He also compares it to traditional ‘Software 1.0’, and points out the benefits of each.

Baselines in Machine Learning

In this article, a senior Research Scientist from Salesforce points out that we need to pay greater attention to baselines in Machine Learning. A baseline is any meaningful ‘benchmark’ algorithm that you would compare your algorithm against. The actual reference point would depend on your task – random/stratified systems for classification, state-of-the-art CNNs for image processing, etc. Read Neal’s answer to this Quora question for a deeper understanding.

The article ends with a couple of helpful tips, such as:

1. Use meaningful baselines, instead of using very crude code. The better your baseline, the more meaningful your results.
2. Start off with optimizing the baseline itself. Tune the weights, etc. if you have to – this gives you a good base to start your work on.
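As a small illustration of what a ‘random/stratified’ reference point for classification can look like, here is a minimal sketch of two baselines that ignore the input features entirely. The function names and the toy labels are hypothetical and not taken from the article.

```python
import random
from collections import Counter


def majority_baseline(train_labels):
    """Always predict the most frequent training label."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return lambda _x: most_common


def stratified_baseline(train_labels, seed=0):
    """Predict a random label, drawn with the training class frequencies."""
    rng = random.Random(seed)
    labels, weights = zip(*Counter(train_labels).items())
    return lambda _x: rng.choices(labels, weights=weights, k=1)[0]


def accuracy(predict, xs, ys):
    """Fraction of examples for which predict(x) matches the true label."""
    return sum(predict(x) == y for x, y in zip(xs, ys)) / len(ys)


# Toy data: the features are irrelevant to these baselines, so None is used.
train_labels = ["spam"] * 2 + ["ham"] * 8
test_xs = [None] * 4
test_ys = ["ham", "ham", "spam", "ham"]
baseline_acc = accuracy(majority_baseline(train_labels), test_xs, test_ys)
```

Any model you build for this toy task would have to beat `baseline_acc` before its accuracy means anything.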

TensorFlow Lite

TensorFlow Lite is now in Developer Preview. It is a light-weight platform for inference (not training) using ML models on mobile/embedded devices. Google calls it an ‘evolution of TensorFlow Mobile’. While the latter is still the system you should use in production, TensorFlow Lite appears to perform better on many benchmarks (Differences here). Some of the major plus-points of this new platform are smaller binaries, and support for custom ML-focussed hardware accelerators via the Android Neural Networks API.

Flatbuffers

Reading up on TensorFlow Lite also brought me to FlatBuffers, which are a ‘lighter’ version of Protobufs. FlatBuffers is a data serialization library for performance-critical applications. It provides the benefits of a smaller memory footprint and less generated code, mainly because it skips the parsing/unpacking step. Here’s the GitHub repo.

This YCombinator article gives a nice overview of Adversarial attacks on ML models – attacks that provide ‘noisy’ data inputs to intelligent systems, in order to get a ‘wrong’ output. The author points out how Gradient descent can be used to sort-of reverse engineer spurious noise, in order to get data ‘misclassified’ by a neural network. The article also shows examples of such faulty inputs, and they are surprisingly indistinguishable from the original data!
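To give a feel for how gradients can be used to construct such inputs, here is a minimal sketch on a tiny logistic-regression ‘network’. It uses a single gradient-sign step (in the spirit of the fast gradient sign method) rather than full gradient descent, and the weights, input and `epsilon` are made-up values chosen for illustration, not numbers from the article.

```python
import math

# Hypothetical parameters of an already-trained binary classifier.
WEIGHTS = [2.0, -3.0, 1.0]
BIAS = 0.5


def predict_proba(x):
    """Probability of class 1 under a logistic model."""
    z = sum(w * xi for w, xi in zip(WEIGHTS, x)) + BIAS
    return 1.0 / (1.0 + math.exp(-z))


def sign(v):
    return (v > 0) - (v < 0)


def adversarial_example(x, true_label, epsilon):
    """Nudge every feature by +/- epsilon in the direction that increases the loss."""
    p = predict_proba(x)
    # Gradient of the cross-entropy loss with respect to the input is (p - y) * w.
    grad = [(p - true_label) * w for w in WEIGHTS]
    return [xi + epsilon * sign(g) for xi, g in zip(x, grad)]


x = [0.6, 0.2, 0.1]  # classified as class 1
x_adv = adversarial_example(x, true_label=1, epsilon=0.25)
```

With only three features the perturbation has to be fairly large to flip the decision; in high-dimensional inputs such as images, many tiny per-pixel changes add up, which is why the perturbed inputs can look unchanged to a human.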

# Weekly Review: 11/04/2017

A busy week. I finished my Aerial Robotics course! The next in the Specialization is Computational Motion Planning, which I am more excited about – mainly because the curriculum goes more towards my areas of expertise. Aerial Robotics was challenging primarily because I was doing a lot of physics/calculus, which I had not attempted in a long time.

Onto the articles for this week:

Colab is now public!

Google made Colaboratory, a previously internal tool, public. ‘Colab’ is a document-collaboration tool, with the added benefit of being able to run script-sized pieces of code. This is especially useful if you want to prototype small proofs-of-concept, which can then be shared with documentation and demo-able output. I had previously used it within Google to tinker with TensorFlow, and write small scripts for database queries.

Visual Guide to Evolution Strategies

The above link is a great introduction to Evolutionary Strategies such as GAs and CMA-ES. It shows a visual representation of how each of these algorithms converges on the optimum from the first iteration to the last on simple problems. It’s pretty interesting to see how each algorithm ‘broadens’ or ‘focuses’ the domain of its candidate solutions as iterations go by.

Baidu’s Deep Voice

In a 2-part series (Part 1 & Part 2), the author discusses the architecture of Baidu’s Text-to-Speech system (Deep Voice). Take a look if you have never read about/worked on such systems and want to have a general idea of how they are trained and deployed.

Capsule Networks

Geoff Hinton and his team at Google recently discussed the idea of Capsule networks, which try and remedy the rigidity in usual CNNs – by defining groups of specialized neurons called ‘capsules’ whose contribution to higher-level neurons is decided by the similarity of output. Here’s a small intro on Capsule Networks, or the original paper if you wanna delve deeper.

Nexar Challenge Results

Nexar released the results of its Deep-Learning challenge on Image segmentation – the problem of ‘boxing’ and ‘tagging’ objects in pictures with multiple entities present. This is especially useful in their own AI-dashboard apps, which need to be quite accurate to prevent possible collisions in deployment.

As further reading, you could also check out this article on the history of CNNs in Image Segmentation, another one on Region-of-Interest Pooling in CNNs, and Deformable Neural Networks. (All of these concepts are mentioned in the main Nexar article)

# An introduction to Bayesian Belief Networks

A Bayesian Belief Network (BBN), or simply Bayesian Network, is a statistical model used to describe the conditional dependencies between different random variables.

BBNs are chiefly used in areas like computational biology and medicine for risk analysis and decision support (basically, to understand what caused a certain problem, or the probabilities of different effects given an action).

### Structure of a Bayesian Network

A typical BBN looks something like this:

The example shown, ‘Burglary-Alarm’, is one of the most quoted ones in texts on Bayesian theory. Let’s look at the structural characteristics one by one. We will delve into the numbers/tables later.

#### Directed Acyclic Graph (DAG)

We obviously have one node per random variable.

Directed: The connections/edges denote cause->effect relationships between pairs of nodes. For example Burglary->Alarm in the above network indicates that the occurrence of a burglary directly affects the probability of the Alarm going off (and not the other way round). Here, Burglary is the parent, while Alarm is the child node.

Acyclic: There cannot be a cycle in a BBN. In simple English, a variable $A$ cannot depend on its own value – directly, or indirectly. If this were allowed, it would lead to a sort of infinite recursion which we are not prepared to deal with. However, if you do realize that an event happening affects its probability later on, then you could express the two occurrences as separate nodes in the BBN (or use a Dynamic BBN).

#### Parents of a Node

One of the biggest considerations while building a BBN is to decide which parents to assign to a particular node. Intuitively, they should be those variables which most directly affect the value of the current node.

Formally, this can be stated as follows: “The parents of a variable $X$ ($parents(X)$) are the minimal set of ancestors of $X$, such that all other ancestors of $X$ are conditionally independent of $X$ given $parents(X)$”.

Let’s take this step by step. First off, there has to be some sort of a cause-effect relationship between $Y$ and $X$ for $Y$ to be one of the ancestors of $X$. In the shown example, the ancestors of Mary Calls are Burglary, Earthquake and Alarm.

Now consider the two ancestors Alarm and Earthquake. The only way an Earthquake would affect Mary Calls, is if an Earthquake causes Alarm to go off, leading to Mary Calls. Suppose someone told you that Alarm has in fact gone off. In this case, it does not matter what led to the Alarm ringing – since Mary will react to it based on the stimulus of the Alarm itself. In other words, Earthquake and Mary Calls become conditionally independent if you know the exact value of Alarm.

Mathematically speaking, $P(Mary Calls|Alarm,Earthquake) = P(Mary Calls|Alarm)$.

Thus, $parents(X)$ are those ancestors which do not become conditionally independent of $X$ given the value of some other ancestor. If they do, then the resultant connection would actually be redundant.

#### Disconnected Nodes are Conditionally Independent

Based on the directed connections in a BBN, if there is no way to go from a variable $X$ to $Y$ (or vice versa), then $X$ and $Y$ are conditionally independent given their parents. In the example BBN, pairs of variables that are conditionally independent in this sense are {Mary Calls, John Calls} (given Alarm) and {Burglary, Earthquake} (which have no parents, so they are independent as long as nothing else is known).

It is important to remember that ‘conditionally independent’ does not mean ‘totally independent’. Consider {Mary Calls, John Calls}. Given the value of Alarm (that is, whether the Alarm went off or not), Mary and John each have their own independent probabilities of calling. However, if you did not know about any of the other nodes, but just that John did call, then your expectation of Mary calling would correspondingly increase.
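The difference between ‘conditionally independent’ and ‘totally independent’ can be checked numerically by brute-force enumeration over the network. The sketch below is only illustrative: the post's text gives just a few of the probabilities (for instance $P(E) = 0.002$), so the remaining table entries are assumed to be the commonly quoted textbook values for this example, and the helper names are made up.

```python
from itertools import product

# Assumed CPT values (standard textbook numbers for the Burglary-Alarm example).
P_B = 0.001                      # P(Burglary)
P_E = 0.002                      # P(Earthquake)
P_A = {(True, True): 0.95,       # P(Alarm | Burglary, Earthquake)
       (True, False): 0.94,
       (False, True): 0.29,
       (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}  # P(John Calls | Alarm)
P_M = {True: 0.70, False: 0.01}  # P(Mary Calls | Alarm)

VARS = ("b", "e", "a", "j", "m")


def bern(p, value):
    """Probability that a binary variable with P(true) = p takes this value."""
    return p if value else 1.0 - p


def joint(b, e, a, j, m):
    """Joint probability of one full assignment, factored along the DAG."""
    return (bern(P_B, b) * bern(P_E, e) * bern(P_A[(b, e)], a)
            * bern(P_J[a], j) * bern(P_M[a], m))


def prob(query, given=None):
    """P(query | given), summing the joint over all consistent assignments."""
    given = given or {}
    num = den = 0.0
    for values in product([True, False], repeat=len(VARS)):
        world = dict(zip(VARS, values))
        if any(world[k] != v for k, v in given.items()):
            continue
        p = joint(**world)
        den += p
        if all(world[k] == v for k, v in query.items()):
            num += p
    return num / den


p_mary = prob({"m": True})
p_mary_given_john = prob({"m": True}, {"j": True})
p_mary_given_alarm = prob({"m": True}, {"a": True})
p_mary_given_alarm_john = prob({"m": True}, {"a": True, "j": True})
```

Under these assumed numbers, `p_mary_given_john` comes out larger than `p_mary`, while `p_mary_given_alarm_john` equals `p_mary_given_alarm`: John's call is informative about Mary only until the state of the Alarm is known.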

### Mathematics behind Bayesian Networks

BBNs provide a mathematically correct way of assessing the effects of different events (or nodes, in this context) on each other. And these assessments can be made in either direction – not only can you compute the most likely effects given the values of certain causes, but also determine the most likely causes of observed events.

The numerical data provided with the BBN (by an expert or some statistical study) that allows us to do this is:

1. The prior probabilities of variables with no parents (Earthquake and Burglary in our example).
2. The conditional probabilities of any other node given every value-combination of its parent(s). For example, the table next to Alarm defines the probability that the Alarm will go off given whether an Earthquake and/or Burglary have occurred.

In case of continuous variables, we would need a conditional probability distribution.

The biggest use of Bayesian Networks is in computing revised probabilities. A revised probability defines the probability of a node given the values of one or more other nodes as a fact. Let’s take an example from the Burglary-Alarm BBN.

Suppose we want to calculate the probability that an earthquake occurred, given that the alarm went off, but there was no burglary. Essentially, we want $P(Earthquake|Alarm,\sim Burglary)$. Simplifying the nomenclature a bit, $P(E|A,\sim B)$.

Here, you can say that the Alarm going off ($A$) is evidence, the knowledge that the Burglary did not happen ($\sim B$) is context and the Earthquake occurring ($E$) is the hypothesis. Traditionally, if you knew nothing else, $P(E) = 0.002$, from the diagram. However, with the context and evidence in mind, this probability gets changed/revised. Hence, it’s called ‘computing revised probabilities’.

A version of Bayes’ Theorem states that

$P(X|YZ) = \frac{P(X|Z)P(Y|XZ)}{P(Y|Z)}$ …(1)

where $X$ is the hypothesis, $Y$ is the evidence, and $Z$ is the context. The numerator on the RHS denotes the probability that $X$ and $Y$ both occur given $Z$, which is a part of the probability that at least $Y$ occurs given $Z$, irrespective of $X$.

Using (1), we get

$P(E|A, \sim B) = \frac{P(E|\sim B)P(A|\sim B, E)}{P(A|\sim B)}$ …(2)

Since $E$ and $B$ are independent phenomena without knowledge of $A$,

$P(E|\sim B) = P(E) = 0.002$ …(3)

From the table for $A$,

$P(A|\sim B, E) = 0.29$ …(4)

Finally, using the Total Probability Theorem,

$P(A| \sim B) = P(E) P(A| E, \sim B) + P(\sim E) P(A| \sim E, \sim B)$ …(5)

Which is nothing but the average of $P(A| E, \sim B)$ and $P(A| \sim E, \sim B)$, weighted by $P(E)$ and $P(\sim E)$ respectively.

Substituting values in (5),

$P(A| \sim B) = 0.002 \times 0.29 + 0.998 \times 0.001 = 0.001578$ …(6)

From (2), (3), (4), & (6), we get

$P(E|A, \sim B) = \frac{0.002 \times 0.29}{0.001578} \approx 0.368$

As you can see, the probability of the Earthquake actually increases if you know that the Alarm went off but a Burglary was not the cause of it. This should make sense intuitively as well. Which brings us to the final part –

### The ‘Explain Away’ Effect

The Explain Away effect, commonly associated with BBNs, is a result of computing revised probabilities. It refers to the phenomenon where knowing that one cause has occurred, reduces (but does not eliminate) the probability that the other cause(s) took place.

Suppose instead of knowing that there has been no burglary like in our example, you in fact did know that one has taken place. It also led to the Alarm going off. With this information in mind, your tendency to check out the ‘earthquake’ hypothesis reduces drastically. In other words, the burglary has explained away the alarm.

It is important to note that the probability for other causes just gets reduced, but does NOT go down to zero. In a stroke of bad luck, it could have happened that both a burglary and an earthquake happened, and any one of the two stimuli could have led to the alarm ringing. To what extent you can ‘explain away’ a piece of evidence depends on the conditional probability distributions.
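The effect is easy to see numerically. The following sketch computes the probability of an Earthquake given the Alarm under three states of knowledge about the Burglary. Only $P(E) = 0.002$, $P(A|\sim B, E) = 0.29$ and $P(A|\sim B, \sim E) = 0.001$ appear in the text above; the other entries are assumed to be the usual textbook values for this network, and `posterior_earthquake` is an illustrative helper name.

```python
P_B = 0.001   # P(Burglary), assumed textbook value
P_E = 0.002   # P(Earthquake)
# P(Alarm | Burglary, Earthquake); the two Burglary=True rows are assumed values.
P_ALARM = {(True, True): 0.95, (True, False): 0.94,
           (False, True): 0.29, (False, False): 0.001}


def bern(p, value):
    """Probability that a binary variable with P(true) = p takes this value."""
    return p if value else 1.0 - p


def posterior_earthquake(burglary=None):
    """P(Earthquake | Alarm went off), optionally also given the Burglary state."""
    b_values = [True, False] if burglary is None else [burglary]
    num = den = 0.0
    for b in b_values:
        for e in (True, False):
            p = bern(P_B, b) * bern(P_E, e) * P_ALARM[(b, e)]
            den += p
            if e:
                num += p
    return num / den


p_alarm_only = posterior_earthquake()           # burglary unknown
p_no_burglary = posterior_earthquake(False)     # burglary ruled out
p_with_burglary = posterior_earthquake(True)    # burglary confirmed
```

With these assumed tables, confirming the burglary pushes the earthquake probability from roughly 0.23 down to about 0.002, which is small but not zero, while ruling the burglary out raises it to about 0.37.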

# Understanding the new Google Translate

Google launched a new version of Translate in September 2016. Since then, there have been a few interesting developments in the project, and this post attempts to explain it all in as simple terms as possible.

The earlier version of Translate used Phrase-based Machine Translation, or PBMT. What PBMT does is break up an input sentence into a set of words/phrases and translate each one individually. This is obviously not an optimal strategy, since it completely misses out on the context of the overall sentence. The new Translate uses what Google calls Google Neural Machine Translation (GNMT), an improvement over a traditional version of NMT. Let’s see how GNMT works at a high level:

### The Encoder

Before you understand the encoder, you must understand what an LSTM (Long Short-Term Memory) cell is. It is basically a Neural Network with some concept of memory. An LSTM is generally used to ‘learn’ patterns in time-series/temporal data. At any given point, it accepts the latest input vector and produces the intended output using a combination of (the latest input + some ‘context’ regarding what it saw before):

In the above picture, $x_t$ is the input at time $t$. $h_{t-1}$ represents the context from $t-1$. If $x_t$ has a dimensionality of $d$, $h_{t-1}$ of dimensionality $2d$ is a concatenation of two vectors:

1. The intended output by the same LSTM at the last time-step $t-1$ (the Short Term memory), and
2. Another $d$-dimensional vector encoding the Long Term memory – also called the Cell State.

The second part is usually not of use for the next component in the architecture. It is instead used by the same LSTM for the following step. LSTMs are usually trained by providing them with a ton of example input-series with the expected outputs. This enables them to learn what parts of the input to retain/hold, and how to mathematically process $x_t$ and $h_{t-1}$ to come up with $h_t$. If you wish to understand LSTMs better, I recommend this blog post by Christopher Olah.

An LSTM can also be ‘unfolded’, as shown below:

Don’t worry, they are copies of the same LSTM cell (hence same training), each feeding their output to the next one in line. What this allows us to do is give in the entire set of input vectors (in essence, the whole time-series) all at once, instead of going step-by-step with a single copy of the LSTM.

GNMT’s encoder network is essentially a series of stacked LSTMs:

Each horizontal line of pink/green boxes is an ‘unfolded’ LSTM on its own. The above figure therefore has 8 stacked LSTMs in a series. The input to the whole architecture is the ordered set of tokens in the sentence, each represented in the form of a vector. Mind you, I said tokens – not words. What GNMT does in pre-processing, is break up all words into tokens/pieces, which are then fed as a series to the neural network. This enables the framework to (at least partially) understand unseen complicated words. For example, suppose I say the word ‘Pteromerhanophobia’. Even though you may not know exactly what it is, you can tell me that it is some sort of fear based on the token ‘phobia’. Google calls this approach Wordpiece modeling. The break-up of words into tokens is done based on statistical learning (which group of tokens make most sense?) from a huge vocabulary in the training phase.
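Here is a minimal sketch of the idea of breaking a word into known pieces. It is a simplification and not GNMT's actual procedure: the vocabulary is a tiny hand-written set (a real one is learned from data), and the greedy longest-match rule and the function name are assumptions made for illustration.

```python
def split_into_pieces(word, vocab):
    """Greedily split a word into the longest pieces found in the vocabulary."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it is a known vocabulary entry.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            return ["<unk>"]  # no known piece starts at this position
        pieces.append(word[start:end])
        start = end
    return pieces


# Hypothetical vocabulary of word pieces.
VOCAB = {"ptero", "mer", "hano", "phobia", "arachno", "claustro"}

pieces = split_into_pieces("pteromerhanophobia", VOCAB)
```

Even though the full word is not in the vocabulary, the final piece `phobia` is shared with other fear-related words, which gives the network something familiar to work with.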

When you stack LSTMs, each layer learns a pattern in the time series fed to it by the earlier (lower) layer. As you go higher up the ladder, you see more and more abstract patterns from the data that was fed in to the lowest layer. For example, the lowest layer might see a set of points and deduce a line, the next layer will see a set of lines and deduce a polygon, the next will see a set of polygons and learn an object, and so on… Of course, there is a limit to how many and in what way you should stack LSTMs together – more is not always better, since you will ultimately end up with a model that’s too slow and difficult to train.

There are a few interesting things about this architecture shown above, apart from the stacking of LSTMs.

You will see that the second layer from the bottom is green in color. This is because the arrows – the ordering of tokens in the sentence – are reversed for this layer, which means that the second LSTM sees the entire sentence in reverse order. The reason to do this is simple: When you look at a sentence as a whole, the ‘context’ for any word is not just contained in the words preceding it, but also in the words following it. The two bottom-most layers both see the raw sentence as input, but in opposite order. The third LSTM gets this bidirectional input from the first two layers – basically, a combination of the forward and backward context for any given word. Each layer from this point on learns higher-level patterns in the contextual meanings of words in the sentence.

You might also have noticed the ‘+’ signs that appear before providing inputs to the fifth layer and above. This is a form of Residual Learning. This is what happens from layer 5 onwards: For every layer $N+1$, the input is an addition of the output of layers $N$ and $N-1$. Take a look at my post on Residual Neural Networks to get a better understanding of what this does.
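The wiring described in this paragraph can be sketched in a few lines. This is only a toy: each ‘layer’ is a plain function standing in for an LSTM, the function names are made up, and the rule for where the additions start follows the description in this post rather than any official implementation.

```python
def add(u, v):
    """Element-wise sum of two equal-length vectors."""
    return [a + b for a, b in zip(u, v)]


def run_stack(x, layers, first_residual_input=5):
    """Feed x through stacked layers, adding residual inputs from a given layer on.

    outputs[0] is the raw input and outputs[n] is the output of layer n.
    From layer `first_residual_input` upwards, the input to layer n is
    outputs[n - 1] + outputs[n - 2], as described in the text.
    """
    outputs = [x]
    for n, layer in enumerate(layers, start=1):
        if n >= first_residual_input:
            layer_in = add(outputs[n - 1], outputs[n - 2])
        else:
            layer_in = outputs[n - 1]
        outputs.append(layer(layer_in))
    return outputs[-1]


def double(v):
    """Stand-in for a real layer: doubles every component."""
    return [2 * a for a in v]


result = run_stack([1.0], [double] * 8)
```

The point of the extra additions is that the signal (and, during training, the gradient) has a shortcut around each layer, which tends to make deep stacks easier to train.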

Lastly, you can see the extra <2es> and </s> characters at the end of the input to the encoder. </s> represents ‘end of input’. <2es>, on the other hand, represents the Target Language – in this case, Spanish. GNMT does this unique thing where they provide the Target Language as input to the framework, to improve performance of Translate. More on this later.

### Attention Module and the Decoder

The Encoder produces a set of ordered output-vectors (one for each token in the input). These are then fed into the Attention Module & Decoder framework. To a large extent, the Decoder is similar to the Encoder in design: stacked LSTMs and residual connections. Let’s discuss the parts that are different.

I have already mentioned that GNMT considers the entire sentence as input, in every sense. However, it is intuitive to think that for every token that the decoder will produce, it should not give equal weight to all vectors (tokens) in the input sentence. As you write out one part of the story, your focus should slowly drift to the rest of it. This work is done by the Attention Module. What the Attention Module gets as input, is the complete output of the Encoder and the latest vector from the Decoder stack. This lets it ‘understand’ how much/what has already been translated, and it then directs the Decoder to shift attention to the other parts of the Encoder output.
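The core of an attention step can be sketched as follows. This is a simplified stand-in: it scores each Encoder output with a plain dot product against the Decoder's latest vector, which is an assumption made here for brevity and not necessarily the scoring function GNMT uses, and the vectors are made-up toy values.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, encoder_outputs):
    """Weight every encoder output by its relevance to the current decoder state."""
    scores = [sum(d * e for d, e in zip(decoder_state, enc))
              for enc in encoder_outputs]
    weights = softmax(scores)
    dim = len(encoder_outputs[0])
    # The context vector is the attention-weighted average of the encoder outputs.
    context = [sum(w * enc[i] for w, enc in zip(weights, encoder_outputs))
               for i in range(dim)]
    return weights, context


# Toy example: three encoder outputs (one per input token) and one decoder state.
encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
weights, context = attend([0.0, 3.0], encoder_outputs)
```

Here the second input token receives most of the weight, so the `context` handed to the Decoder is dominated by that token's vector.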

The Decoder LSTM-stack keeps outputting vectors based on the input from the Encoder and directions from the Attention module. These vectors are given to the Softmax Layer. You can think of the Softmax Layer as a Probability distribution-generator. Based on the incoming vector from the topmost LSTM, the Softmax Layer assigns a probability to every possible output token (remember the target language was already provided to the Encoder, so that information has already been propagated). The token that gets the maximum probability is written out.

The whole process stops once the Decoder/Softmax decides that the current token is </s> (or end-of-sentence). Note that the Decoder does not have to follow a number of steps equal to the output vectors from the Encoder, since it is paying weighted attention to all of those at every step of computation.

Overall, this is how you can visualize the complete translation process:

### Training & Zero-Shot Translation

The complete framework (Encoder+Attention+Decoder) is trained by providing it a huge collection of (input, translated) pairs of sentences. The architecture ‘knows’ the input language in a sense when it converts tokens from the incoming sentence to the appropriate vector format. The target language is provided as a parameter as well. The brilliance of deep-LSTMs lies in the fact that the neural network learns all of the computational stuff by itself, using a class of algorithms called Backpropagation/Gradient Descent.

Here’s another amazing discovery made by the GNMT team: Simply by providing the target language as an input to the framework, it is able to perform Zero-Shot Translation! What this basically means is: If during training you provide it examples of English->Japanese & English->Korean translations, GNMT automatically does Japanese->Korean reasonably well! In fact, this is the biggest achievement of GNMT as a project. The intuition: what the Encoder essentially produces is a form of interlingua (or universal language). Whenever I say ‘dog’ in any language, you end up thinking of a friendly canine – essentially, the concept of ‘dog’. This ‘concept’ is what is produced by the Encoder, and it is irrespective of any language. In fact, some articles went so far as to say that Google’s AI had invented a language of its own :-D.

Providing the target language as input allows GNMT to easily use the same neural network for training with any pair of languages, which in turn allows zero-shot translations. As a result, the new Translate gets closer than ever before to the way humans perform translations in their mind.

Here are some references if you want to read further on this subject 🙂 :

# On Interpretable Models

Artificial Intelligence is everywhere today. And as intelligent systems get more ubiquitous, the need to understand their behavior becomes equally important. Maybe if you are developing an algorithm to recognize a cat-photo for fun, you don’t need to know how it works as long as it delivers the results. But if you have deployed a model to predict whether a person will default on a loan or not, and you use it to make your decisions, you better be sure you are doing the right thing – for practical, ethical AND legal reasons.

From Dictionary.com,

interpretability: to give or provide the meaning of; explain; explicate; elucidate

### Why do models need to be interpretable?

The primary reason why we need explainability in AI, is to develop a sense of understanding and trust. Think about it – the only way you would ever delegate an important task to someone else, is if you had a certain level of trust in their thought process. If for instance Siri makes an error in understanding your commands, that’s fine. But now consider self-driven cars. The reason why most people would not readily sit in a self-driven car within a city, is because we cannot guarantee that it will do the right thing in every situation. Interpretability is thus crucial for building trust towards models, especially in domains like healthcare, finance and the judicial system.

Interpretability is also important while debugging problems in a model’s performance. These problems might be caused by the algorithm itself, or by the data being used to train it. And you may not really observe these issues until you deploy a compound system that uses this model. Let’s take the example of Google’s Word2Vec. Word2Vec is currently one of the best algorithms for computing word-embeddings given a significant amount of text. The widely used pre-trained vectors were trained on a corpus of Google News articles and cover a vocabulary of about 3 million words and phrases. In some research conducted by people from Boston University and Microsoft Research, they found a ton of hidden sexism in the word-embeddings generated from that dataset. For example, the framework came up with this particular analogy: “man : computer programmer :: woman : homemaker”. Funny, ain’t it? This was not a problem with the algorithm itself, but a reflection of the way news articles are usually written. Quoting the source, “Any bias contained in word embeddings like those from Word2vec is automatically passed on in any application that exploits it.”

### How do we increase interpretability of models?

There are two ways to promote interpretability when it comes to Machine Learning/AI systems: Transparency, and Post-Hoc Explainability. Algorithmic transparency would mean that you understand the way your model works on an intuitive level, with respect to the dataset you used for training. A Decision Tree, for example, is pretty transparent – in fact, you can use the paths from the root to every leaf node to decompose the Tree into the set of rules used for classification. But a deep Neural Network is not so transparent, for obvious reasons. Though you may understand linear algebra and back-propagation, you will typically not be able to make sense of the weights/biases learned by a deep-NN after training.
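To show what ‘decomposing the Tree into a set of rules’ means in practice, here is a minimal sketch that walks every root-to-leaf path of a small hand-written tree. The tree, its feature names and thresholds are entirely hypothetical, and the nested-dictionary representation is just one convenient choice.

```python
# Hypothetical decision tree for a loan decision, as nested dictionaries.
TREE = {
    "feature": "income", "threshold": 40000,
    "left": {"label": "reject"},
    "right": {
        "feature": "missed_payments", "threshold": 2,
        "left": {"label": "approve"},
        "right": {"label": "reject"},
    },
}


def extract_rules(node, conditions=()):
    """Return one (conditions, label) rule per root-to-leaf path."""
    if "label" in node:
        return [(list(conditions), node["label"])]
    feature, threshold = node["feature"], node["threshold"]
    return (extract_rules(node["left"], conditions + (f"{feature} <= {threshold}",))
            + extract_rules(node["right"], conditions + (f"{feature} > {threshold}",)))


rules = extract_rules(TREE)
for conditions, label in rules:
    print(" AND ".join(conditions), "=>", label)
```

Each printed line is a rule a human can read and argue with, which is exactly what is missing when you stare at the weight matrices of a deep network.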

Transparency has two aspects: Decomposability, and Simultaneity. Decomposability would mean understanding each individual component of your model. In essence, there should not be a ‘black-box’ component of the system in your eyes. Simultaneity, on the other hand, indicates an understanding of how all these individual components work together as a whole. And the former does not necessarily imply the latter – consider an algorithm as simple as linear regression. You would probably know that if the weight with respect to a predictor is positive after training, it shows a direct proportionality to the target variable. Now, if you train a simple linear regression of Disease-risk vs Vaccination, you would most probably get a negative weight on the Vaccination variable. But if you now take tricky factors such as immunodeficiency or age (old-age or infancy) into the picture, the weight might take on a whole different value. In fact, as the number of predictor variables goes on increasing in regression, it gets more and more difficult to understand how your model will behave as a whole. And thus, the notion that a ‘simple’ model (linear regression) would be far easier to interpret than a ‘complex’ one (deep learning) is misleading.

Post-Hoc means ‘occurring after an event’. In the context of model transparency, post-hoc interpretation would mean an effort to understand its behavior after it has finished training, typically using some test inputs. Some models/algorithms inherently have the ability to ‘explain’ their behavior. For example, take k-NN classifiers. Along with the required output, you can hack (with minimal effort) the model to return the k-nearest neighbors as examples for scrutiny. This way, you get a good idea of the combination of properties that produce similar results by looking at the known training points.
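The k-NN ‘hack’ mentioned above takes only a few lines. The sketch below is a bare-bones classifier written from scratch, with made-up toy points and an illustrative function name; it returns the neighbors that produced the vote alongside the prediction.

```python
import math
from collections import Counter


def knn_predict_with_evidence(query, points, labels, k=3):
    """Classify `query` by majority vote and return the neighbors that voted."""
    ranked = sorted(range(len(points)), key=lambda i: math.dist(query, points[i]))
    nearest = ranked[:k]
    vote = Counter(labels[i] for i in nearest).most_common(1)[0][0]
    evidence = [(points[i], labels[i]) for i in nearest]
    return vote, evidence


# Toy training data: two well-separated clusters.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8)]
labels = ["cat", "cat", "cat", "dog", "dog"]

prediction, evidence = knn_predict_with_evidence((1.5, 1.5), points, labels, k=3)
```

Inspecting `evidence` tells you which known examples the model considered similar to the query, which is often more convincing than the predicted label on its own.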

Most algorithms don’t have such easy post-hoc interpretability, though. In such cases, you have to use techniques such as visualization to understand how they behave/work. For instance, you could use a dimensionality reduction technique such as t-SNE to reduce vector points to 2-3 dimensions and visualize class ‘regions’ in 2D/3D space. Essentially, you are enabling easy visualization of higher-dimensional data by embedding it in a lower-dimensional space. Saliency maps are a technique used to interpret deep neural networks. In Natural Language Processing, textual explanations are also being adopted. Since humans usually understand words better than raw numbers, providing text-based explanations makes sense. For example, in a system like LSI, you could ‘understand’ a word’s embedding by (proportionately) looking at the words that strongly belong to the latent topic(s) it most relates to.

### Conclusion and further reading

I did kind-of imply that interpretability is required so that we end up trusting automated systems as much as we trust humans. But as it turns out, it’s not as if human actions are perfectly explainable. There is a ton of research in psychology that clearly indicates that the motivations for our actions are not as clear as we ourselves tend to believe. The Illusion of Conscious Will by Daniel Wegner talks about how our decisions tend to be influenced by subconscious processes without us realizing it. Moreover, it seems contradictory to the ultimate aim of AI to avoid building models that we cannot ‘understand’. If there is ever machine intelligence smarter than us, the likelihood of us understanding it completely is pretty slim (Terminator, anyone?).

Here are a couple of links for you to look at, if you want to read more:

# Non-Mathematical Feature Engineering techniques for Data Science

“Apply Machine Learning like the great engineer you are, not like the great Machine Learning expert you aren’t.”

This is the first sentence in a Google-internal document I read about how to apply ML. And rightly so. In my limited experience working as a server/analytics guy, data (and how to store/process it) has always been the source of most consideration and impact on the overall pipeline. Ask any Kaggle winner, and they will always say that the biggest gains usually come from being smart about representing data, rather than using some sort of complex algorithm. Even the CRISP data mining process has not one, but two stages dedicated solely to data understanding and preparation.

So what is Feature Engineering?

Simply put, it is the art/science of representing data in the best way possible.

Why do I say art/science? Because good Feature Engineering involves an elegant blend of domain knowledge, intuition, and basic mathematical abilities. Heck, the most effective data representation ‘hacks’ barely involve any mathematical computation at all! (As I will explain in a short while).

What do I mean by ‘best’? In essence, the way you present your data to your algorithm should denote the pertinent structures/properties of the underlying information in the most effective way possible. When you do feature engineering, you are essentially converting your data attributes into data features.

Attributes are basically all the dimensions present in your data. But do all of them, in the raw format, represent the underlying trends you want to learn in the best way possible? Maybe not. So what you do in feature engineering, is pre-process your data so that your model/learning algorithm has to spend minimum effort on wading through noise. What I mean by ‘noise’ here, is any information that is not relevant to learning/predicting your ultimate goal. In fact, using good features can even let you use considerably simpler models since you are doing a part of the thinking yourself.

But as with any technique in Machine Learning, always use validation to make sure that the new features you introduce really do improve your predictions, instead of adding unnecessary complexity to your pipeline.

As mentioned before, good feature engineering involves intuition, domain knowledge (human experience) and basic math skills. So here are a few extremely simple techniques for you to (maybe) apply in your next data science solution:

### 1. Representing timestamps

Time-stamp attributes are usually denoted by the EPOCH time or split up into multiple dimensions such as (Year, Month, Date, Hours, Minutes, Seconds). But in many applications, a lot of that information is unnecessary. Consider for example a supervised system that tries to predict traffic levels in a city as a function of Location+Time. In this case, trying to learn trends that vary by seconds would mostly be misleading. The year wouldn’t add much value to the model either. Hours, day and month are probably the only dimensions you need. So when representing the time, try to ensure that your model really does require all the numbers you are providing it.

And don’t forget time zones. If your data comes from different geographical sources, do remember to normalize by time zone if needed.
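As a rough sketch of both points (dropping unneeded fields and normalizing time zones), the helper below turns an EPOCH timestamp into just hour, day and month in the source’s local time. The function name `time_features` and the `utc_offset_hours` parameter are made up for this example, and a fixed UTC offset is a simplifying assumption that ignores daylight saving.

```python
from datetime import datetime, timedelta, timezone


def time_features(epoch_seconds, utc_offset_hours=0):
    """Keep only hour, day and month, expressed in the source's local time.

    Assumes a fixed UTC offset per data source (no daylight saving).
    """
    local_tz = timezone(timedelta(hours=utc_offset_hours))
    moment = datetime.fromtimestamp(epoch_seconds, tz=local_tz)
    # Year, minutes and seconds are deliberately discarded.
    return {"hour": moment.hour, "day": moment.day, "month": moment.month}


# The same instant, seen from a UTC-8 source and from a UTC source.
west_coast = time_features(1514052000, utc_offset_hours=-8)
utc = time_features(1514052000)
```

For this timestamp, `west_coast` is `{"hour": 10, "day": 23, "month": 12}` while `utc` has `"hour": 18`, which is exactly the kind of discrepancy you want to resolve before training.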

### 2. Decomposing Categorical Attributes

Some attributes come as categories instead of numbers. A simple example would be a ‘color’ attribute that is (say) one of {Red, Green, Blue}. The most common way to go about representing this, is to convert each category into a binary attribute that takes one value out of {0, 1}. So you basically end up with a number of added attributes equal to the number of categories possible. And for each instance in your dataset, only one of them is 1 (with the others being 0). This is a form of one-hot encoding.
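A minimal sketch of this decomposition for the color example, in plain Python (the helper name `one_hot` is just for illustration):

```python
def one_hot(value, categories):
    """Return a list with a 1 at the position of `value` and 0 elsewhere."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == category else 0 for category in categories]


COLORS = ["Red", "Green", "Blue"]

# Each instance becomes three binary attributes, exactly one of which is 1.
encoded = [one_hot(color, COLORS) for color in ["Green", "Blue", "Red"]]
# encoded == [[0, 1, 0], [0, 0, 1], [1, 0, 0]]
```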

If you are new to this idea, you may think of decomposition as an unnecessary hassle (we are essentially bloating up the dimensionality of the dataset). Instead, you might be tempted to convert the categorical attribute into a scalar value. For example, the color feature might take one value from {1, 2, 3}, representing {Red, Green, Blue} respectively. There are two problems with this. First, for a mathematical model, this would mean that Red is somehow ‘more similar’ to Green than to Blue (since |1-3| > |1-2|). Unless your categories do have a natural ordering (such as stations on a railway line), this might mislead your model. Secondly, it would make statistical metrics (such as mean) meaningless – or worse, misleading yet again. Consider the color example again. If your dataset contains equal numbers of Red and Blue instances but no Green ones, the ‘average’ value of color would still come out to be 2 – essentially meaning Green!
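The second problem is easy to reproduce. The sketch below uses a made-up four-row dataset with no Green instances and compares the mean of the scalar codes against the per-category means you would get from binary (one-hot) columns.

```python
from statistics import mean

COLOR_CODES = {"Red": 1, "Green": 2, "Blue": 3}
colors = ["Red", "Blue", "Red", "Blue"]  # note: no Green at all

# Mean of the scalar encoding lands on the code for Green.
scalar_mean = mean(COLOR_CODES[c] for c in colors)

# Mean of each binary column is simply the fraction of rows in that category.
one_hot_means = {
    name: mean(1 if c == name else 0 for c in colors)
    for name in COLOR_CODES
}
```

Here `scalar_mean` is 2 (the code for Green), whereas `one_hot_means` correctly reports 0.5 for Red, 0 for Green and 0.5 for Blue.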

The safest place to convert a categorical attribute into a scalar, is when you have only two categories possible. So you have {0, 1} corresponding to {Category 1, Category 2}. In this case, an ‘ordering’ isn’t really required, and you can interpret the value of the attribute as the probability of belonging to Category 2 vs Category 1.

### 3. Binning/Bucketing

Sometimes, it makes more sense to represent a numerical attribute as a categorical one. The idea is to reduce the noise endured by the learning algorithm, by assigning certain ranges of a numerical attribute to distinct ‘buckets’. Consider the problem of predicting whether a person owns a certain item of clothing or not. Age is very likely a factor here, but what is actually more pertinent is the age group. So what you could do, is have ranges such as 1-10, 11-18, 19-25, 26-40, etc. Moreover, instead of decomposing these categories as in point 2, you could just use scalar values, since age groups that lie ‘closer by’ do represent similar properties.

Bucketing makes sense when the domain of your attribute can be divided into neat ranges, where all numbers falling in a range imply a common characteristic. It reduces overfitting in certain applications, where you don’t want your model to try and distinguish between values that are too close by – for example, you could club together all latitude values that fall in a city, if your property of interest is a function of the city as a whole. Binning also reduces the effect of tiny errors, by ‘rounding off’ a given value to the nearest representative. Binning does not make sense if the number of your ranges is comparable to the total possible values, or if precision is very important to you.
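Here is one way the age-group example might look, using the standard library’s `bisect` module. The bucket boundaries follow the ranges mentioned above; the final open-ended bucket for ages above 40 is an assumption added for this sketch, as is the helper name `age_bucket`.

```python
from bisect import bisect_left

# Upper bounds of the buckets 1-10, 11-18, 19-25, 26-40.
# Anything above 40 falls into one extra bucket (an assumption for this sketch).
AGE_UPPER_BOUNDS = [10, 18, 25, 40]


def age_bucket(age):
    """Map an age to an ordinal bucket index (0 = youngest group)."""
    return bisect_left(AGE_UPPER_BOUNDS, age)


buckets = [age_bucket(age) for age in [7, 10, 11, 25, 26, 63]]
# buckets == [0, 0, 1, 2, 3, 4]
```

Because the bucket index preserves the ordering of the age groups, it can be fed to the model as a plain scalar.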

### 4. Feature Crosses

This is perhaps the most important/useful one of these. Feature crosses are a way to combine two or more categorical attributes into a single one. This is an extremely useful technique when certain features together denote a property better than they do individually. Mathematically speaking, you are taking the cross (Cartesian) product of all possible values of the categorical features.

Consider a feature A, with two possible values {A1, A2}. Let B be a feature with possibilities {B1, B2}. Then, a feature-cross between A & B (let’s call it AB) would take one of the following values: {(A1, B1), (A1, B2), (A2, B1), (A2, B2)}. You can basically give these ‘combinations’ any names you like. Just remember that every combination denotes a synergy between the information contained by the corresponding values of A and B.
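The A and B example can be enumerated directly with `itertools.product`. The `cross` helper and its naming scheme are arbitrary choices for this sketch.

```python
from itertools import product

A_VALUES = ["A1", "A2"]
B_VALUES = ["B1", "B2"]

# Every possible value of the crossed feature AB.
AB_VALUES = list(product(A_VALUES, B_VALUES))
# [("A1", "B1"), ("A1", "B2"), ("A2", "B1"), ("A2", "B2")]


def cross(a, b):
    """Give the (a, b) combination a single categorical name."""
    return f"{a}_{b}"


# Crossing the A and B attributes of a few instances.
rows = [("A1", "B2"), ("A2", "B1")]
crossed = [cross(a, b) for a, b in rows]
# crossed == ["A1_B2", "A2_B1"]
```

The crossed feature is itself categorical, so it would typically be decomposed into binary attributes as in point 2.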

For example, take the diagram shown below:

All the blue points belong to one class, and the red ones belong to another. Let’s put the actual model aside. First off, you would benefit from binning the X, Y values into {x < 0, x >= 0} & {y < 0, y >= 0} respectively. Let’s call them {Xn, Xp} and {Yn, Yp}. It is pretty obvious that Quadrants I & III correspond to class Red, and Quadrants II & IV contain class Blue. So if you could now cross features X and Y into a single feature ‘Quadrant’, you would basically have {I, II, III, IV} being equivalent to {(Xp, Yp), (Xn, Yp), (Xn, Yn), (Xp, Yn)} respectively.
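Putting the binning and the cross together, a small sketch of the ‘Quadrant’ feature could look like this (the function name `quadrant` is made up; the bin names follow the text):

```python
# Names for each (X bin, Y bin) combination, as listed above.
QUADRANT_NAMES = {
    ("Xp", "Yp"): "I",
    ("Xn", "Yp"): "II",
    ("Xn", "Yn"): "III",
    ("Xp", "Yn"): "IV",
}


def quadrant(x, y):
    """Bin x and y by sign, then cross the two bins into one feature."""
    x_bin = "Xp" if x >= 0 else "Xn"
    y_bin = "Yp" if y >= 0 else "Yn"
    return QUADRANT_NAMES[(x_bin, y_bin)]


points = [(1.5, 2.0), (-1.5, 2.0), (-1.5, -2.0), (1.5, -2.0)]
quadrants = [quadrant(x, y) for x, y in points]
# quadrants == ["I", "II", "III", "IV"]
```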

A more concrete/relatable example of a (possibly) good feature cross is something like (Latitude, Longitude). A given latitude corresponds to many places around the globe. The same goes for longitude. But once you combine Lat & Long buckets into discrete ‘blocks’, they denote ‘regions’ in a geography, with possibly similar properties for each one.
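As a rough sketch, the block below buckets latitude and longitude into a grid and uses the pair of bucket indices as the crossed feature. The helper name `geo_block` and the one-degree cell size are illustrative assumptions; a real application would pick a cell size that matches the regions it cares about.

```python
import math


def geo_block(latitude, longitude, cell_degrees=1.0):
    """Bucket latitude and longitude separately, then cross the two buckets."""
    lat_bucket = math.floor(latitude / cell_degrees)
    lon_bucket = math.floor(longitude / cell_degrees)
    return (lat_bucket, lon_bucket)


# Two nearby points share a block...
block_a = geo_block(37.77, -122.42)
block_b = geo_block(37.80, -122.27)
# ...while a point at the same latitude but a very different longitude does not.
block_c = geo_block(37.77, 122.42)
```

Here `block_a` and `block_b` are both `(37, -123)`, while `block_c` is `(37, 122)`: the latitude bucket alone cannot tell these places apart, but the cross can.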

Sometimes, attributes can also be ‘combined’ into a single feature with simple mathematical hacks. In the above example, suppose you define modified features $X_{sign}$ and $Y_{sign}$ as follows:

$X_{sign} = \frac{x}{|x|}$

$Y_{sign} = \frac{y}{|y|}$

Now, you could just define a new feature $Quadrant_{odd}$ as follows:

$Quadrant_{odd} = X_{sign}Y_{sign}$

That’s all! If $Quadrant_{odd} = 1$, the class is Red. Else, Blue!
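The same trick in code, as a small sketch. Note that $\frac{x}{|x|}$ is undefined at zero, so this version simply rejects points lying on an axis; that choice, and the function names, are assumptions of this example.

```python
def sign(value):
    """Return value / |value|, i.e. +1.0 or -1.0 (undefined at zero)."""
    if value == 0:
        raise ValueError("sign is undefined for 0 in this sketch")
    return value / abs(value)


def quadrant_odd(x, y):
    """Product of the signs: +1 in Quadrants I and III, -1 in II and IV."""
    return sign(x) * sign(y)


def predicted_class(x, y):
    return "Red" if quadrant_odd(x, y) == 1 else "Blue"


points = [(2, 3), (-2, 3), (-2, -3), (2, -3)]
classes = [predicted_class(x, y) for x, y in points]
# classes == ["Red", "Blue", "Red", "Blue"]
```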

For the sake of completeness, I will also mention some mathematically intensive feature engineering techniques, with links for you to read more about them:

5. Feature Selection : Using certain algorithms to automatically select a subset of your original features, for your final model. Here, you are not creating/modifying your current features, but rather pruning them to reduce noise/redundancy.

6. Feature Scaling : Sometimes, you may notice that certain attributes have a higher ‘magnitude’ than others. An example might be a person’s income, as compared to their age. In such cases, for certain models (such as Ridge Regression), it is in fact necessary that you scale all your attributes to comparable/equivalent ranges. This prevents your model from giving greater weightage to certain attributes as compared to others.

7. Feature Extraction : Feature extraction involves a host of algorithms that automatically generate a new set of features from your raw attributes. Dimensionality reduction falls under this category.
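To make item 6 (Feature Scaling) a bit more tangible, here is a minimal sketch of one common scaling scheme, standardization (subtract the mean, divide by the standard deviation). It is only one of several ways to scale, and the income and age values are made up for illustration.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    center = mean(values)
    spread = pstdev(values)
    if spread == 0:
        # A constant attribute carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(value - center) / spread for value in values]


# Hypothetical attributes with very different magnitudes.
incomes = [30000, 60000, 90000]
ages = [25, 40, 55]

scaled_incomes = standardize(incomes)
scaled_ages = standardize(ages)
```

After scaling, both attributes live on the same range (roughly -1.22 to 1.22 for this toy data), so neither dominates purely because of its units.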