# Weekly Review: 12/23/2017

Happy Holidays people! If you live in the Bay Area then the next week is probably your time off, so I hope you have fun and enjoy the holiday season! As for Robotics, I just finished Week 2 of Perception, and will probably kick off Week 3 in 2018. I am excited for the last ‘real’ course (Estimation & Learning), and then building my own robot as part of the ‘Capstone’ project after that :-D.

This week’s articles:

XGBoost

I recently came across XGBoost (eXtreme Gradient Boosting), an improvement over standard Gradient Boosting. Finding it this late is actually a shame, considering how popular this method is in Data Science. If you are rusty on ensemble learning, take a look at this article on bagging/Random Forests, and my own intro to Boosting.

XGBoost is one of the most efficient versions of Gradient Boosting, and apparently works really well on structured/tabular data. It also provides features such as sparse-awareness (being able to handle missing values), and the ability to update models with ‘continued training’. Its effectiveness for tabular data has made it very popular with Kaggle winners, with one of them quoting: “When in doubt, use xgboost”!

Take a look at the original paper to dig deeper.

Quantum Computing + Machine Learning

A lot of companies, such as Google, Microsoft, etc., have recently shown interest in the domain of Quantum Computing. Rigetti happens to be a startup that aims to rival these juggernauts with its solution for cloud-based Quantum Computing (called Forest). They even have their own Python integration!

The article in question details their efforts to prototype simple clustering with quantum computing. It is still pretty crude, and is by no means a replacement for traditional systems – for now. One of the major criticisms is “Applying Quantum Computing to Machine Learning will only make a black-box system more difficult to understand”. This is in fact true, but the author suggests that ML could actually help us understand the behavior of Quantum Computers by modelling them!

A simple, easy-to-read, fun article on how you could break the simplest CAPTCHA algorithms with CV+Deep Learning.

Learning Indexing Structures with ML

Indexing structures are essentially data structures meant for efficient data access. For example, a B-Tree Index is used for efficient range-queries, a Hash-table is used for fast key-based access, etc. However, all of these data structures are pretty rigid in their behavior – they do not fine-tune/change their parameters based on the structure of the data.

This paper (that includes the Google legend Jeff Dean as an author) explores the possibility of using Neural Networks (in fact, a hierarchy of them) as indexing structures. Basically, you would use a Neural Network to compute the function – f: data -> hash/position.

Some key takeaways from the paper:

1. Range Index models essentially ‘learn’ a cumulative distribution function.
2. The overall ‘learned index’ proposed by this paper is a hierarchy of models (but not a tree, since two models at a certain layer can point to the same model in the next layer).
    1. As you go down the layers, the models deal with smaller and smaller subsets of the data.
3. Unlike a B-Tree, there is no ‘search’ involved, since each model predicts the next model for hash generation.
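
To make the first takeaway concrete, here is a minimal sketch of a single-model range index, assuming numeric keys and a plain linear fit in place of a neural network. The class name `LearnedRangeIndex` and its methods are made up for illustration and are not from the paper. It also corrects the model's guess inside a small error window, which is an assumption of this sketch and not the hierarchy described above.

```python
import bisect
import numpy as np


class LearnedRangeIndex:
    """Toy 'learned index' over a sorted array of numeric keys (hypothetical)."""

    def __init__(self, keys):
        # Assumes at least two distinct keys, so the linear fit is defined.
        self.keys = np.sort(np.asarray(keys, dtype=float))
        positions = np.arange(len(self.keys))
        # Fit position ~ slope * key + intercept. This is a linear
        # approximation of N * CDF(key), where N is the number of keys.
        self.slope, self.intercept = np.polyfit(self.keys, positions, deg=1)
        predicted = self.slope * self.keys + self.intercept
        # Worst-case prediction error, used to bound the correction step.
        self.max_err = int(np.ceil(np.max(np.abs(predicted - positions))))

    def lookup(self, key):
        """Return the position of `key` in the sorted array, or -1 if absent."""
        guess = int(round(float(self.slope * key + self.intercept)))
        lo = max(0, guess - self.max_err)
        hi = min(len(self.keys), max(lo, guess + self.max_err + 1))
        # Correct the model's guess, looking only inside the error window.
        pos = lo + bisect.bisect_left(self.keys[lo:hi].tolist(), key)
        if pos < len(self.keys) and self.keys[pos] == key:
            return pos
        return -1


index = LearnedRangeIndex([3, 8, 15, 16, 23, 42, 57, 91])
print(index.lookup(23), index.lookup(50))
```

The closer the fitted line is to the true distribution of keys, the smaller `max_err` gets, and the less correction work remains after the prediction.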

Tacotron 2

This post on the Google Research blog details the development of a WaveNet-like framework to generate Human Speech from text.

# Weekly Review: 11/11/2017

The Motion Planning course is going faster than I expected. I completed 2 weeks within 5 days. That’s good I guess, since it means I might get to the Capstone project before I take a vacation to India.

Here’s the stuff from this week:

Graphcore and the Intelligent Processing Unit (IPU)

Graphcore aims to disrupt the world of ML-focussed computing devices. In an interesting blog post, they visualize neuron connections in different CNN architectures, and talk about how they compare to the human brain.

If you are curious about how IPUs differ from CPUs and GPUs, this NextPlatform article gives a few hints. Mind you, IPUs are yet to be ‘released’, so there’s no concrete information out yet. If you want to brush up on why memory is so important for neural network training (more than inference), this is a good place to start.

Overview of Different CNN architectures

This article on the CV-Tricks blog gives a high-level overview of the major CNN architectures so far: AlexNet, VGG, Inception, ResNets, etc. It’s a good place to go for reference if you ever happen to forget what one of them did differently.

On that note, this blog post by Adit Deshpande goes into the ‘Brief History of Deep Learning’, marking out all the main research papers of importance.

Meta-learning and AutoML

The New York Times posted an article about AI systems that can build other AI systems, thus leading to what they call ‘Meta-learning’ (Learning how to learn/build systems that learn).

Google has been dabbling in meta-learning with a project called AutoML. AutoML basically consists of a ‘Generator’ network that comes up with various NN architectures, which are then evaluated by a ‘Scorer’ that trains them and computes their accuracy. The gradients with respect to these scores are passed back to the Generator, in order to improve the output architectures. This is their original paper, in case you want to take a look.

The AutoML team recently wrote another post about large-scale object detection using their algorithms.

Tangent

People from Google recently open-sourced their library for computing gradients of Python functions. Tangent works directly on your Python code (rather than viewing it as a black-box), and comes up with a derivative function to compute its gradient. This is useful in cases where you might want to debug how/why some NN architecture is not getting trained the way it’s supposed to. Here’s their GitHub repo.

Reconstructing films with Neural Network

This blog post talks about the use of Autoencoders and GANs to reconstruct films using NNs trained on them. They also venture into reconstructing films using NNs trained on other stylish films (like A Scanner Darkly). The results are pretty interesting.

# Weekly Review: 10/28/2017

This was a pretty busy week with a lot going on, but I finally seem to be settling into my new role!

The study for Aerial Robotics is almost over with a week to go. There hasn’t been much coding in this course, but that was to be expected since it was more about PID-Control Theory and quadrotor dynamics. I am particularly interested in the Capstone/’final’ project for this course, which would involve building an autonomous robot in Pi.

Anyway, on to the interesting tidbits from this week:

AlphaGo Zero

Google’s DeepMind recently announced a new version of their AI-based Go player, AlphaGo Zero. What makes this one so special is that it breaks the common notion of intelligent systems requiring a LOT of data to produce decent results. AlphaGo Zero was only provided the basic rules of Go, and it performed the rest of the learning all by playing against itself. Oh and BTW, AlphaGo Zero beats AlphaGo, the previous champion in the game. This is indeed a landmark in demonstrating the power of good-old RL.

Read this article for a basic overview, and their paper in Nature for a detailed explanation. Brushing up on Monte Carlo Tree Search would certainly help.

Word Mover’s Distance

Given an excellent embedding of words such as Word2Vec, it is not very difficult to compute the semantic distance between individual terms. However, when it comes to big blocks of text, a simple ‘average’ over term-embeddings isn’t good enough for computing their relative distances.

In such cases, the Word Mover’s Distance, inspired by the Earth Mover’s Distance, provides a better solution. It figures out the semantically closest term(s) from one document to each term in another, and then the average effort required to ‘rephrase’ one text in words of another. Click on the article link for a detailed explanation.
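
As a rough illustration, here is a sketch of a relaxed version of this idea: every word in one document simply ‘moves’ to its nearest word in the other, and the distances are averaged. The full Word Mover’s Distance solves a transport problem instead, so treat this as a simplification. The function name and the 2-D embeddings below are made up for the example.

```python
import numpy as np


def relaxed_word_movers_distance(doc_a, doc_b, embeddings):
    """Average distance from each word in doc_a to its nearest word in doc_b."""
    vecs_a = np.array([embeddings[w] for w in doc_a])
    vecs_b = np.array([embeddings[w] for w in doc_b])
    # Pairwise Euclidean distances, shape (len(doc_a), len(doc_b)).
    dists = np.linalg.norm(vecs_a[:, None, :] - vecs_b[None, :, :], axis=2)
    # Each word in doc_a moves all of its weight to its closest word in doc_b.
    return float(dists.min(axis=1).mean())


# Hypothetical 2-D embeddings, made up for illustration only.
toy_embeddings = {
    "president": np.array([1.0, 0.0]),
    "obama": np.array([0.9, 0.1]),
    "speaks": np.array([0.0, 1.0]),
    "greets": np.array([0.1, 0.9]),
    "banana": np.array([-1.0, -1.0]),
}

near = relaxed_word_movers_distance(
    ["obama", "speaks"], ["president", "greets"], toy_embeddings)
far = relaxed_word_movers_distance(
    ["obama", "speaks"], ["banana"], toy_embeddings)
print(near < far)
```

The two documents about a president speaking share no words, yet they end up much closer to each other than to the unrelated one, which is the behaviour a plain word-overlap measure would miss.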

Robots generalizing from simulations

OpenAI posted a blog article about how they trained a robot only through simulations. This means that the robot received no data from sensors during the training phase, but was able to perform basic tasks in deployment after some calibration.

During the simulations, they used dynamics randomization to alter basic traits of the environment. This data was then fed to an LSTM to understand the settings and goals. A key insight from this work is Hindsight Experience Replay. Quoting the article, “Hindsight Experience Replay (HER), allows agents to learn from a binary reward by pretending that a failure was what they wanted to do all along and learning from it accordingly. (By analogy, imagine looking for a gas station but ending up at a pizza shop. You still don’t know where to get gas, but you’ve now learned where to get pizza.)”
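
The relabelling trick in that quote can be sketched in a few lines. This is only a toy, assuming episodes are stored as lists of dicts and that the final state reached is used as the substitute goal. The function names and the gas-station data are hypothetical.

```python
def binary_reward(achieved_state, goal):
    """1.0 if the goal was reached, else 0.0 (the 'binary reward' setting)."""
    return 1.0 if achieved_state == goal else 0.0


def hindsight_relabel(episode):
    """Copy an episode, pretending its final state was the goal all along."""
    achieved = episode[-1]["next_state"]
    return [
        {
            "state": step["state"],
            "action": step["action"],
            "next_state": step["next_state"],
            "goal": achieved,
            "reward": binary_reward(step["next_state"], achieved),
        }
        for step in episode
    ]


# A failed episode: we wanted the gas station but ended up at the pizza shop.
episode = [
    {"state": "home", "action": "drive_north",
     "next_state": "mall", "goal": "gas_station"},
    {"state": "mall", "action": "drive_east",
     "next_state": "pizza_shop", "goal": "gas_station"},
]
extra_experience = hindsight_relabel(episode)
print([step["reward"] for step in extra_experience])
```

The original episode earns no reward at all, while the relabelled copy contains a successful transition that a learner can actually use.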

Concurrency in Go

If you are a Go Programmer, take a look at this old (but good) talk on concurrency patterns and constructs in the language.

Generalization Bounds in Machine Learning

The Generalization Gap for an ML system is defined as the difference between the training error and the generalization error. The Generalization Bound tries to put a bound on this value, based on probability theory. Read this article for a detailed mathematical explanation.
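
As one simple instance of such a bound (not necessarily the one derived in the linked article), assume a single hypothesis $h$ fixed before seeing the data, a loss bounded in $[0, 1]$, and $m$ i.i.d. training samples. Writing $R(h)$ for the generalization error and $\hat{R}_m(h)$ for the training error, Hoeffding’s inequality gives:

$$
P\left( \left| R(h) - \hat{R}_m(h) \right| > \epsilon \right) \le 2\exp\left(-2 m \epsilon^2\right)
$$

Setting the right-hand side to $\delta$ and solving for $\epsilon$, with probability at least $1 - \delta$:

$$
\left| R(h) - \hat{R}_m(h) \right| \le \sqrt{\frac{\ln(2/\delta)}{2m}}
$$

So under these assumptions the gap shrinks at a rate of $1/\sqrt{m}$ as the training set grows.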

# Understanding the new Google Translate

Google launched a new version of Translate in September 2016. Since then, there have been a few interesting developments in the project, and this post attempts to explain it all in as simple terms as possible.

The earlier version of Translate used Phrase-based Machine Translation, or PBMT. What PBMT does is break up an input sentence into a set of words/phrases and translate each one individually. This is obviously not an optimal strategy, since it completely misses out on the context of the overall sentence. The new Translate uses what Google calls Google Neural Machine Translation (GNMT), an improvement over a traditional version of NMT. Let’s see how GNMT works on a high-level:

### The Encoder

Before you understand the encoder, you must understand what an LSTM (Long Short-Term Memory) cell is. It is basically a Neural Network with some concept of memory. An LSTM is generally used to ‘learn’ patterns in time-series/temporal data. At any given point, it accepts the latest input vector and produces the intended output using a combination of (the latest input + some ‘context’ regarding what it saw before):

In the above picture, $x_t$ is the input at time $t$. $h_{t-1}$ represents the context from $t-1$. If $x_t$ has a dimensionality of $d$, $h_{t-1}$ of dimensionality $2d$ is a concatenation of two vectors:

1. The intended output by the same LSTM at the last time-step $t-1$ (the Short Term memory), and
2. Another $d$-dimensional vector encoding the Long Term memory – also called the Cell State.

The second part is usually not of use for the next component in the architecture. It is instead used by the same LSTM for the following step. LSTMs are usually trained by providing them with a ton of example input-series with the expected outputs. This enables them to learn what parts of the input to retain/hold, and how to mathematically process $x_t$ and $h_{t-1}$ to come up with $h_t$. If you wish to understand LSTMs better, I recommend this blog post by Christopher Olah.

An LSTM can also be ‘unfolded’, as shown below:

Don’t worry, they are copies of the same LSTM cell (hence same training), each feeding their output to the next one in line. What this allows us to do is give in the entire set of input vectors (in essence, the whole time-series) all at once, instead of going step-by-step with a single copy of the LSTM.
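
To tie the cell and its unfolded form together, here is a minimal NumPy sketch of one common LSTM formulation, with random weights where a real model would have trained ones. The names `lstm_step` and `run_unfolded` and the gate layout inside `W` are assumptions of this sketch, not GNMT’s actual implementation.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, c_prev, W, b):
    """One time-step. W has shape (4*d, 2*d) and b has shape (4*d,)."""
    d = h_prev.shape[0]
    z = W @ np.concatenate([x_t, h_prev]) + b
    i = sigmoid(z[0 * d:1 * d])    # input gate: what to write to memory
    f = sigmoid(z[1 * d:2 * d])    # forget gate: what to keep in memory
    o = sigmoid(z[2 * d:3 * d])    # output gate: what to expose
    g = np.tanh(z[3 * d:4 * d])    # candidate values for the memory
    c_t = f * c_prev + i * g       # Long Term memory (the Cell State)
    h_t = o * np.tanh(c_t)         # output at time t (the Short Term memory)
    return h_t, c_t


def run_unfolded(xs, W, b, d):
    """Apply the same cell (same W and b) to every element of the series."""
    h, c = np.zeros(d), np.zeros(d)
    outputs = []
    for x_t in xs:
        h, c = lstm_step(x_t, h, c, W, b)
        outputs.append(h)
    return np.stack(outputs)


rng = np.random.default_rng(0)
d = 3
xs = rng.normal(size=(5, d))           # a series of five d-dimensional inputs
W = rng.normal(size=(4 * d, 2 * d))    # untrained weights, illustration only
b = np.zeros(4 * d)
print(run_unfolded(xs, W, b, d).shape)
```

Note that the loop reuses one `W` and one `b` at every step, which is exactly what ‘copies of the same LSTM cell’ means in the unfolded picture.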

GNMT’s encoder network is essentially a series of stacked LSTMs:

Each horizontal line of pink/green boxes is an ‘unfolded’ LSTM on its own. The above figure therefore has 8 stacked LSTMs in a series. The input to the whole architecture is the ordered set of tokens in the sentence, each represented in the form of a vector. Mind you, I said tokens – not words. What GNMT does in pre-processing is break up all words into tokens/pieces, which are then fed as a series to the neural network. This enables the framework to (at least partially) understand unseen complicated words. For example, suppose I say the word ‘Pteromerhanophobia’. Even though you may not know exactly what it is, you can tell me that it is some sort of fear based on the token ‘phobia’. Google calls this approach Wordpiece modeling. The break-up of words into tokens is done based on statistical learning (which group of tokens make most sense?) from a huge vocabulary in the training phase.
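
Here is a toy sketch of the splitting step only, assuming the piece vocabulary has already been learned and using a greedy longest-match rule. Both the rule and the tiny vocabulary below are assumptions for illustration. The real Wordpiece model learns its vocabulary from data and may segment words differently.

```python
def split_into_pieces(word, vocab):
    """Greedy longest-match-first split of `word` into pieces from `vocab`."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it is found in the vocabulary.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No known piece starts here, so give up on the whole word.
            return ["<unk>"]
        pieces.append(word[start:end])
        start = end
    return pieces


# Hypothetical vocabulary of pieces, made up for this example.
toy_vocab = {"ptero", "mer", "hano", "phobia", "arachno"}

print(split_into_pieces("pteromerhanophobia", toy_vocab))
print(split_into_pieces("arachnophobia", toy_vocab))
```

Even though neither full word is in the vocabulary, both end with the shared piece `phobia`, which is the hint the network gets to work with.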

When you stack LSTMs, each layer learns a pattern in the time series fed to it by the earlier (lower) layer. As you go higher up the ladder, you see more and more abstract patterns from the data that was fed in to the lowest layer. For example, the lowest layer might see a set of points and deduce a line, the next layer will see a set of lines and deduce a polygon, the next will see a set of polygons and learn an object, and so on… Of course, there is a limit to how many and in what way you should stack LSTMs together – more is not always better, since you will ultimately end up with a model that’s too slow and difficult to train.

There are a few interesting things about this architecture shown above, apart from the stacking of LSTMs.

You will see that the second layer from the bottom is green in color. This is because the arrows – the ordering of tokens in the sentence – are reversed for this layer. This means that the second LSTM sees the entire sentence in reverse order. The reason to do this is simple: When you look at a sentence as a whole, the ‘context’ for any word is not just contained in the words preceding it, but also in the words following it. The two bottom-most layers both see the raw sentence as input, but in opposite order. The third LSTM gets this bidirectional input from the first two layers – basically, a combination of the forward and backward context for any given word. Each layer from this point on learns higher-level patterns in the contextual meanings of words in the sentence.

You might also have noticed the ‘+’ signs that appear before providing inputs to the fifth layer and above. This is a form of Residual Learning. This is what happens from layer 5 onwards: For every layer $N+1$, the input is an addition of the output of layers $N$ and $N-1$. Take a look at my post on Residual Neural Networks to get a better understanding of what this does.
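
A small sketch of how such a stack could be wired, assuming the usual residual form where a layer’s input is added to its output before being handed to the next layer. The function `run_stack` and the toy ‘layers’ are hypothetical stand-ins for the LSTMs, and this reading of the ‘+’ signs is my assumption.

```python
import numpy as np


def run_stack(x, layers, first_residual_layer=5):
    """Feed x through `layers` (numbered from 1), adding skip connections.

    From `first_residual_layer` onwards, the input to a layer is the
    previous layer's output plus that previous layer's own input.
    """
    layer_in = x
    for n, layer in enumerate(layers, start=1):
        layer_out = layer(layer_in)
        if n + 1 >= first_residual_layer:
            # The '+' sign: add the skip connection before moving up.
            layer_in = layer_out + layer_in
        else:
            layer_in = layer_out
    return layer_in


# Toy stand-in layers: each one simply doubles its input.
toy_layers = [lambda v: 2.0 * v for _ in range(3)]
print(run_stack(np.array([1.0]), toy_layers, first_residual_layer=3))
```

The skip connection gives every upper layer a direct path to what the layers below it saw, which is what makes deep stacks easier to train.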

Lastly, you can see the extra <2es> and </s> characters at the end of the input to the encoder. </s> represents ‘end of input’. <2es>, on the other hand, represents the Target Language – in this case, Spanish. GNMT does this unique thing where they provide the Target Language as input to the framework, to improve performance of Translate. More on this later.

### Attention Module and the Decoder

The Encoder produces a set of ordered output-vectors (one for each token in the input). These are then fed into the Attention Module & Decoder framework. To a large extent, the Decoder is similar to the Encoder in design: stacked LSTMs and residual connections. Let’s discuss the parts that are different.

I have already mentioned that GNMT considers the entire sentence as input, in every sense. However, it is intuitive to think that for every token that the decoder will produce, it should not give equal weightage to all vectors (tokens) in the input sentence. As you write out one part of the story, your focus should slowly drift to the rest of it. This work is done by the Attention Module. What the Attention Module gets as input is the complete output of the Encoder and the latest vector from the Decoder stack. This lets it ‘understand’ how much/what has already been translated, and it then directs the Decoder to shift attention to the other parts of the Encoder output.
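
The weighting idea can be sketched as follows, assuming simple dot-product scoring between the Decoder’s latest vector and each Encoder output. GNMT’s own scoring function may well differ, and the names `attend`, `encoder_outputs` and `decoder_state` are made up for this example.

```python
import numpy as np


def softmax(scores):
    exps = np.exp(scores - np.max(scores))   # shift for numerical stability
    return exps / exps.sum()


def attend(encoder_outputs, decoder_state):
    """encoder_outputs: (num_tokens, d) array. decoder_state: (d,) array."""
    # One relevance score per input token.
    scores = encoder_outputs @ decoder_state
    # Turn the scores into weights that are positive and sum to 1.
    weights = softmax(scores)
    # Weighted average of the Encoder outputs: the 'context' for this step.
    context = weights @ encoder_outputs
    return weights, context


encoder_outputs = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
weights, context = attend(encoder_outputs, np.array([0.0, 3.0]))
print(weights.round(3), context.round(3))
```

Here the second input token gets most of the weight because it lines up best with the Decoder’s current state, and a different state at the next step would shift that weight elsewhere.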

The Decoder LSTM-stack keeps outputting vectors based on the input from the Encoder and directions from the Attention module. These vectors are given to the Softmax Layer. You can think of the Softmax Layer as a Probability distribution-generator. Based on the incoming vector from the topmost LSTM, the Softmax Layer assigns a probability to every possible output token (remember the target language was already provided to the Encoder, so that information has already been propagated). The token that gets the maximum probability is written out.

The whole process stops once the Decoder/Softmax decides that the current token is </s> (or end-of-sentence). Note that the Decoder does not have to follow a number of steps equal to the output vectors from the Encoder, since it is paying weighted attention to all of those at every step of computation.
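
The ‘pick the most probable token until `</s>`’ loop described above might look like this. It is a sketch of the greedy version only, and `step_fn` is a hypothetical stand-in for the whole Decoder stack, returning one score per vocabulary entry.

```python
import numpy as np


def softmax(logits):
    exps = np.exp(logits - np.max(logits))   # shift for numerical stability
    return exps / exps.sum()


def greedy_decode(step_fn, vocab, max_steps=50):
    """Repeatedly pick the most probable token until '</s>' is produced."""
    output = []
    for _ in range(max_steps):
        probs = softmax(step_fn(output))      # distribution over the vocabulary
        token = vocab[int(np.argmax(probs))]  # token with maximum probability
        if token == "</s>":
            break
        output.append(token)
    return output


vocab = ["hola", "mundo", "</s>"]


def toy_step(previous_tokens):
    """Fake decoder: favours the next vocabulary entry at every step."""
    logits = np.zeros(len(vocab))
    logits[min(len(previous_tokens), len(vocab) - 1)] = 5.0
    return logits


print(greedy_decode(toy_step, vocab))
```

The loop length depends only on when `</s>` shows up, not on the number of input tokens, which matches the note above about the Decoder not following a fixed number of steps.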

Overall, this is how you can visualize the complete translation process:

### Training & Zero-Shot Translation

The complete framework (Encoder+Attention+Decoder) is trained by providing it a huge collection of (input, translated) pairs of sentences. The architecture ‘knows’ the input language in a sense when it converts tokens from the incoming sentence to the appropriate vector format. The target language is provided as a parameter as well. The brilliance of deep-LSTMs lies in the fact that the neural network learns all of the computational stuff by itself, using a class of algorithms called Backpropagation/Gradient Descent.

Here’s another amazing discovery made by the GNMT team: Simply by providing the target language as an input to the framework, it is able to perform Zero-Shot Translation! What this basically means is: If during training you provide it examples of English->Japanese & English->Korean translations, GNMT automatically does Japanese->Korean reasonably well! In fact, this is the biggest achievement of GNMT as a project. The intuition: what the Encoder essentially produces is a form of interlingua (or universal language). Whenever I say ‘dog’ in any language, you end up thinking of a friendly canine – essentially, the concept of ‘dog’. This ‘concept’ is what is produced by the Encoder, and it is irrespective of any language. In fact, some articles went so far as to say that Google’s AI had invented a language of its own :-D.

Providing the target language as input allows GNMT to easily use the same neural network for training with any pair of languages, which in turn allows zero-shot translations. As a result, the new Translate gets closer than ever before to the way humans perform translations in their mind.

Here are some references if you want to read further on this subject 🙂 :

Hello people! I haven’t really blogged in quite some time, and I kind-of feel guilty about it :-). Truth is, I have been busy starting a new job/life at Google, as a Software Engineer at the Hyderabad office. I joined the Google Apps for Work team, and I work on the analytics part of things. I was pretty swamped with setting things up, getting the formalities done, finding a place to live, blah blah – Basically learning to be an adult! Things have finally settled down now, and I have (I think) found my groove when it comes to my overall routine – so I will probably start blogging as usual in the coming weeks.

To answer the obvious question, life at Google is …well…pretty awesome. Let me write out some points about my experience at the company that’s supposed to be one of the best employers in the world. And as per my usual style, they will be bullet points, since putting down a coherent/easy-flowing train of thoughts is just beyond my abilities as a writer.

I. The cool work culture!! I don’t know how many other companies do this (since this is my first real job) but it’s definitely not all of them. Google doesn’t just talk about an open workspace, it actually follows through on it. My team lead engineer sits at a desk that’s exactly like the one I use (except for a lot of Swag he has that I don’t), and I can just drop by if I have any doubts/issues regarding pretty much anything (even non work related). If my mentor needs to talk to me, he will just come by and sit on a bean-bag lying around my desk and discuss things like a friend at college. You literally won’t be able to tell Tech-levels of people at Google, and that’s something I feel is really nice about the office culture.

II. Getting intimidated by the people and the technology. This is a big one. I don’t think I have ever felt so…small…in front of people around me, anytime before in my life. To put it honestly, I felt like an idiot around my team for the first week or two. Not because they acted in any such way – in fact I had to stalk them online to know more about them – but because all of them are basically smartasses. And the other aspect is obviously that you get overwhelmed by the internal infrastructure at a company like Google. It’s just so vast and there are so many parts and bits and pieces working together, that it’s difficult to wrap your head around all of them at first. It’s all humbling.

III. The perks!! I don’t really think I need to explain this one (since it’s covered extensively in articles online). We at Hyderabad don’t have the perks that Mountain View does, but it’s still pretty darn amazing. Free transport, Bunker rooms, Free food (Oh the damn awesome food :-D), Microkitchens, Games rooms on all floors, Amazing gym, Massage centre, Free internet at home, Matching your charity (if you do any), Techstop for chargers and stuff…the list goes on. Who knows, once we have the campus, the list might expand even further!

Trust me, I know I sound like a freaking fanboy throughout the post. But well… I am barely one year out of my college term, and this is pretty much paradise for a luxury-loving dork like me. I hope to have a great life here at Google, and justify the whole process by making as much of an impact as I can. Cheers!

# Self-Organizing Maps with Google’s TensorFlow

[This post assumes that you know the basics of Google’s TensorFlow library. If you don’t, have a look at my earlier post to get started.]

A Self-Organizing Map, or SOM, falls under the rare domain of unsupervised learning in Neural Networks. It’s essentially a grid of neurons, each denoting one cluster learned during training. Traditionally speaking, there is no concept of neuron ‘locations’ in ANNs. However, in an SOM, each neuron has a location, and neurons that lie close to each other represent clusters with similar properties. Each neuron has a weightage vector, which is equal to the centroid of its particular cluster.

AI-Junkie’s post does a great job of explaining how an SOM is trained, so I won’t re-invent the wheel.
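
For orientation, a single training step can be sketched in plain NumPy, assuming a Gaussian neighbourhood and linearly decaying learning rate and radius. The function `som_update` and its default values are made up for this sketch, and it processes one input vector at a time.

```python
import numpy as np


def som_update(weights, locations, x, iter_no, n_iterations,
               alpha0=0.3, sigma0=1.5):
    """One training step of a toy SOM.

    weights:   (m*n, dim) weightage vectors, one per neuron
    locations: (m*n, 2) grid coordinates of the neurons
    x:         (dim,) training vector
    Assumes iter_no < n_iterations, so the radius stays positive.
    """
    # Learning rate and radius shrink linearly with the iteration number.
    decay = 1.0 - iter_no / float(n_iterations)
    alpha, sigma = alpha0 * decay, sigma0 * decay
    # Best Matching Unit: the neuron whose weights are closest to x.
    bmu = np.argmin(np.linalg.norm(weights - x, axis=1))
    # Squared grid distance of every neuron from the BMU.
    grid_dist_sq = np.sum((locations - locations[bmu]) ** 2, axis=1)
    # Neurons near the BMU on the grid get a larger share of the update.
    rates = alpha * np.exp(-grid_dist_sq / sigma ** 2)
    # Pull every neuron towards x, scaled by its own rate.
    return weights + rates[:, None] * (x - weights)


rng = np.random.default_rng(0)
locations = np.array([[i, j] for i in range(2) for j in range(2)])
weights = rng.normal(size=(4, 3))
x = np.array([1.0, 0.0, 0.0])
new_weights = som_update(weights, locations, x, iter_no=0, n_iterations=10)
print(new_weights.shape)
```

Repeating this step over all training vectors, for many iterations, is what gradually arranges similar clusters next to each other on the grid.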

The Code

Here’s my code for a 2-D version of an SOM. It’s written with TensorFlow as its core training architecture: (It’s heavily commented, so look at the inline docs if you want to hack/dig around)

```
import tensorflow as tf
import numpy as np

#NOTE: Written against the early (pre-1.0) TensorFlow API, in which
#tf.sub, tf.mul, tf.neg and tf.pack were the op names.

class SOM(object):
    """
    2-D Self-Organizing Map with Gaussian Neighbourhood function
    and linearly decreasing learning rate.
    """

    #To check if the SOM has been trained
    _trained = False

    def __init__(self, m, n, dim, n_iterations=100, alpha=None, sigma=None):
        """
        Initializes all necessary components of the TensorFlow
        Graph.

        m X n are the dimensions of the SOM. 'n_iterations' should
        be an integer denoting the number of iterations undergone
        while training.
        'dim' is the dimensionality of the training inputs.
        'alpha' is a number denoting the initial time(iteration no)-based
        learning rate. Default value is 0.3
        'sigma' is the initial neighbourhood value, denoting
        the radius of influence of the BMU while training. By default, it is
        taken to be half of max(m, n).
        """

        #Assign required variables first
        self._m = m
        self._n = n
        if alpha is None:
            alpha = 0.3
        else:
            alpha = float(alpha)
        if sigma is None:
            sigma = max(m, n) / 2.0
        else:
            sigma = float(sigma)
        self._n_iterations = abs(int(n_iterations))

        ##INITIALIZE GRAPH
        self._graph = tf.Graph()

        ##POPULATE GRAPH WITH NECESSARY COMPONENTS
        with self._graph.as_default():

            ##VARIABLES AND CONSTANT OPS FOR DATA STORAGE

            #Randomly initialized weightage vectors for all neurons,
            #stored together as a matrix Variable of size [m*n, dim]
            self._weightage_vects = tf.Variable(tf.random_normal(
                [m*n, dim]))

            #Matrix of size [m*n, 2] for SOM grid locations
            #of neurons
            self._location_vects = tf.constant(np.array(
                list(self._neuron_locations(m, n))))

            ##PLACEHOLDERS FOR TRAINING INPUTS
            #We need to assign them as attributes to self, since they
            #will be fed in during training

            #The training vector
            self._vect_input = tf.placeholder("float", [dim])
            #Iteration number
            self._iter_input = tf.placeholder("float")

            ##CONSTRUCT TRAINING OP PIECE BY PIECE
            #Only the final, 'root' training op needs to be assigned as
            #an attribute to self, since all the rest will be executed
            #automatically during training

            #To compute the Best Matching Unit given a vector
            #Basically calculates the Euclidean distance between every
            #neuron's weightage vector and the input, and returns the
            #index of the neuron which gives the least value
            bmu_index = tf.argmin(tf.sqrt(tf.reduce_sum(
                tf.pow(tf.sub(self._weightage_vects, tf.pack(
                    [self._vect_input for i in range(m*n)])), 2), 1)),
                                  0)

            #This will extract the location of the BMU based on the BMU's
            #index
            slice_input = tf.pad(tf.reshape(bmu_index, [1]),
                                 np.array([[0, 1]]))
            bmu_loc = tf.reshape(tf.slice(self._location_vects, slice_input,
                                          tf.constant(np.array([1, 2]))),
                                 [2])

            #To compute the alpha and sigma values based on iteration
            #number
            learning_rate_op = tf.sub(1.0, tf.div(self._iter_input,
                                                  self._n_iterations))
            _alpha_op = tf.mul(alpha, learning_rate_op)
            _sigma_op = tf.mul(sigma, learning_rate_op)

            #Construct the op that will generate a vector with learning
            #rates for all neurons, based on iteration number and location
            #wrt BMU.
            bmu_distance_squares = tf.reduce_sum(tf.pow(tf.sub(
                self._location_vects, tf.pack(
                    [bmu_loc for i in range(m*n)])), 2), 1)
            neighbourhood_func = tf.exp(tf.neg(tf.div(tf.cast(
                bmu_distance_squares, "float32"), tf.pow(_sigma_op, 2))))
            learning_rate_op = tf.mul(_alpha_op, neighbourhood_func)

            #Finally, the op that will use learning_rate_op to update
            #the weightage vectors of all neurons based on a particular
            #input
            learning_rate_multiplier = tf.pack([tf.tile(tf.slice(
                learning_rate_op, np.array([i]), np.array([1])), [dim])
                                               for i in range(m*n)])
            weightage_delta = tf.mul(
                learning_rate_multiplier,
                tf.sub(tf.pack([self._vect_input for i in range(m*n)]),
                       self._weightage_vects))
            new_weightages_op = tf.add(self._weightage_vects,
                                       weightage_delta)
            self._training_op = tf.assign(self._weightage_vects,
                                          new_weightages_op)

            ##INITIALIZE SESSION
            self._sess = tf.Session()

            ##INITIALIZE VARIABLES
            init_op = tf.initialize_all_variables()
            self._sess.run(init_op)

    def _neuron_locations(self, m, n):
        """
        Yields one by one the 2-D locations of the individual neurons
        in the SOM.
        """
        #Nested iterations over both dimensions
        #to generate all 2-D locations in the map
        for i in range(m):
            for j in range(n):
                yield np.array([i, j])

    def train(self, input_vects):
        """
        Trains the SOM.
        'input_vects' should be an iterable of 1-D NumPy arrays with
        dimensionality as provided during initialization of this SOM.
        Current weightage vectors for all neurons(initially random) are
        taken as starting conditions for training.
        """

        #Training iterations
        for iter_no in range(self._n_iterations):
            #Train with each vector one by one
            for input_vect in input_vects:
                self._sess.run(self._training_op,
                               feed_dict={self._vect_input: input_vect,
                                          self._iter_input: iter_no})

        #Store a centroid grid for easy retrieval later on
        centroid_grid = [[] for i in range(self._m)]
        self._weightages = list(self._sess.run(self._weightage_vects))
        self._locations = list(self._sess.run(self._location_vects))
        for i, loc in enumerate(self._locations):
            centroid_grid[loc[0]].append(self._weightages[i])
        self._centroid_grid = centroid_grid

        self._trained = True

    def get_centroids(self):
        """
        Returns a list of 'm' lists, with each inner list containing
        the 'n' corresponding centroid locations as 1-D NumPy arrays.
        """
        if not self._trained:
            raise ValueError("SOM not trained yet")
        return self._centroid_grid

    def map_vects(self, input_vects):
        """
        Maps each input vector to the relevant neuron in the SOM
        grid.
        'input_vects' should be an iterable of 1-D NumPy arrays with
        dimensionality as provided during initialization of this SOM.
        Returns a list of 1-D NumPy arrays containing (row, column)
        info for each input vector(in the same order), corresponding
        to mapped neuron.
        """

        if not self._trained:
            raise ValueError("SOM not trained yet")

        to_return = []
        for vect in input_vects:
            min_index = min([i for i in range(len(self._weightages))],
                            key=lambda x: np.linalg.norm(vect-
                                                         self._weightages[x]))
            to_return.append(self._locations[min_index])

        return to_return
```

A few points about the code:

1) Since my post on K-Means Clustering, I have gotten more comfortable with matrix operations in TensorFlow. You need to be comfortable with matrices if you want to work with TensorFlow (or any data flow infrastructure for that matter, even SciPy). You can code pretty much any logic or operational flow with TensorFlow. You just need to be able to build up complex functionality from basic components (ops), and structure the flow of data (tensors/variables) well.

2) It took quite a while for me to build the whole graph in such a way that the entire training functionality could be enclosed in a single op. This op is called during each iteration, for every vector, during training. Such an implementation is more in line with TensorFlow’s way of doing things, than my previous attempt with clustering.

3) I have used a 2-D grid for the SOM, but you can use any geometry you wish. You would just have to modify the `_neuron_locations` method appropriately, and also the method that returns the centroid outputs. You could return a `dict` that maps neuron location to the corresponding cluster centroid.

4) To keep things simple, I haven’t provided for online training. You could do that by having bounds for the learning rate(s).

Sample Usage

I have used PyMVPA’s example of RGB colours to confirm that the code does work. PyMVPA provides functionality to train SOMs too (along with many other learning techniques).

Here’s how you would do it with my code:

```
#For plotting the images
from matplotlib import pyplot as plt
import numpy as np

#Assumes the SOM class defined above is in scope

#Training inputs for RGBcolors
colors = np.array(
     [[0., 0., 0.],
      [0., 0., 1.],
      [0., 0., 0.5],
      [0.125, 0.529, 1.0],
      [0.33, 0.4, 0.67],
      [0.6, 0.5, 1.0],
      [0., 1., 0.],
      [1., 0., 0.],
      [0., 1., 1.],
      [1., 0., 1.],
      [1., 1., 0.],
      [1., 1., 1.],
      [.33, .33, .33],
      [.5, .5, .5],
      [.66, .66, .66]])
color_names = \
    ['black', 'blue', 'darkblue', 'skyblue',
     'greyblue', 'lilac', 'green', 'red',
     'cyan', 'violet', 'yellow', 'white',
     'darkgrey', 'mediumgrey', 'lightgrey']

#Train a 20x30 SOM with 400 iterations
som = SOM(20, 30, 3, 400)
som.train(colors)

#Get output grid
image_grid = som.get_centroids()

#Map colours to their closest neurons
mapped = som.map_vects(colors)

#Plot
plt.imshow(image_grid)
plt.title('Color SOM')
for i, m in enumerate(mapped):
    plt.text(m[1], m[0], color_names[i], ha='center', va='center',
             bbox=dict(facecolor='white', alpha=0.5, lw=0))
plt.show()

```

Here’s a sample of the output you would get (varies each time you train, but the color names should go to the correct locations in the image):

# K-Means Clustering with TensorFlow

Google recently open-sourced its Artificial Intelligence/Numerical Computing library called TensorFlow. TensorFlow was developed by members of the Google Brain team, and has the flexibility to run on a variety of platforms – including GPUs and mobile devices.

TensorFlow’s methodology uses what they call data-flow graphs. Consider the following diagram from the Wikipedia page on Genetic Programming (which could have some interesting applications with TensorFlow, I think):

As you probably understood, the graphical structure is a way of representing a computational expression in the form of a Tree. Every node is an operation (TensorFlow calls them ops, short for operations). The non-leaf nodes are pretty easy to understand. Some leaf nodes are a special case of an operation, always ‘returning’ a constant value (like 7 or 2.2 in the Tree). Others (like X or Y) act as placeholders that will be fed in at the time of execution. If you look at the arrows, you will realize that their directions denote the dependencies between outputs of different nodes. Hence, data (TensorFlow calls them Tensors) will flow in the opposite direction along each node, hence the name TensorFlow. TensorFlow provides other components over this graphical abstraction, like persistent memory elements that retain data (called Variables), and optimization techniques to fine-tune the parameters in these Variables in applications like Neural Networks.
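
To get a feel for the idea without the library, here is a toy data-flow graph in plain Python. The classes `Constant`, `Placeholder` and `Op` are made up for this sketch and are not TensorFlow’s API, and the expression is just an illustrative one built from constants like 7 and 2.2 and placeholders X and Y.

```python
import math


class Constant:
    """Leaf node that always 'returns' the same value."""
    def __init__(self, value):
        self.value = value

    def evaluate(self, feed):
        return self.value


class Placeholder:
    """Leaf node whose value is fed in at execution time."""
    def __init__(self, name):
        self.name = name

    def evaluate(self, feed):
        return feed[self.name]


class Op:
    """Non-leaf node: applies a function to the outputs of its input nodes."""
    def __init__(self, fn, *inputs):
        self.fn = fn
        self.inputs = inputs

    def evaluate(self, feed):
        # Data flows from the leaves up to this node.
        return self.fn(*(node.evaluate(feed) for node in self.inputs))


# Build the graph for (2.2 - X / 11) + 7 * cos(Y). Nothing is computed yet.
x, y = Placeholder("X"), Placeholder("Y")
left = Op(lambda a, b: a - b,
          Constant(2.2), Op(lambda a, b: a / b, x, Constant(11)))
right = Op(lambda a, b: a * b, Constant(7), Op(math.cos, y))
root = Op(lambda a, b: a + b, left, right)

# Execute the graph by feeding values for the placeholders.
print(root.evaluate({"X": 11.0, "Y": 0.0}))
```

Building the graph and running it are two separate steps here, which is the same split you will see in TensorFlow code between constructing ops and running them in a session.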

TensorFlow has a powerful Python API. The TensorFlow team has done an awesome job of writing the documentation (which is a little tricky to navigate). If you are completely new to this, here are a few links to get you started (in the order you should visit them):

2. Read this example to get a vague idea of what TensorFlow code looks like.

3. Now read this explanation of the basic components of TensorFlow. It helps if you read the above example again, or simultaneously.

4. Read this detailed example of using TensorFlow for a common ML problem.

5. Once you have a decent understanding of the basic components and how they work, you can look at the Python docs for reference.

Now here is the code I wrote for K-Means clustering using TensorFlow. As a disclaimer, I will mention that this code is based on my (at the time of writing this) 2-day-old understanding of how the library works. If you find any errors or know of any possible optimizations, do drop a comment! The code is heavily documented, so do go through the in-line docs.

```
import tensorflow as tf
from random import choice, shuffle
from numpy import array

def TFKMeansCluster(vectors, noofclusters):
    """
    K-Means Clustering using TensorFlow.
    'vectors' should be a n*k 2-D NumPy array, where n is the number
    of vectors of dimensionality k.
    'noofclusters' should be an integer.
    """

    noofclusters = int(noofclusters)
    assert noofclusters < len(vectors)

    #Find out the dimensionality
    dim = len(vectors[0])

    #Will help select random centroids from among the available vectors
    vector_indices = list(range(len(vectors)))
    shuffle(vector_indices)

    #GRAPH OF COMPUTATION
    #We initialize a new graph and set it as the default during each run
    #of this algorithm. This ensures that as this function is called
    #multiple times, the default graph doesn't keep getting crowded with
    #unused ops and Variables from previous function calls.

    graph = tf.Graph()

    with graph.as_default():

        #SESSION OF COMPUTATION

        sess = tf.Session()

        ##CONSTRUCTING THE ELEMENTS OF COMPUTATION

        ##First let's ensure we have a Variable vector for each centroid,
        ##initialized to one of the vectors from the available data points
        centroids = [tf.Variable((vectors[vector_indices[i]]))
                     for i in range(noofclusters)]
        ##These nodes will assign the centroid Variables the appropriate
        ##values
        centroid_value = tf.placeholder("float64", [dim])
        cent_assigns = []
        for centroid in centroids:
            cent_assigns.append(tf.assign(centroid, centroid_value))

        ##Variables for cluster assignments of individual vectors(initialized
        ##to 0 at first)
        assignments = [tf.Variable(0) for i in range(len(vectors))]
        ##These nodes will assign an assignment Variable the appropriate
        ##value
        assignment_value = tf.placeholder("int32")
        cluster_assigns = []
        for assignment in assignments:
            cluster_assigns.append(tf.assign(assignment,
                                             assignment_value))

        ##Now let's construct the node that will compute the mean
        #The placeholder for the input
        mean_input = tf.placeholder("float", [None, dim])
        #The Node/op takes the input and computes a mean along the 0th
        #dimension, i.e. the list of input vectors
        mean_op = tf.reduce_mean(mean_input, 0)

        ##Node for computing Euclidean distances
        #Placeholders for input
        v1 = tf.placeholder("float", [dim])
        v2 = tf.placeholder("float", [dim])
        euclid_dist = tf.sqrt(tf.reduce_sum(tf.pow(tf.sub(
            v1, v2), 2)))  #tf.sub was renamed tf.subtract in TensorFlow 1.0

        ##This node will figure out which cluster to assign a vector to,
        ##based on Euclidean distances of the vector from the centroids.
        #Placeholder for input
        centroid_distances = tf.placeholder("float", [noofclusters])
        cluster_assignment = tf.argmin(centroid_distances, 0)

        ##INITIALIZING STATE VARIABLES

        ##This will help initialization of all Variables defined with respect
        ##to the graph. The Variable-initializer should be defined after
        ##all the Variables have been constructed, so that each of them
        ##will be included in the initialization.
        init_op = tf.initialize_all_variables()  #later: tf.global_variables_initializer()

        #Initialize all variables
        sess.run(init_op)

        ##CLUSTERING ITERATIONS

        #Now perform the Expectation-Maximization steps of K-Means clustering
        #iterations. To keep things simple, we will only do a set number of
        #iterations, instead of using a Stopping Criterion.
        noofiterations = 100
        for iteration_n in range(noofiterations):

            ##EXPECTATION STEP
            ##Based on the centroid locations till last iteration, compute
            ##the _expected_ centroid assignments.
            #Iterate over each vector
            for vector_n in range(len(vectors)):
                vect = vectors[vector_n]
                #Compute Euclidean distance between this vector and each
                #centroid. Remember that this list cannot be named
                #'centroid_distances', since that is the input to the
                #cluster assignment node.
                distances = [sess.run(euclid_dist, feed_dict={
                    v1: vect, v2: sess.run(centroid)})
                             for centroid in centroids]
                #Now use the cluster assignment node, with the distances
                #as the input
                assignment = sess.run(cluster_assignment, feed_dict = {
                    centroid_distances: distances})
                #Now assign the value to the appropriate state variable
                sess.run(cluster_assigns[vector_n], feed_dict={
                    assignment_value: assignment})

            ##MAXIMIZATION STEP
            #Based on the expected state computed from the Expectation Step,
            #compute the locations of the centroids so as to maximize the
            #overall objective of minimizing within-cluster Sum-of-Squares
            for cluster_n in range(noofclusters):
                #Collect all the vectors assigned to this cluster
                assigned_vects = [vectors[i] for i in range(len(vectors))
                                  if sess.run(assignments[i]) == cluster_n]
                #Compute new centroid location
                new_location = sess.run(mean_op, feed_dict={
                    mean_input: array(assigned_vects)})
                #Assign value to appropriate variable
                sess.run(cent_assigns[cluster_n], feed_dict={
                    centroid_value: new_location})

        #Return centroids and assignments
        centroids = sess.run(centroids)
        assignments = sess.run(assignments)
        return centroids, assignments

```
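
If you want something to sanity-check the TensorFlow version against, here is a rough plain-NumPy sketch of the same Expectation/Maximization loop. The function name `numpy_kmeans`, the `seed` parameter, the toy two-blob data, and the choice to leave an empty cluster's centroid where it is are all my own assumptions for illustration; none of them come from the TensorFlow code above.

```python
import numpy as np


def numpy_kmeans(vectors, noofclusters, noofiterations=100, seed=0):
    """Plain-NumPy K-Means with the same E/M structure as the TF version."""
    vectors = np.asarray(vectors, dtype=np.float64)
    assert noofclusters < len(vectors)

    # Pick distinct data points as the initial centroids
    rng = np.random.RandomState(seed)
    chosen = rng.permutation(len(vectors))[:noofclusters]
    centroids = vectors[chosen].copy()
    assignments = np.zeros(len(vectors), dtype=int)

    for _ in range(noofiterations):
        # Expectation step: distance from every vector to every centroid,
        # then pick the closest centroid for each vector
        diffs = vectors[:, np.newaxis, :] - centroids[np.newaxis, :, :]
        distances = np.sqrt((diffs ** 2).sum(axis=2))
        assignments = distances.argmin(axis=1)

        # Maximization step: move each centroid to the mean of its members
        for cluster_n in range(noofclusters):
            members = vectors[assignments == cluster_n]
            if len(members) > 0:  # assumption: keep an empty cluster's centroid
                centroids[cluster_n] = members.mean(axis=0)

    return centroids, assignments


# Toy data: two small blobs of 2-D points
data_rng = np.random.RandomState(42)
blob_a = data_rng.normal(loc=0.0, scale=0.1, size=(20, 2))
blob_b = data_rng.normal(loc=5.0, scale=0.1, size=(20, 2))
data = np.vstack([blob_a, blob_b])

centroids, assignments = numpy_kmeans(data, 2)
print(centroids)
print(assignments)
```

Cluster labels are arbitrary, so when comparing against `TFKMeansCluster` look at which points end up grouped together rather than at the label numbers themselves.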

Never, ever, EVER, do something like this:

```
for i in range(100):
    x = sess.run(tf.assign(variable1, placeholder))

```

This may seem pretty harmless at first glance, but every time you construct an op (like `tf.assign` or even `tf.zeros`), you are adding a new op instance to the default graph. Instead, as shown in the code, define a particular op for each task (however specialized) just once in the code. Then, during each of your iterations, call `sess.run` over the required nodes. To check if you are crowding your graph with unnecessary ops, just print out the value of `len(graph.get_operations())` during every iteration and see if it is increasing. In fact, `sess.run` should be the only way you interact with the graph during every iteration.
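
The effect is easy to see with a toy stand-in for the graph. To be clear, nothing below is TensorFlow: `ToyGraph`, `toy_assign` and `toy_run` are hypothetical names I made up to model the behaviour described above, where constructing an op registers a new node while running one does not.

```python
class ToyGraph:
    """Stand-in for a computation graph: it just records every op created."""

    def __init__(self):
        self._ops = []

    def add_op(self, name):
        self._ops.append(name)
        return len(self._ops) - 1  # handle to the op just created

    def get_operations(self):
        return list(self._ops)


def toy_assign(graph, target, source):
    """Plays the role of an op constructor: each call adds a NEW node."""
    return graph.add_op("assign(%s, %s)" % (target, source))


def toy_run(graph, op_handle):
    """Plays the role of sess.run: uses an existing node, adds nothing."""
    return graph.get_operations()[op_handle]


# Anti-pattern: constructing the op inside the loop
crowded = ToyGraph()
for i in range(100):
    toy_run(crowded, toy_assign(crowded, "variable1", "placeholder"))
crowded_count = len(crowded.get_operations())

# Recommended: construct the op once, then only run it inside the loop
lean = ToyGraph()
assign_op = toy_assign(lean, "variable1", "placeholder")
for i in range(100):
    toy_run(lean, assign_op)
lean_count = len(lean.get_operations())

print(crowded_count, lean_count)
```

In the first loop the op count grows with every iteration; in the second it stays fixed, which is the pattern the K-Means code follows.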

As the two `sess.run` calls just before the `return` statement show, you can call `sess.run` over a list of ops/Variables to return a list of the outputs in the same order.

There are a lot of intricacies of TensorFlow that this code does not go into, such as assigning devices to nodes, Graph collections, dependencies, etc. That's partially because I am still understanding these aspects one by one. But at first glance, TensorFlow seems to be a pretty powerful and flexible way of doing AI/ML-based computations. I would personally like to explore its applications in developing dependency-based statistical metrics for data – for which I am currently using custom tree-like data structures. Let's hope this gesture by Google does lead to an increase in the applications and research in AI. Cheers!