Nelder-Mead Optimization


The Nelder-Mead method is a heuristic optimization technique, just like Genetic Algorithms or Particle Swarms. Optimization in this context refers to the problem of finding point(s) with the optimal value of an objective function in a search space. Optimal could mean a maximum or minimum value.

Whenever possible, we try to solve an optimization problem with an analytical method. For example, if you have the exact definition of f(x) as a differentiable function of x, you could use simple derivatives to compute the value of x at which f attains the optimal value. However, in many real-world problems, we don’t have such a crystal-clear notion of the function we are optimizing. A simple example would be the value of a stock market portfolio as a function of the stocks (and their weightages) in the portfolio, and time. Here, you have no way to define how the value of the overall portfolio is going to behave, and therefore analytical methods don’t make sense.

Heuristic optimization techniques try to remedy this by having a degree of randomness to them. This means that the same algorithm run with the same problem and the same initial conditions could yield different results with each run. Heuristic methods explore the search space stochastically, and keep a note of the ‘best’/most optimal points they come across on the way. Of course, the behaviour is not totally random, but guided by a set of rules.

The most important thing to remember is that heuristic methods never guarantee that they will find the most optimal solution. They give as an output the best point they came across during their search – which might have been a local optimum for all you know. But in practical scenarios, we usually concern ourselves with finding good solutions anyways- we don’t need them to be the best. In fact, finding the best solution with absolute certainty might be impossible/impractical in most cases (think of the stock market portfolio example again).

The Nelder-Mead method uses a geometrical shape called a simplex as its ‘vehicle’ of sorts to search the domain. This is why the technique is also called the Simplex search method. In layman’s terms, a simplex is the n-dimensional version of a ‘triangle’.

For 1 dimension, its a line.


For 2 dimensions, its a triangle.


For 3 dimensions, its a tetrahedron.


And so on…

The shape doesn’t have to be symmetrical/equilateral in any way. Note that an n-dimensional simplex has (n+1) vertexes.

The Algorithm

Nelder-Mead starts off with a randomly-generated simplex (We will discuss how, later). At every iteration, it proceeds to reshape/move this simplex, one vertex at a time, towards an optimal region in the search space. During each step, it basically tries out one or a few modifications to the current simplex, and chooses one that shifts it towards a ‘better’ region of the domain. In an ideal case, the last few iterations of this algorithm would involve the simplex shrinking inwards towards the best point inside it. At the end, the vertex of the simplex that yields that most optimal objective value, is returned.

The following video of a 2-D Nelder-Mead optimization should give you an intuitive understanding of what happens:

How does it work?

Lets assume we are dealing with an n-dimensional space. The current simplex consists of the following n + 1 points: x_1, x_2, …, x_{(n + 1)}. The function we are trying to optimize(maximize) is f(x). The algorithm takes the following steps during every iteration:

I. Ordering

All the points are ordered/sorted, such that the value of f for the first point is the lowest, and that for the last point is the highest. Let the indices of the first(worst), second(second worst) and last(best) points be h, s, l respectively.

II. Computing the Centroid

Consider all points except the worst (x_h). Compute their centroid(mean) as:

c = \frac{1}{n} \sum_{i \neq h} x_i

III. Transformation

This is the most important part of the whole process – transforming the simplex. Transformation takes place in one of the following ways:

i. Reflection

This is the first type of transformation that is tried out. You compute the ‘reflected’ point as:

x_r = c + \alpha (c - x_h)

\alpha is called the reflection parameter, and is usually equal to 1. x_r is essentially a point on the line joining c and x_h, but located away from x_h. This is done in an attempt to move the simplex in a direction thats away from the sub-optimal region around x_h. For a 2-D problem, heres what reflection looks like:

NelderMead_1Now, if f(x_s) < f(x_r) \leq f(x_l) – that is, that x_r is better than the second-worst point, but not better than the current best, we just replace x_h with x_r in the simplex, and move on to the next iteration.

ii. Expansion

If our luck is great and the reflected point x_r happens to be better than the current best (f(x_r) > f(x_l)), we try to explore the possibility further. In other words, we move a little bit more in the direction of x_r from c, in the hope of finding an even better solution. The expanded point is defined as:

x_e = c + \gamma (x_r - c)

\gamma is called the expansion parameter, and its value in most implementations is 2. Heres what 2-D Expansion looks like:


We then replace x_h with the better of the two points: x_e and x_r in the simplex. This is called ‘greedy optimization‘ – we have replaced the worst point with the better of the two options we had. Theres another variant called ‘greedy expansion‘, which replaces x_h with x_e, as long as f(x_e) > f(x_l). This is done irrespective of whether x_e is better than x_r. The intuition for this, comes from the fact that Expansion always leads to a bigger simplex, which means more exploration of the search space.

iii. Contraction

Suppose our reflection point was worst than x_s (i.e. the second worst point). In that case, we need to realise that the direction defined by x_r may not be the ideal one for moving. Therefore, we end up contracting our simplex.

The contraction point x_c is defined as:

x_c = c + \beta (x_h - c)

\beta is called the contraction parameter, and is usually equal to 0.5. In essence, we try to go against our exploration that we tried with x_r, instead contracting the simplex ‘inwards’. A 2-D contraction looks like this:

NelderMead_3If f(x_c) > f(x_h), which means that the contracted point is better than the current-worst, then we replace x_h with x_c in the simplex. However, if this is not the case, then we go to our last resort: the Shrink contraction.

iv. Shrink contraction

In this case, we redefine the entire simplex. We only keep the best point (x_l), and define all others with respect to it and the previous points. The jth new point, will now be defined as:

x_j = x_l + \delta (x_j - x_l)

\delta is called the shrinkage parameter, and is equal to 0.5 in most implementations. What we are essentially doing with the above definition, is moving each point in the simplex towards the current best point, in the hope of converging onto the best neighbourhood. A 2-D Shrinking transformation looks like:


It is important to note that this transformation is the most expensive of all the transformations, since we have to replace multiple (all but one) points in the simplex. However, it has been found (after multiple experiments), that the shrink transformation rarely happens in practice. As a result, many implementations of Nelder-Mead simply skip this step, doing the Contraction instead.


Termination is usually reached when:

1. A given number of iterations is reached

2. The simplex reaches some minimum ‘size’ limit.

3. The current best solution reaches some favorable limit.

The Initial Simplex

Like in the case of most heuristic techniques, theres no ‘best’ way to do the random initialization for Simplex search. However, heres one that supposedly works well (it is used in the MATLAB implementation of Nelder-Mead):

The initial point, x1, is provided by the user if he has some idea about a possible good solution. If not, its taken to be some random point. This is called the ‘guess‘ vertex. Now, remember we have n dimensions, and we need (n + 1) points – out of which 1 has already been decided. The other end points are initialized with respect to x_1, at a small distance along the direction of each dimension.

The (i + 1)th point x_{(i + 1)}, as defined using the ith dimension unit vector u_i, is given by:

x_{(i + 1)} = x_1 + h(x_1, i) * u_i

h(x_1, i) is defined as:

h(x_1, i) = 0.05 if the coefficient of u_i in the definition of x_1 is non-zero.

h(x_1, i) = 0.00025 if the coefficient of u_i in the definition of x_1 is zero.

Other Points

1. Nelder-Mead is generally used for parameter selection in Machine Learning. In essence, data scientists use the Simplex Method to figure out the optimum parameters for their models. This is mainly because this method tends to optimize the objective function fairly fast, and in an efficient way (especially in implementations where Shrinking is not used).

2. Even though Nelder-Mead tends to optimize the objective fairly fast (with few iterations), it tends to get stuck in local optima. In such cases, it may remain stuck in a sub-optimal neighbourhood for a long time, and the only solution to this is to restart the algorithm. Why this happens is fairly easy to understand. Once the simplex enters a region of local optimum, it will only keep contracting/shrinking, instead of randomly exploring other(far-away) parts of the search space. As a result, its becomes impossible for the algorithm to explore any further.

3. Scipy implementation of Nelder-Mead.


  1. Scholarpedia article
  2. Wikipedia article
  3. Jeff Heaton’s book
  4. This StackOverflow answer

Fuzzy Speciation in Genetic Algorithms using k-d Trees

You would need to know the basics of Genetic Algorithms to understand this post. AI Junkie’s tutorial is a good place to get to upto speed. You would also need to know what k-D Trees are, for which the Wiki article is a good reference.


What is speciation?

Speciation, in simple terms, is an evolutionary process by which new biological species arise. This concept is used in the implementation of evolutionary algorithms (such as GAs) as a form of optimization. To understand speciation, its necessary to understand what we mean by ‘species’.

A species is a group of organisms that – 1) possess some common prominent traits, and 2) can breed/mate with one-another. For example, consider cats and leopards. They share many characteristics, and hence exist in the same ‘family’ of animals. However, since they can’t really mate to provide a cat+leopard offspring, they don’t belong to the same species. The part 2) of the above definition may not be applicable in all cases (such as asexual plants). Nevertheless, it is what is important to us in the context of evolutionary algorithms.

Basically, a genetic algorithm (that implements speciation) will divide the population of candidates into a number of species. Essentially, each species will be a group of solutions that are allowed to crossover with each other. Candidates belonging to different species rarely mate.

Why do we need speciation?

A Genetic Algorithm chiefly deals with two genetic operations: Crossover and Mutation. Mutation works on a single candidate solution by perturbing it in some small way. Crossover, on the other hand, takes two candidate ‘parents’ (selected based on their fitness) and produces offspring(s) that share parts of their genome with the parents. This seems good enough, but it has some serious pitfalls.

Many times, crossover has a pretty high rate of failure (that is, it produces very inferior solutions). The reason for this, in simple terms, is: though both parents may individually have very high levels of fitness, the result of combining random parts of their genome together can be sub-optimal. This tends to happen when the parents come from distant parts of the search space, causing the offspring to lie in some entirely new region that has no optima in the vicinity. For example, imagine you are trying to evolve the fastest mammal. During a crossover, you select a deer (a fast mammal), and a cheetah(the fastest indeed). In this case, even if you could breed their offspring, it may have a random mixture of the anatomical features of its parents thats disastrous from the point of view of speed. In other words, mating fit parents does not guarantee a fit offspring, if the properties of the parents are vastly different.

In cases like these, speciation comes to the rescue. By restricting mating between greatly dissimilar candidates, evolution of sub-par candidates is minimized.

How is speciation implemented?

Usually, the two most common ways of implementing speciation in evolutionary algorithms are: Threshold Speciation and K-Means/Medoids clustering. Both are explained pretty well in Jeff Heaton’s book on Nature-Inspired algorithms in Artificial Intelligence.

The key feature of speciation is the fact that it prevents mating among candidates with highly differing characteristics. This can be thought of as promoting breeding between candidates that lie close to each other in the search-space. To achieve the same effect, I have been using k-d trees to perform nearest-neighbour searches to provide a form of speciation to my implementation of GAs. Its pretty simple(and intuitive) in terms of its approach, doesn’t degrade performance by too much and has been providing good results so far. So, here’s a basic outline of it for you all.

Implementation using k-D Trees

I will outline the algorithm first, and point out the nuances later.

First off, you must have a way of computing the ‘distance’/dissimilarity between candidate solutions. If your genome has a fixed length, you can use Euclidean distance.

Extra attributes required: 1. Starting Temperature (T), and 2. Neighbourhood Value(V). T must be in the range (0. 1). The higher this value, the greater is the tendency to neglect speciation (especially in the earlier iterations). The Neighbourhood Value V denotes the number of nearest candidates that will be considered in searches.


I) The following selection process is repeated in each iteration until the new population size does not reach the required value:

Initial Values: Temperature t = T

i. Pick a random candidate parent based on fitness (using something like Roulette-Wheel selection, or Tournament selection)

ii.a. Pick a random number in (0, 1). If it is lesser than the current value of t, pick another parent randomly using the same selection process used in step i. If not, goto step ii.b.

ii.b. Using the k-D Tree trained over candidates from the current iteration, find the V nearest candidates to the parent selected in step i. Choose one randomly (using the process employed in step i), based on fitness.

iii. Use crossover with parents produced in steps i and ii. Mutate as/when appropriate. Add offspring to the pool of candidates of the next generation.

II) Once the new population grows to the required size, a new k-D Tree is generated using the set of all new candidates for the next iteration.

III) Decrease t for the next iteration.


1) First of all, you need to have a way to compute dissimilarity between candidate solutions (apart from their fitness). As mentioned, if your genome size is constant, you could always use good-old Euclidean distance. If not, you might have to be creative. If you are using Scikit-Learn’s implementation of k-D Trees like me, you could check out the metrics it supports by doing kd_tree.valid_metrics over an instance.

2) Because the selection of the first parent in step i. depends only by the global measure for fitness, parts of the search space that generate good solutions will have higher numbers of candidates in subsequent iterations.

3) Unlike using K-Medoids for speciation, you don’t need to have a strict definition of the species that a given candidate belongs to. Nearest-neighbour searching ensures that a candidate usually mates only with other candidates that are similar to it.

4) As you might have understood, the Temperature parameter only adds some amount of exploratory tendencies to the algorithm. The higher the temperature, the greater is the tendency to ignore speciation. The decrease in t could most likely be done linearly from the user-provided starting point T, to a pre-defined minimum value(like 0.05), across iterations.

Effects on Performance of the GA

As mentioned in the discussion here, it doesn’t make much sense to analytically define the time-complexity of GAs. However, let me give you a rough idea of how things will change. Assume the total size of the population is n. First off, Roulette-Wheel selection (done in step i) takes O(n) time if done in a naive way (going through all candidates in some order one by one). If you do it using the stochastic acceptance method, it takes O(1) time. Usually, this selection will be done twice during every crossover. In case of the algorithm outlined above, for approximately (1-t) fraction of the second crossovers, the Roulette-wheel selection will be replaced by an O(log(n)) search using the k-D Tree and a much smaller (almost O(1)) Roulette Wheel selection amongst the nearest V candidates. Hence, unless your population size is very, very large, this won’t affect time that drastically.

The only significant increase in time will be seen as a result of the k-D Tree learning at the end of each iteration, which would add an O(nlog(n)) time to the time required for generating the next-gen pool of candidates (which is O(n^2) for naive implementations anyways). Since O(n^2 + nlog n) is equivalent to O(n^2), the performance decrease isn’t that significant, provided the size of your population isn’t really high. Even if you were using a highly efficient selection procedure during crossovers, for reasonable population sizes, your total run-time would only be multiplied by a factor less than 2 (mainly because crossover and mutation by themselves take some time too).

Thats all for now! Keep reading and cheers 🙂

Using Circular Self-Organizing Maps to solve the Symmetric TSP


1. Knowing what the Traveling Salesman Problem (TSP) is. The wiki article is a good place to start. The symmetrical form of the problem is where the distance from one city to another is the same in both directions.


2. A decent understanding of what Kohonen/Self-Organizing Maps are. This blog post will point you in the right direction to learn more.

After I wrote my former blog post on implementing SOMs with TensorFlow, someone on Reddit mentioned that there are ways to solve TSPs using them. When I did some research, it appeared that there have been a few attempts to do so. Heres one, and heres another. So in my spare time, I decided to try and have a crack at it myself.

I will describe my approach (along with the code) in a short while, but let me give a disclaimer first. This post in no way claims to have found the ‘best’ way to solve TSPs (ACOs are pretty much the state of the art). It just describes my attempt, and the results I obtained (which were interesting, with respect to the methodology implemented).

I obviously make a few assumptions, and here they are:

1. Every city has a corresponding vector available, denoting its coordinates/location in a 2D/3D space. This source presents many TSP benchmark data-sets in this format. I also tried assigning every city a vector where every dimension is its distance from all the cities(including itself). This gives pretty sub-par results, so I decided not to pursue that. Maybe you could use a dimensionality reduction technique like PCA on these ‘distance’ vectors and get good results.

2. Because of the above point, my method is pretty much restricted to solving the symmetric form of a TSP (atleast the way I have done it).


The Method

Suppose the total number of neurons is N. The total number of cities in question is C. After some trial and error, I set N = 2C. It can be higher/lower as well. The lower the number, the less optimal is the understanding of the solution space. And as you go higher, after a certain point, there seems to be no improvement in the output (but your execution time increases many-fold).

The individual neurons are provided locations around a circle, as follows:

l_i = ( cos(\frac{2i\pi}{N}), sin(\frac{2i\pi}{N}) )

where l_i is the location vector for the i th neuron in a 2-D space. i goes from 0 to N - 1. It is important to not confuse this space with the space in which the cities lie (even though both are 2-dimensional). Lets call one the Neuron-Space, and the other, the City-Space.

Because of the circular nature of this SOM, we inherently have a closed loop-structure to our framework (as is needed by the TSP solution), where neurons in proximity of each other in the Neuron-Space lie near-by in the City-Space. Also, do note how every neuron essentially corresponds to an angle ( \frac{2i\pi}{N}) around the circle. This point would be useful soon.

The initialization of the neuron vectors in City-Space is done by using the mean and standard deviation of the dimensions of the city vectors in City-Space. This is because I observed that the scale of all the dimensions in the city-vectors is pretty much the same. If it isn’t, you might have to resort to building a mean vector and covariance matrix. For training, I use \alpha = 0.3 and \sigma = 0.3 as parameters for the SOM. Putting a constant value of \sigma ensures that the same proportion of neurons around the Best-Matching Unit get affected everytime, irrespective of the total number of neurons being used.

Once the SOM trained, we can say we have the rough ‘skeleton’ for a good route in the City-Space. But how do you convert this skeleton to the appropriate route? By using those angles I just talked about. Consider a city c. Lets say the k nearest neurons to it, are n_1, n_2, ..., n_k.

So, we can construct a vector for every city in Neuron-Space, using this simple formula:

l_c = \sum_k \frac{1}{d_{c, n_k}} l_c

where d_{c, n_k} is the distance between city c and neuron n_k in City-Space. Using this vector and some pretty simple (inverse) trigonometry, you can assign the city an angle around the circle on which the SOM-neurons themselves lie in Neuron-Space. Now all you need to do to come up with an approximately good solution, is just sort the cities with respect to the angles they are assigned. Voila!

Ofcourse, the solution that this algorithm gives you would most likely be in the neighbourhood of a slightly better one, so I use a standard TSP local search heuristic like 3-opt over the solution obtained from my algorithm. What 3-opt essentially does, is change around the solution obtained from some other algorithm over so slightly in some pre-defined ways, in the hope of hitting on a better solution. 3-opt is better than 2-opt for this algorithm. I dint try higher-order k-opt searches.

One interesting possibility about this algorithm, is the fact that it could provide great solutions to solving TSPs for different target cities, without retraining the SOM– as long as they lie in the same City-Space. As long as the set of cities you try to solve for follows (atleast approximately) the trends shown by the dataset used for training, all you would need to do is compute their respective angles (which does not involve re-training), and just apply 3-opt. In short, you could use the same SOM, once trained, to solve for multiple subsets of cities.

The Implementation

This code builds on my Tensorflow-based SOM code. Its pretty rough around the edges, especially the 3-opt search. It was meant just as a proof-of-concept, so feel free to hack around if you like.

import tensorflow as tf
import numpy as np

def _distance(tsp_solution, distance_matrix):
    Total distance for a solution.
    total_distance = 0
    for i in range(len(tsp_solution)-1):
        total_distance += (distance_matrix[tsp_solution[i]][
            tsp_solution[i + 1]])
    total_distance += (distance_matrix[tsp_solution[0]][
    return total_distance

def _3_opt(route, distance_matrix):
    Uses 3-opt to find a better solution in the neighbourhood.

    current_dist = _distance(route, distance_matrix)
    #Find three connections to break
    for i in range(len(route) - 1):
        for j in range(i + 2, len(route) - 1):
            for k in range(j + 2, len(route) - 1):
                #Now reconnect the graph in every other way possible,
                #and see if any gives a better solution
                #Way 1
                temp = route[:]
                temp[j+1:k+1] = reversed(route[j+1:k+1])
                dist = _distance(temp, distance_matrix)
                if dist < current_dist:
                    route = temp
                    current_dist = dist
                #Way 2
                temp = route[:]
                temp[i+1:j+1] = reversed(route[i+1:j+1])
                dist = _distance(temp, distance_matrix)
                if dist < current_dist:
                    route = temp
                    current_dist = dist
                #Way 3
                temp = route[:]
                temp[i+1:j+1], temp[j+1:k+1] = (
                    reversed(route[i+1:j+1]), reversed(route[j+1:k+1]))
                dist = _distance(temp, distance_matrix)
                if dist < current_dist:
                    route = temp
                    current_dist = dist
                #Way 4
                temp = (route[:i+1] + route[j+1:k+1] +
                        route[i+1:j+1] + route[k+1:])
                dist = _distance(temp, distance_matrix)
                if dist < current_dist:
                    route = temp
                    current_dist = dist
                #Way 5
                temp = route[:i+1] + route[j+1:k+1]
                temp += reversed(route[i+1:j+1])
                temp += route[k+1:]
                dist = _distance(temp, distance_matrix)
                if dist < current_dist:
                    route = temp
                    current_dist = dist
                #Way 6
                temp = route[:i+1]
                temp += reversed(route[j+1:k+1])
                temp += route[i+1:j+1]
                temp += route[k+1:]
                dist = _distance(temp, distance_matrix)
                if dist < current_dist:
                    route = temp
                    current_dist = dist
                #Way 7
                temp = route[:i+1]
                temp += reversed(route[j+1:k+1])
                temp += reversed(route[i+1:j+1])
                temp += route[k+1:]
                dist = _distance(temp, distance_matrix)
                if dist < current_dist:
                    route = temp
                    current_dist = dist
    return route

class TSPSolver(object):
    Uses a Circular Self-Organizing Map with a Gaussian Neighbourhood
    function and linearly decreasing learning rate, to solve TSP.
    #To check if the SOM has been trained
    _trained = False
    def __init__(self, input_vects, n_iterations=200):
        Initializes all necessary components of the TensorFlow

        'input_vects' should be a list of 1-D NumPy vectors denoting
        location vectors for the cities.
        'n_iterations' is an integer denoting the number of iterations
        that would be run while training the SOM.
        #Assign required variables first
        self._n = len(input_vects)
        dim = len(input_vects[0])
        self._n_centroids = self._n*3
        alpha = 0.3
        sigma = 0.3
        self._n_iterations = abs(int(n_iterations))

        #For initialization of the centroid(neuron) vectors
        init_values = []
        for vect in input_vects:
        init_mean = np.mean(init_values)
        init_dev = np.sqrt(np.var(init_values))
        self._graph = tf.Graph()
        with self._graph.as_default():
            #Randomly initialized weightage vectors for all neurons,
            #stored together as a matrix Variable of size [self._n, dim]
            self._weightage_vects = tf.Variable(tf.random_normal(
                [self._n_centroids, dim], init_mean, init_dev))
            #Matrix of size [self._n, 2] for SOM grid locations
            #of neurons
            self._location_vects = tf.constant(np.array(
            #We need to assign them as attributes to self, since they
            #will be fed in during training
            #The training vector
            self._vect_input = tf.placeholder("float", [dim])
            #Iteration number
            self._iter_input = tf.placeholder("float")
            #Only the final, 'root' training op needs to be assigned as
            #an attribute to self, since all the rest will be executed
            #automatically during training
            #To compute the Best Matching Unit given a vector
            #Basically calculates the Euclidean distance between every
            #neuron's weightage vector and the input, and returns the
            #index of the neuron which gives the least value
            bmu_index = tf.argmin(tf.sqrt(tf.reduce_sum(
                tf.pow(tf.sub(self._weightage_vects, tf.pack(
                    [self._vect_input for i in range(
                        self._n_centroids)])), 2), 1)),
            #This will extract the location of the BMU based on the BMU's
            slice_input = tf.pad(tf.reshape(bmu_index, [1]),
                                 np.array([[0, 1]]))
            bmu_loc = tf.reshape(tf.slice(self._location_vects, slice_input,
                                          tf.constant(np.array([1, 2]))),
            #To compute the alpha and sigma values based on iteration
            learning_rate_op = tf.sub(1.0, tf.div(self._iter_input,
            _alpha_op = tf.mul(alpha, learning_rate_op)
            _sigma_op = tf.mul(sigma, learning_rate_op)
            #Construct the op that will generate a vector with learning
            #rates for all neurons, based on iteration number and location
            #wrt BMU.
            bmu_distance_squares = tf.reduce_sum(tf.pow(tf.sub(
                self._location_vects, tf.pack(
                    [bmu_loc for i in range(self._n_centroids)])), 2), 1)
            neighbourhood_func = tf.exp(tf.neg(tf.div(tf.cast(
                bmu_distance_squares, "float32"), tf.pow(_sigma_op, 2))))
            learning_rate_op = tf.mul(_alpha_op, neighbourhood_func)
            #Finally, the op that will use learning_rate_op to update
            #the weightage vectors of all neurons based on a particular
            learning_rate_multiplier = tf.pack([tf.tile(tf.slice(
                learning_rate_op, np.array([i]), np.array([1])), [dim])
                                               for i in range(
            weightage_delta = tf.mul(
                tf.sub(tf.pack([self._vect_input for i in range(
            new_weightages_op = tf.add(self._weightage_vects,
            self._training_op = tf.assign(self._weightage_vects,
            self._sess = tf.Session()
            init_op = tf.initialize_all_variables()

            #Train the Solver

    def solution(self):
        Returns the TSP solution order as a list, and the total
        Euclidean distance with respect to the input vectors.
        if not self._trained:
            raise ValueError("SOM not trained yet")
        return self._sol, self._sol_dist

    def _neuron_locations(cls, n_centroids):
        Yields one by one the 2-D locations of the individual neurons
        in the SOM.

        for i in range(n_centroids):
            yield np.array([np.cos(i*2*np.pi/float(n_centroids)),
    def train(self, input_vects):
        Trains the SOM.
        'input_vects' should be a list of 1-D NumPy arrays with
        dimensionality as provided during initialization of this SOM.
        The points in the solution are enumerated as they are ordered
        in 'input_vects'.
        Current weightage vectors for all neurons(initially random) are
        taken as starting conditions for training.
        #Training iterations
        for iter_no in range(self._n_iterations):
            #Train with each vector one by one
            for input_vect in input_vects:
                               feed_dict={self._vect_input: input_vect,
                                          self._iter_input: iter_no})
        self._input_vects = input_vects
        self._trained = True
    def _compute_solution(self):
        Computes the solution to the TSP with respect to the input
        #TODO: All nearest-neighbour searches could be speeded up
        #significantly with a KD-Tree/Ball Tree.

        #Get the centroid data 
        centroid_vects = list(
        centroid_locations = list(

        #Distance matrix mapping input point number to list of distances
        #from centroids
        distances = {}

        for point in range(self._n):
            distances[point] = []
            for centroid in range(self._n_centroids):
                    np.linalg.norm(centroid_vects[centroid] -

        #Distance matrix mapping input point number to list of distances
        #from other input points
        point_distances = {}

        for point in range(self._n):
            point_distances[point] = []
            for other_point in range(self._n):
                    np.linalg.norm(self._input_vects[other_point] -

        #Compute angle with respect to each city(point)
        point_angles = {}
        for point in range(self._n):
            total_vect = 0
            cents = [j for j in range(self._n_centroids)]
            cents.sort(key=lambda x: distances[point][x])
            for centroid in cents[:2]:
                total_vect += (1.0/(distances[point][centroid]) *
            total_vect = total_vect/np.linalg.norm(total_vect)
            if total_vect[0] > 0 and total_vect[1] > 0:
                angle = np.arcsin(total_vect[1])
            elif total_vect[0] < 0 and total_vect[1] > 0:
                angle = np.arccos(total_vect[0])
            elif total_vect[0] < 0 and total_vect[1] < 0:
                angle = np.pi - np.arcsin(total_vect[1])
                angle = 2*np.pi - np.arccos(total_vect[0])
            point_angles[point] = angle

        #Find the rough solution
        tsp_solution = [i for i in range(self._n)]
        tsp_solution.sort(key=lambda x: point_angles[x])

        tsp_solution = _3_opt(tsp_solution, point_distances)

        #Compute the total distance for the solution
        total_distance = _distance(tsp_solution, point_distances)

        self._sol = tsp_solution
        self._sol_dist = total_distance


Each of these problem datasets can be found at the source here. All the presented results are over 10 runs. The max runtime was experienced with krA100, around one minute.

Berlin-52: Best Error: 0.0%, Average Error: 4.1%.

krA100: Best Error: 1.6%, Average Error: 2.7%.

eli76: Best Error: 2%, Average Error: 4.4%.

These results aren’t ‘close’ to those provided by ACOs, but they are practically feasible, and show a good promise (Oh now I just sound like I am writing a research paper.) Plus, the algorithm isn’t a time-hogger either (judging by runtimes on my laptop).

Anyways. Thanks for reading! Do drop a comment if you think of an improvement over this work.

Backpropagation for dummies

Backpropagation is a method of training an Artificial Neural Network. If you are reading this post, you already have an idea of what an ANN is. However, lets take a look at the fundamental component of an ANN- the artificial neuron.


The figure shows the working of the ith neuron (lets call it N_i) in an ANN. Its output (also called its activation) is a_i. The jth neuron N_j provides one of the inputs to N_i.

How does N_i produce its own activation?

1. N_i has stored in itself, a weightage assigned to each of its inputs. Lets say that the weightage assigned by N_i to N_j is w_{i, j}. As the first step, N_i computes a weighted sum of all its inputs. Lets call it in_i. Therefore,

in_i = \sum_{k \in Inputs(N_i)} w_{i, k}a_k … (1)

where Inputs(N_i) corresponds to all neurons that provide an input to N_i.

2. This sum, called the input function, is then passed on to N_i‘s activation function. It is called so, because it computes the activation that will be provided by N_i to the units it in turn acts as an input to. Therefore,

a_i = g(in_i) … (2)

Usually, the activation functions of all neurons in an ANN are the same, so we just write g instead of g_i.

There are many options for g, such as linear functions, rectified linear functions, hyperbolic tan, etc. For the sake of this tutorial, we will use the sigmoid function (I’ll explain why later). So we have

g(x) = \frac{1}{1 + e^{-x}} … (3)


Thus, every neuron in an ANN – except the ones in the lowest-level input layer, and bias neurons – takes in the activation of all its inputs, and provides its own activation as an output.

Once a neural network has been designed, it is the job of the training process to ensure that the individual weightages w_{i, j} given by each neuron to each of its inputs is set just right, so that the whole network gives the right output. Backpropagation is one of the ways to optimize those weights.

Whenever we talk of optimization, it is necessary to properly define what we are optimizing and how we are doing it.

What are we trying to optimize?

The error function. This function defines how much the output of our model (in this case an ANN) defers from the required output. For the sake of this tutorial, we will use the Mean-Squared Error(MSE) function. For example, let N_o be one of the output neurons of our ANN. For a given training sample, let the expected output be t_o. However, N_o provides an output equal to its activation value, that is a_o. Therefore, the error as defined by MSE would be

E = \frac{1}{2} (t_o - a_o)^2 … (4)

Since a_o by its definition depends on the weightages given by N_o to its inputs, the ultimate valye of E depends on those weightages too.

It is important to remember that every training sample will result in a different error value. It is backpropagation’s goal to minimize (bring as close to zero as possible) the value of E for all training samples on an average. The way this is done, as mentioned before, is by fine-tuning the values of the weights w_{i, j}.

But how does backpropagation fine tune these weightage values? By using a technique called Gradient Descent.

What is Gradient Descent?

Consider a function as follows:

f(x) = 3x + 4

The gradient of the function f with respect to the variable x defines how much the value of f will change with a unit increase/decrease in the value of x. For the above example, the answer to that is 3. Basically, you are differentiating f with respect to x. Generally speaking, a function may be dependent on a variety of parameters, so you take a partial derivative when computing the gradient.

Consider you are on a hill like the one shown in the image below:

Gradient_descent_1The red regions correspond to places of higher function value, while the blue regions correspond to low values. You want to reach the lowest point possible. If you are standing on the red hill, and you can only see a small part of the terrain around you, what will you do? You will naturally take steps in the direction in which the terrain slopes down. With each step you take, you will scan your immediate neighbourhood (which is visible to you), and go in the direction that shows the steepest descent.

In mathematical terms, this ‘direction’ is nothing but the gradient of your height (the target function you are trying to minimize), with respect to your location (x, y) (which are the parameters you are fine tuning). The overall direction in which you move, will be dependent on the gradients with respect to both dimensions, i.e. \frac{\partial height}{\partial x} and \frac{\partial height}{\partial y} . That’s pretty much what gradient descent does.

Its called so, because you are descending from a higher value of the target function to a lower value, by following the direction that corresponds to the lowest gradient (and consequently the steepest decrease in function value).

One thing to note is that you would always want to move in the direction shown by the negation of the gradient. Consider the above example of f(x) = 3x + 4 . The gradient here is 3, which is positive. So, with every unit increase in x, f(x) would increase by 3. Therefore, to reduce f(x), you would decrease the value of x . The opposite is true in scenarios where you want to maximize the target function. In those cases, you move in the direction of the gradient itself.

Coming back to the context of ANN training, backpropagation reduces the error function E by taking help of the gradients with respect to individual weights, \frac{\partial E}{\partial w_{i, j}} .

How does backpropagation employ gradient descent?

First off, it is necessary to keep in mind that backpropagation tries to optimize the inter-neuron weights, and not the neurons themselves. Their functionality is pretty much decided when you decide the activation function.

Secondly, the error function value E depends on the current values of the weights and the training sample in question, apart from the definition of the activation function. So to compute the error with respect to a training sample, you first feed it as an input to the network. You will then receive the appropriate output(s) from the output neuron(s). This part is known as  ‘forward propagation‘, since you are going from the input layer to the output layer (through the hidden layers if any). If you have multiple output neurons, the error values corresponding to each will affect the weights appropriately.

Then, you use the network output(s) along with the expected value(s) to calculate the errors using Equation 4 given above. Once the error values are known, the weights are updated using the gradients. This updating occurs first for the out-most weights, and then for lower layers. The intuition behind this, is the fact that the errors with respect to the outmost neurons are known first, and you derive the errors of the inner neurons from them. Each layer’s errors are calculated from the errors of the neurons they provide input to. That is why the second phase is called ‘backward propagation‘, or back-propagation.

So as a first step, lets try to understand how we will calculate the gradient \frac{\partial E}{\partial w_{i, j}} for a given training sample. Remember that  w_{i, j} is the weightage given by N_i to N_j.

Using the chain rule in Leibniz’s form,

\frac{\partial E}{\partial w_{i, j}} = \frac{\partial E}{\partial in_i} \frac{\partial in_i}{\partial w_{i, j}} … (5)

From Equations 1 and 5, and using basic differentiation,

\frac{\partial E}{\partial w_{i, j}} = a_j \frac{\partial E}{\partial in_i} … (6)

From the basic definition of the activation function (and the appropriate current weight values), you already know the value of a_j. Now consider the second term in the above product. Using the chain rule again on it, we get

\frac{\partial E}{\partial in_i} = \frac{\partial E}{\partial a_i} \frac{\partial a_i}{\partial in_i} … (7)

This is where the whole advantage of using the sigmoid activation function comes into play. The derivative of a sigmoid function with respect to its input, corresponds to a nice expression. Given equation 3, you can easily verify that

\frac{dg}{dx} = g(x)(1 - g(x))  … (8)

From Equations 7 and 8,

\frac{\partial E}{\partial in_i} = a_i (1 - a_i) \frac{\partial E}{\partial a_i} … (9)

Putting the information in Equations 6 and 9 together,

\frac{\partial E}{\partial w_{i, j}} = a_j a_i (1 - a_i) \frac{\partial E}{\partial a_i} … (10)

Look at the last term. There are two possible cases for N_i (N_i cannot be an input neuron, since it is receiving input from another neuron) :

I. N_i is an output neuron. In this case, the second term on the RHS is pretty simple, from Equation 4-

\frac{\partial E}{\partial a_i} = - (t_i - a_i) … (11)

where t_i is the expected output from N_i (for the current training sample).

2. N_i is an inner neuron. Consider all neurons N_k, for whom N_i acts an an input. Since we are propagating backward, you can therefore assume that we know the values \frac{\partial E}{\partial a_k} for all of them. In fact, since the output of N_i affects the outputs of all these N_k s, the expression for  \frac{\partial E}{\partial a_i} will involve all of them.

Using simple chain rule,

\frac{\partial E}{\partial a_i} = \sum_{k} \frac{\partial in_k}{\partial a_i} \frac{\partial E}{\partial in_k} … (12)

From Equations 12, 1 and 9,

\frac{\partial E}{\partial a_i} = \sum_{k} w_{k, i} a_k (1 - a_k) \frac{\partial E}{\partial a_k} … (13)

Equations 10, 11 and 13 pretty much define the gradient values \frac{\partial E}{\partial w_{i, j}} for all weights in a given ANN. So how are the weights actually updated?

Using the following rule:

The change caused by a given training sample, at time instance ‘t’, for the weight w_{i, j} is given by

\Delta w_{i, j}^{t} = - \epsilon \frac{\partial E}{\partial w_{i, j}} + \alpha \Delta w_{i, j}^{t - 1}

Remember me mentioning that we always need to move in the direction of the gradient’s negation? Thats why the minus sign for the first term. The value \epsilon is called the learning rate. We should never be too hasty in changing the weight’s value too drastically based on the gradient with respect to one training sample. Therefore, the gradient value is multiplied by the learning rate (usually well below 0.5) before causing the required change in the weightage value. If the learning rate is too low, the algorithm takes too long to train. On the other hand, if its too high, the algorithm keeps ‘bouncing around’ in the search space. Imagine you are running towards point B from point A. But you run so fast that you cant stop in time at point B, but you actually end up far ahead of it. Then you again run in the opposite direction towards B, once again crossing it and stopping ahead because of your high speed. Thats pretty much what happens if your learning rate is too large.

A variant of backpropagation, called resilient propagation(RPROP) improves on this classic method by having custom deltas with respect to each weightage in a network. It only uses the gradients as references to determine which direction to move in. Therefore, you don’t have to provide the learning rate as a parameter to resilient propagation.

Now look at the second term in the RHS above. The value \alpha is called the momentum. Sometimes, gradient descent tends to get stuck in local optima in the search space. A local optimum is a point where the value is low compared to the region around it, but not all of the domain overall. The momentum term causes the current weight update to be dependent on the last weight update (and therefore not be drastically different), thus preventing getting stuck in local optima in most cases. If it is indeed a global optimum, the search will eventually come back to the required point as the momentum will go on reducing. Nesterov Momentum is another way of optimizing the use of momentum in gradient descent.

How frequently are the weights updated?

There are 3 answers to this:

1. Online Mode

In this mode, the weights are updated after every training sample. In other words, the forward propagation as well as the backward propagation phases are run for each training sample, every single time. This mode is appropriate if you are receiving your training data sequentially. However, if your training set is large, this mode of training can be pretty time-consuming. Moreover, it is not appropriate to update weights after every sample (in most cases), due to the possible presence of outliers.

2. Batch Mode

In this mode, the weights are updated after accumulating gradients with respect to all samples in the training set. In other words, the forward propagation is run with respect to all samples, and the backward phase is run just once with respect to the accumulated results of the entire batch. This is done for multiple iterations (obviously). The batch mode of training is thus more resistant to variance in the training dataset.

3. Stochastic Mode

The stochastic mode adds randomization to the mix. In the mini-batch method of stochastic learning, small sets of samples are chosen randomly from the dataset, and training is done in batch mode over these mini-batches. Stochastic mode with mini-batches provides a nice trade-off between the pros and cons of online and batch modes of training. Moreover, since you are randomly choosing the mini batches (instead of organizing your dataset into pre-defined subsets), you avoid getting stuck in local optima. This is mainly because you are providing weight updates with respect to a potentially different subset of training samples during every iteration.

EDIT 14/01/2016

Xavier’s initialization

Usually, the weights in a Neural Network are initialized to random values prior to training. This is okay if your network is shallow. But for deep learning applications, where the number of hidden layers may be 3 or more, a wrong selection of initial weights can cause convergence to sub-optimal models.

For example, assume that you are using a sigmoid activation function and your initial weights are pretty small (in the 0-1 range). If you manually calculate the activation values in a neural network across 1-2 layers, you will realize that almost all neurons in the topmost layer give activations around 0.5 (the mean value for the sigmoid function). This phenomenon is called ‘saturation’. Its described well in this paper by Glorot and Bengio.

The solution to this problem, is Xavier’s initialization.  This technique tries to optimize the generation of initial weights, by applying simple statistics over the number of input and output connections corresponding to a neuron. A good intuitive explanation of the method is given at Andy’s blog. As a note to my future self, here’s the TensorFlow implementation of Xavier’s algorithm.

Self-Organizing Maps with Google’s TensorFlow

[This post assumes that you know the basics of Google’s TensorFlow library. If you don’t, have a look at my earlier post to get started.]


A Self-Organizing Map, or SOM, falls under the rare domain of unsupervised learning in Neural Networks. Its essentially a grid of neurons, each denoting one cluster learned during training. Traditionally speaking, there is no concept of neuron ‘locations’ in ANNs. However, in an SOM, each neuron has a location, and neurons that lie close to each other represent clusters with similar properties. Each neuron has a weightage vector, which is equal to the centroid of its particular cluster.

AI-Junkie’s post does a great job of explaining how an SOM is trained, so I won’t re-invent the wheel.

The Code

Here’s my code for a 2-D version of an SOM. Its written with TensorFlow as its core training architecture: (Its heavily commented, so look at the inline docs if you want to hack/dig around)

import tensorflow as tf
import numpy as np

class SOM(object):
    2-D Self-Organizing Map with Gaussian Neighbourhood function
    and linearly decreasing learning rate.

    #To check if the SOM has been trained
    _trained = False

    def __init__(self, m, n, dim, n_iterations=100, alpha=None, sigma=None):
        Initializes all necessary components of the TensorFlow

        m X n are the dimensions of the SOM. 'n_iterations' should
        should be an integer denoting the number of iterations undergone
        while training.
        'dim' is the dimensionality of the training inputs.
        'alpha' is a number denoting the initial time(iteration no)-based
        learning rate. Default value is 0.3
        'sigma' is the the initial neighbourhood value, denoting
        the radius of influence of the BMU while training. By default, its
        taken to be half of max(m, n).

        #Assign required variables first
        self._m = m
        self._n = n
        if alpha is None:
            alpha = 0.3
            alpha = float(alpha)
        if sigma is None:
            sigma = max(m, n) / 2.0
            sigma = float(sigma)
        self._n_iterations = abs(int(n_iterations))

        self._graph = tf.Graph()

        with self._graph.as_default():


            #Randomly initialized weightage vectors for all neurons,
            #stored together as a matrix Variable of size [m*n, dim]
            self._weightage_vects = tf.Variable(tf.random_normal(
                [m*n, dim]))

            #Matrix of size [m*n, 2] for SOM grid locations
            #of neurons
            self._location_vects = tf.constant(np.array(
                list(self._neuron_locations(m, n))))

            #We need to assign them as attributes to self, since they
            #will be fed in during training

            #The training vector
            self._vect_input = tf.placeholder("float", [dim])
            #Iteration number
            self._iter_input = tf.placeholder("float")

            #Only the final, 'root' training op needs to be assigned as
            #an attribute to self, since all the rest will be executed
            #automatically during training

            #To compute the Best Matching Unit given a vector
            #Basically calculates the Euclidean distance between every
            #neuron's weightage vector and the input, and returns the
            #index of the neuron which gives the least value
            bmu_index = tf.argmin(tf.sqrt(tf.reduce_sum(
                tf.pow(tf.sub(self._weightage_vects, tf.pack(
                    [self._vect_input for i in range(m*n)])), 2), 1)),

            #This will extract the location of the BMU based on the BMU's
            slice_input = tf.pad(tf.reshape(bmu_index, [1]),
                                 np.array([[0, 1]]))
            bmu_loc = tf.reshape(tf.slice(self._location_vects, slice_input,
                                          tf.constant(np.array([1, 2]))),

            #To compute the alpha and sigma values based on iteration
            learning_rate_op = tf.sub(1.0, tf.div(self._iter_input,
            _alpha_op = tf.mul(alpha, learning_rate_op)
            _sigma_op = tf.mul(sigma, learning_rate_op)

            #Construct the op that will generate a vector with learning
            #rates for all neurons, based on iteration number and location
            #wrt BMU.
            bmu_distance_squares = tf.reduce_sum(tf.pow(tf.sub(
                self._location_vects, tf.pack(
                    [bmu_loc for i in range(m*n)])), 2), 1)
            neighbourhood_func = tf.exp(tf.neg(tf.div(tf.cast(
                bmu_distance_squares, "float32"), tf.pow(_sigma_op, 2))))
            learning_rate_op = tf.mul(_alpha_op, neighbourhood_func)

            #Finally, the op that will use learning_rate_op to update
            #the weightage vectors of all neurons based on a particular
            learning_rate_multiplier = tf.pack([tf.tile(tf.slice(
                learning_rate_op, np.array([i]), np.array([1])), [dim])
                                               for i in range(m*n)])
            weightage_delta = tf.mul(
                tf.sub(tf.pack([self._vect_input for i in range(m*n)]),
            new_weightages_op = tf.add(self._weightage_vects,
            self._training_op = tf.assign(self._weightage_vects,

            self._sess = tf.Session()

            init_op = tf.initialize_all_variables()

    def _neuron_locations(self, m, n):
        Yields one by one the 2-D locations of the individual neurons
        in the SOM.
        #Nested iterations over both dimensions
        #to generate all 2-D locations in the map
        for i in range(m):
            for j in range(n):
                yield np.array([i, j])

    def train(self, input_vects):
        Trains the SOM.
        'input_vects' should be an iterable of 1-D NumPy arrays with
        dimensionality as provided during initialization of this SOM.
        Current weightage vectors for all neurons(initially random) are
        taken as starting conditions for training.

        #Training iterations
        for iter_no in range(self._n_iterations):
            #Train with each vector one by one
            for input_vect in input_vects:
                               feed_dict={self._vect_input: input_vect,
                                          self._iter_input: iter_no})

        #Store a centroid grid for easy retrieval later on
        centroid_grid = [[] for i in range(self._m)]
        self._weightages = list(
        self._locations = list(
        for i, loc in enumerate(self._locations):
        self._centroid_grid = centroid_grid

        self._trained = True

    def get_centroids(self):
        Returns a list of 'm' lists, with each inner list containing
        the 'n' corresponding centroid locations as 1-D NumPy arrays.
        if not self._trained:
            raise ValueError("SOM not trained yet")
        return self._centroid_grid

    def map_vects(self, input_vects):
        Maps each input vector to the relevant neuron in the SOM
        'input_vects' should be an iterable of 1-D NumPy arrays with
        dimensionality as provided during initialization of this SOM.
        Returns a list of 1-D NumPy arrays containing (row, column)
        info for each input vector(in the same order), corresponding
        to mapped neuron.

        if not self._trained:
            raise ValueError("SOM not trained yet")

        to_return = []
        for vect in input_vects:
            min_index = min([i for i in range(len(self._weightages))],
                            key=lambda x: np.linalg.norm(vect-

        return to_return

A few points about the code:

1) Since my post on K-Means Clustering, I have gotten more comfortable with matrix operations in TensorFlow. You need to be comfortable with matrices if you want to work with TensorFlow (or any data flow infrastructure for that matter, even SciPy). You can code pretty much any logic or operational flow with TensorFlow, you just need to be able to build up complex functionality from basic components(ops), and structure the flow of data(tensors/variables) well.

2) It took quite a while for me to build the whole graph in such a way that the entire training functionality could be enclosed in a single op. This op is called during each iteration, for every vector, during training. Such an implementation is more in line with TensorFlow’s way of doing things, than my previous attempt with clustering.

3) I have used a 2-D grid for the SOM, you can use any geometry you wish. You would just have to modify the _neuron_locations method appropriately, and also the method that returns the centroid outputs. You could return a dict that maps neuron location to the corresponding cluster centroid.

4) To keep things simple, I haven’t provided for online training. You could do that by having bounds for the learning rate(s).

Sample Usage

I have used PyMVPA’s example of RGB colours to confirm that the code does work. PyMVPA provides functionality to train SOMs too (along with many other learning techniques).

Here’s how you would do it with my code:

#For plotting the images
from matplotlib import pyplot as plt

#Training inputs for RGBcolors
colors = np.array(
     [[0., 0., 0.],
      [0., 0., 1.],
      [0., 0., 0.5],
      [0.125, 0.529, 1.0],
      [0.33, 0.4, 0.67],
      [0.6, 0.5, 1.0],
      [0., 1., 0.],
      [1., 0., 0.],
      [0., 1., 1.],
      [1., 0., 1.],
      [1., 1., 0.],
      [1., 1., 1.],
      [.33, .33, .33],
      [.5, .5, .5],
      [.66, .66, .66]])
color_names = \
    ['black', 'blue', 'darkblue', 'skyblue',
     'greyblue', 'lilac', 'green', 'red',
     'cyan', 'violet', 'yellow', 'white',
     'darkgrey', 'mediumgrey', 'lightgrey']

#Train a 20x30 SOM with 400 iterations
som = SOM(20, 30, 3, 400)

#Get output grid
image_grid = som.get_centroids()

#Map colours to their closest neurons
mapped = som.map_vects(colors)

plt.title('Color SOM')
for i, m in enumerate(mapped):
    plt.text(m[1], m[0], color_names[i], ha='center', va='center',
             bbox=dict(facecolor='white', alpha=0.5, lw=0))

Here’s a sample of the output you would get (varies each time you train, but the color names should go to the correct locations in the image):


Recursively copying elements from one Graph to another in TensorFlow

In my previous post on Google’s TensorFlow, I had mentioned the idea of using the library for Genetic Programming applications. In my free time, I tried to map out how such an application would be run. The focus obviously wasn’t on building the smartest way to do GP, but rather on exploring if it was practically possible. One of the ideas that stuck out, was to have a Graph for each candidate solution – which would be populated based on cross-overed elements from the parent solutions’ Graphs. This would require a good API to copy computational elements from one Graph to another in TensorFlow. I tried digging around to see if such functionality was available, but couldn’t find any (atleast from a good exploring of the github repo).

I wrote some rudimentary code to accomplish this, and heres a basic outline of how it works:

Consider the example given on TensorFlow’s Get Started page.

import tensorflow as tf
import numpy as np

# Make 100 phony data points in NumPy.
x_data = np.float32(np.random.rand(2, 100)) # Random input
y_data =[0.100, 0.200], x_data) + 0.300

# Construct a linear model.
b = tf.Variable(tf.zeros([1]))
W = tf.Variable(tf.random_uniform([1, 2], -1.0, 1.0))
y = tf.matmul(W, x_data) + b

# Minimize the squared errors.
loss = tf.reduce_mean(tf.square(y - y_data))
optimizer = tf.train.GradientDescentOptimizer(0.5)
train = optimizer.minimize(loss)

# For initializing the variables.
init = tf.initialize_all_variables()

# Launch the graph
sess = tf.Session()

# Fit the plane.
for step in xrange(0, 201):
    if step % 20 == 0:
        print step,,

# Learns best fit is W: [[0.100  0.200]], b: [0.300]

Lets say you want to copy the main training element train to another graph as defined here:

to_graph = tf.Graph()


1) You first decide on a namespace inside which all the copied elements will exist in to_graph. This is not really required if the Graph you are copying to is empty. But its important to remember that element names matter a lot in TensorFlow’s workings. Therefore, to avoid naming conflicts, its better to define such a namespace. So basically, what would be called “C” in to_graph‘s namespace, would now be called “N/C” where “N” is the namespace String.

Lets assume the namespace we define the copied elements in our example to, is “CopiedOps”.

namespace = "CopiedOps"


2) You then copy all the variables first, one by one, using a dedicated function copy_variable_to_graph. Each time, you supply the original instance, the target graph (to_graph in this case), the namespace, and an extra dictionary called copied_variables. This dictionary is useful while copying the computational nodes (Operation instances, we will call them ops) in the next step. Since Variable instances act as inputs for ops, we need a way to keep track of them for later.

I initially wanted to combine this initialization of variables with the function that copies the computational elements, but I found it really tricky to capture the appropriate Variable instances based on their connections to ops. Anyways, since variables are more like parameter-storing units whose values are needed frequently, its better to initialize them separately.

The Variable instances in the above example are b and W. Heres how you would do it with my code:

copied_variables = {}
b1 = copy_variable_to_graph(b, to_graph, namespace, copied_variables)
W1 = copy_variable_to_graph(W, to_graph, namespace, copied_variables)

Ofcourse, if your code has a lot of variables, you could just store them all in a list and run the above function over all of them with a common dictionary for copied variables.


3) You then recursively copy all the computational nodes (ops, Placeholders) to the other graph. Now heres the nice things about the method- You only need to do it for the topmost node computational node. All connected inputs and Tensors are automatically taken care of!

For the above example, the train object constructed on line 16 is the ‘topmost’ node. So copying the whole learner is as simple as:

train_copy = copy_to_graph(train, to_graph, copied_variables, namespace)

Thats it! The other instances like y, optimizer, loss are automatically replicated in to_graph.

Theres also a helper function in case you want to find the equivalent of an element from the original graph, in to_graph:

loss_copy = get_copied(loss, to_graph, copied_variables, namespace)


4) You can now run the new node in to_graph. Remember to initialize a new Session instance linked to the graph, and initialize all Variables. So heres how you would go about it:

with to_graph.as_default():
    init1 = tf.initialize_all_variables()
    sess1 = tf.Session()

    for step in xrange(0, 201):
        if step % 20 == 0:
            print step,,

This provides an output similar to what you would get from the Get Started original example.

The Code

import tensorflow as tf
from tensorflow.python.framework import ops
from copy import deepcopy

def copy_variable_to_graph(org_instance, to_graph, namespace,
    Copies the Variable instance 'org_instance' into the graph
    'to_graph', under the given namespace.
    The dict 'copied_variables', if provided, will be updated with
    mapping the new variable's name to the instance.

    if not isinstance(org_instance, tf.Variable):
        raise TypeError(str(org_instance) + " is not a Variable")

    #The name of the new variable
    if namespace != '':
        new_name = (namespace + '/' +
        new_name =[':')]

    #Get the collections that the new instance needs to be added to.
    #The new collections will also be a part of the given namespace,
    #except the special ones required for variable initialization and
    collections = []
    for name, collection in org_instance.graph._collections.items():
        if org_instance in collection:
            if (name == ops.GraphKeys.VARIABLES or
                name == ops.GraphKeys.TRAINABLE_VARIABLES or
                namespace == ''):
                collections.append(namespace + '/' + name)

    #See if its trainable.
    trainable = (org_instance in org_instance.graph.get_collection(
    #Get the initial value
    with org_instance.graph.as_default():
        temp_session = tf.Session()
        init_value =

    #Initialize the new variable
    with to_graph.as_default():
        new_var = tf.Variable(init_value,

    #Add to the copied_variables dict
    copied_variables[] = new_var

    return new_var

def copy_to_graph(org_instance, to_graph, copied_variables={}, namespace=""):
    Makes a copy of the Operation/Tensor instance 'org_instance'
    for the graph 'to_graph', recursively. Therefore, all required
    structures linked to org_instance will be automatically copied.
    'copied_variables' should be a dict mapping pertinent copied variable
    names to the copied instances.
    The new instances are automatically inserted into the given 'namespace'.
    If namespace='', it is inserted into the graph's global namespace.
    However, to avoid naming conflicts, its better to provide a namespace.
    If the instance(s) happens to be a part of collection(s), they are
    are added to the appropriate collections in to_graph as well.
    For example, for collection 'C' which the instance happens to be a
    part of, given a namespace 'N', the new instance will be a part of
    'N/C' in to_graph.

    Returns the corresponding instance with respect to to_graph.

    TODO: Order of insertion into collections is not preserved

    #The name of the new instance
    if namespace != '':
        new_name = namespace + '/' +
        new_name =

    #If a variable by the new name already exists, return the
    #correspondng tensor that will act as an input
    if new_name in copied_variables:
        return to_graph.get_tensor_by_name(

    #If an instance of the same name exists, return appropriately
        already_present = to_graph.as_graph_element(new_name,
        return already_present

    #Get the collections that the new instance needs to be added to.
    #The new collections will also be a part of the given namespace.
    collections = []
    for name, collection in org_instance.graph._collections.items():
        if org_instance in collection:
            if namespace == '':
                collections.append(namespace + '/' + name)
    #Take action based on the class of the instance

    if isinstance(org_instance, tf.python.framework.ops.Tensor):

        #If its a Tensor, it is one of the outputs of the underlying
        #op. Therefore, copy the op itself and return the appropriate
        op = org_instance.op
        new_op = copy_to_graph(op, to_graph, copied_variables, namespace)
        output_index = op.outputs.index(org_instance)
        new_tensor = new_op.outputs[output_index]
        #Add to collections if any
        for collection in collections:
            to_graph.add_to_collection(collection, new_tensor)

        return new_tensor

    elif isinstance(org_instance, tf.python.framework.ops.Operation):

        op = org_instance

        #If it has an original_op parameter, copy it
        if op._original_op is not None:
            new_original_op = copy_to_graph(op._original_op, to_graph,
                                            copied_variables, namespace)
            new_original_op = None

        #If it has control inputs, call this function recursively on each.
        new_control_inputs = [copy_to_graph(x, to_graph, copied_variables,
                              for x in op.control_inputs]

        #If it has inputs, call this function recursively on each.
        new_inputs = [copy_to_graph(x, to_graph, copied_variables,
                      for x in op.inputs]

        #Make a new node_def based on that of the original.
        #An instance of tensorflow.core.framework.graph_pb2.NodeDef, it
        #stores String-based info such as name, device and type of the op.
        #Unique to every Operation instance.
        new_node_def = deepcopy(op._node_def)
        #Change the name = new_name

        #Copy the other inputs needed for initialization
        output_types = op._output_types[:]
        input_types = op._input_types[:]

        #Make a copy of the op_def too.
        #Its unique to every _type_ of Operation.
        op_def = deepcopy(op._op_def)

        #Initialize a new Operation instance
        new_op = tf.python.framework.ops.Operation(new_node_def,
        #Use Graph's hidden methods to add the op
        for device_function in reversed(to_graph._device_function_stack):

        return new_op

        raise TypeError("Could not copy instance: " + str(org_instance))

def get_copied(original, graph, copied_variables={}, namespace=""):
    Get a copy of the instance 'original', present in 'graph', under
    the given 'namespace'.
    'copied_variables' is a dict mapping pertinent variable names to the
    copy instances.

    #The name of the copied instance
    if namespace != '':
        new_name = namespace + '/' +
        new_name =

    #If a variable by the name already exists, return it
    if new_name in copied_variables:
        return copied_variables[new_name]

    return graph.as_graph_element(new_name, allow_tensor=True,

Working with feeding is pretty simple too:

>>> x = tf.placeholder("float")
>>> a = tf.constant(3, "float")
>>> y = tf.add(x, a)
>>> namespace = "CopiedOps"
>>> to_graph = tf.Graph()
>>> copied_variables = {}
>>> y1 = copy_to_graph(y, to_graph, namespace)
>>> x1 = get_copied(x, to_graph, namespace)
>>> with to_graph.as_default():
    sess = tf.Session()
    print, feed_dict={x1: 5})


I guess thats all for my hacking around with TensorFlow for the week. If you intend to use this code, please note that it may not be perfect at doing what it says. I haven’t tried it out with all sorts of TensorFlow data structures as yet, so be open to getting an Exception or two that you may have to fix. Infact, do drop me a comment or mail so I can make this code as fool-proof as I can. Cheers!

K-Means Clustering with TensorFlow

Google recently open-sourced its Artificial Intelligence/Numerical Computing library called TensorFlow. TensorFlow was developed by members of the Google Brain team, and has the flexibility to run on a variety of platforms – including GPUs and mobile devices.


TensorFlow’s methodology uses what they called data-flow graphs. Consider the following diagram from the Wikipedia page on Genetic Programming (which could have some interesting applications with TensorFlow, I think):

Genetic_Program_TreeAs you probably understood, the graphical structure is a way of representing a computational expression in the form of a Tree. Every node is an operation (TensorFlow calls them ops, short for operations). The non-leaf nodes are pretty easy to understand. Some leaf nodes are a special case of an operation, always ‘returning’ a constant value (like 7 or 2.2 in the Tree). Others (like X or Y) act as placeholders that will be fed in at the time of execution. If you look at the arrows, you will realize that their directions denote the dependencies between outputs of different nodes. Hence, data (TensorFlow calls them Tensors) will flow in the opposite direction along each node – Hence the name TensorFlow. TensorFlow provides other components over this graphical abstraction, like persistent memory elements that retain data (called Variables), and optimization techniques to fine-tune the parameters in these Variables in applications like Neural Networks.

TensorFlow has a powerful Python API. The TensorFlow team has done an awesome job of writing the documentation (which is a little tricky to navigate). If you are completely new to this, heres a few links to get you started (in the order you should visit them):

1. Basic setup

2. Read this example to get a vague idea of what a TensorFlow code looks like.

3. Now read this explanation of the basic components of TensorFlow. It helps if you read the above example again, or simultaneously.

4. Read this detailed example of using TensorFlow for a common ML problem.

5. Once you have a decent understanding of the basic components and how they work, you can look at the Python docs for reference.

Now here is the code I wrote for K-Means clustering using TensorFlow. As a disclaimer, I will mention that this code is based on my (at the time of writing this) 2-day old understanding of how the library works. If you find any errors or know any optimizations possible, do drop a comment! The code is heavily documented, so do go through in-line docs.

import tensorflow as tf
from random import choice, shuffle
from numpy import array

def TFKMeansCluster(vectors, noofclusters):
    K-Means Clustering using TensorFlow.
    'vectors' should be a n*k 2-D NumPy array, where n is the number
    of vectors of dimensionality k.
    'noofclusters' should be an integer.

    noofclusters = int(noofclusters)
    assert noofclusters < len(vectors)

    #Find out the dimensionality
    dim = len(vectors[0])

    #Will help select random centroids from among the available vectors
    vector_indices = list(range(len(vectors)))

    #We initialize a new graph and set it as the default during each run
    #of this algorithm. This ensures that as this function is called
    #multiple times, the default graph doesn't keep getting crowded with
    #unused ops and Variables from previous function calls.

    graph = tf.Graph()

    with graph.as_default():


        sess = tf.Session()


        ##First lets ensure we have a Variable vector for each centroid,
        ##initialized to one of the vectors from the available data points
        centroids = [tf.Variable((vectors[vector_indices[i]]))
                     for i in range(noofclusters)]
        ##These nodes will assign the centroid Variables the appropriate
        centroid_value = tf.placeholder("float64", [dim])
        cent_assigns = []
        for centroid in centroids:
            cent_assigns.append(tf.assign(centroid, centroid_value))

        ##Variables for cluster assignments of individual vectors(initialized
        ##to 0 at first)
        assignments = [tf.Variable(0) for i in range(len(vectors))]
        ##These nodes will assign an assignment Variable the appropriate
        assignment_value = tf.placeholder("int32")
        cluster_assigns = []
        for assignment in assignments:

        ##Now lets construct the node that will compute the mean
        #The placeholder for the input
        mean_input = tf.placeholder("float", [None, dim])
        #The Node/op takes the input and computes a mean along the 0th
        #dimension, i.e. the list of input vectors
        mean_op = tf.reduce_mean(mean_input, 0)

        ##Node for computing Euclidean distances
        #Placeholders for input
        v1 = tf.placeholder("float", [dim])
        v2 = tf.placeholder("float", [dim])
        euclid_dist = tf.sqrt(tf.reduce_sum(tf.pow(tf.sub(
            v1, v2), 2)))

        ##This node will figure out which cluster to assign a vector to,
        ##based on Euclidean distances of the vector from the centroids.
        #Placeholder for input
        centroid_distances = tf.placeholder("float", [noofclusters])
        cluster_assignment = tf.argmin(centroid_distances, 0)


        ##This will help initialization of all Variables defined with respect
        ##to the graph. The Variable-initializer should be defined after
        ##all the Variables have been constructed, so that each of them
        ##will be included in the initialization.
        init_op = tf.initialize_all_variables()

        #Initialize all variables


        #Now perform the Expectation-Maximization steps of K-Means clustering
        #iterations. To keep things simple, we will only do a set number of
        #iterations, instead of using a Stopping Criterion.
        noofiterations = 100
        for iteration_n in range(noofiterations):

            ##EXPECTATION STEP
            ##Based on the centroid locations till last iteration, compute
            ##the _expected_ centroid assignments.
            #Iterate over each vector
            for vector_n in range(len(vectors)):
                vect = vectors[vector_n]
                #Compute Euclidean distance between this vector and each
                #centroid. Remember that this list cannot be named
                #'centroid_distances', since that is the input to the
                #cluster assignment node.
                distances = [, feed_dict={
                    v1: vect, v2:})
                             for centroid in centroids]
                #Now use the cluster assignment node, with the distances
                #as the input
                assignment =, feed_dict = {
                    centroid_distances: distances})
                #Now assign the value to the appropriate state variable
      [vector_n], feed_dict={
                    assignment_value: assignment})

            #Based on the expected state computed from the Expectation Step,
            #compute the locations of the centroids so as to maximize the
            #overall objective of minimizing within-cluster Sum-of-Squares
            for cluster_n in range(noofclusters):
                #Collect all the vectors assigned to this cluster
                assigned_vects = [vectors[i] for i in range(len(vectors))
                                  if[i]) == cluster_n]
                #Compute new centroid location
                new_location =, feed_dict={
                    mean_input: array(assigned_vects)})
                #Assign value to appropriate variable
      [cluster_n], feed_dict={
                    centroid_value: new_location})

        #Return centroids and assignments
        centroids =
        assignments =
        return centroids, assignments

Never, ever, EVER, do something like this:

for i in range(100):
    x =, placeholder))

This may seem pretty harmless at first glance, but every time you initialize an op, (like tf.assign or even tf.zeros, you are adding new ops instances to the default graph. Instead, as shown in the code, define a particular op for each task (however specialized) just once in the code. Then, during every of your iterations, call over the required nodes. To check if you are crowding your graph with unnecessary ops, just print out the value of len(graph.get_operations()) during every iteration and see if its is increasing. In fact, should be the only way you interact with the graph during every iteration.

As visible on lines 138 and 139, you can call over a list of ops/Variables to return a list of the outputs in the same order.

There are a lot of intricacies of TensorFlow that this code does not go into, such as assigning devices to nodes, Graph collections, dependencies, etc. Thats partially because I am still understanding these aspects one by one. But at first glance, TensorFlow seems to be a pretty powerful and flexible way of doing AI/ML-based computations. I would personally like to explore its applications in developing dependency-based statistical metrics for data – for which I am currently using custom tree-like data structures. Lets hope this gesture by Google does lead to an increase in the applications and research in AI. Cheers!

Traffic Lights and Nash Equilibrium

A post after a long while! I have been wanting to write a short blog post on how traffic lights relate to Nash equilibrium for a long time, so decided to finally do it today. This is an short but smart example on how traffic lights can be thought of as a medium to enforce certain special conditions among agents (people driving on the roads), and avoid conflicts in the process.

Imagine the scenario of two cars coming along different roads towards an intersection. As they approach the point where the two roads meet, they notice each other. Each one now has two choices: to either ‘Go‘ (i.e. continue driving) or ‘Stop‘ (i.e. hit the brakes). Lets draw the corresponding pay-off matrix (Calling the cars Player1 and Player2)

            Go              Stop

Go       -10, -10      4, -1

Stop     -1, 4           -3, -3

Assume Player1 corresponds to the rows, and Player2 corresponds to the columns. Of course it doesn’t really matter here, since the matrix is symmetrical. For each of the four tuples, the first number corresponds to the payoff to Player1, and the second one to Player2. Lets look at the payoffs in each of the four scenarios. If both players Go, it will cause an accident- hence the high negative payoffs for both. If one Goes and the other Stops, the stopping one gets a small negative payoff (-1), while the one who continues driving gets the best possible payoff (4). The weirdest case would be if both Stop, causing something of an infinite wait (technically) – yielding negative payoffs for both (Though not as negative as an accident, since being bored waiting is better than being dead any day). So how do we determine who does what? For that, we will analyze the best responses.

Consider Player1. If Player2 decides to Go, Player1’s better payoff comes from doing Stop (-1 > -10). On the other hand, if Player2 decides to Stop, Player1 should Go (4 > -3). Now since this is a symmetrical scenario, the same applies the other way round too. So lets mark these ‘best responses’ on the payoff matrix.

             Go              Stop

    Go       -10, -10      4*, -1*

    Stop     -1*, 4*           -3, -3

Remember: In a pure-strategy payoff matrix, whenever a cell corresponds to the best response for all players involved, it indicates a Nash equilibrium. In simple terms, a Nash equilibrium is a scenario where NO player benefits from changing his/her strategy, given the strategies of all the others. Look at the cell (Stop, Go) for example. Player1 has decided to halt and Player2 is continuing to drive. Would either be better off doing the opposite, given the other continues with what he is doing? Not really, cause it will either lead to an accident (Go, Stop) or an infinite wait (Stop, Stop).

So we have two different Nash Equilibria, essentially at the scenarios where one player goes and other stops. But we must realize that each player will prefer one of these two scenarios. Player1 prefers (Go, Stop) – where he gets to continue on and the other stops, while Player2 prefers (Stop, Go). If both decide to do what they prefer, its obviously an accident.

Now this is where traffic lights come in. Based on the time instant when the two cars arrive at the intersection, the traffic lights will tell one Player to stop, while the other to Go – thus forcing the whole ‘Game’ into one of its Nash equilibria. The main thing to remember is that it does so randomly, or atleast fairly (Sometimes, a light may go green for a little longer for roads that are known to have high traffic at certain times of the day). Because of this, no agent (i.e. driver) has any incentive to gain from deviating from what the traffic lights tell him/her to do. Obviously (like all Game theory simulations really), this is a simplified version of what really happens, but you get the basic idea: traffic lights randomly push the game towards one of its Nash equilibria.

Generating rudimentary Mind-Maps from Word2Vec models

Mind Maps are notorious for being a very powerful organizational tool for a variety of tasks, such as brainstorming, planning and problem solving. Visual (or rather, graphical) arrangement of ideas aids the thought process, and in fact mimics the way we explore our mental knowledge base while ‘thinking’. There are a lot of online tools available for drawing out mind maps, but none that can generate one. By generate, I mean coming up with the verbal content.

For the longest time (almost 8 months now), I have been tinkering with ways to combine text mining and graph theory into a framework to generate a Mind-Map (given a text document). Ofcourse, the first thing argument would be that there cannot be a single possible Mind-Map for any block of text. And its true! However, having such an automated Map as a reference while building your own, might give you more insights (especially while brainstorming), or help you remember links that you might miss out (for studying). Lets see what a Mind-Map looks like:


Two points:

i. A Mind-Map is NOT a tree that divides the overall topic into its subtopics recursively. Its infact more like a graph, that links terms that are semantically related.

ii. Like ‘Night’ might make the word ‘Day’ pop up in your mind, a mind-map may have links between opposite meaning concepts, like Thicker-Thinner in the above example.

There are of course other points like using images to enhance concepts, and so on. But thats not the point of this post (And I suck at designer-style creativity anyways). Heres an article to help you get acquainted with the process of building and using your own Mind-Maps, just in case.

In my last post, I described a method to generate a Word2Vec model from a text document (where I used Wikipedia articles as an example). I will now describe the methodology I followed to generate a rudimentary mind-map from a Wikipedia article’s Word2Vec model model.

Step 1: Figuring out the top n terms from the document

(As I mentioned in my previous post, I only use stemmed unigrams. You can of course use higher-order ngrams, but that makes things a little tricky (but more accurate, if your algorithms for n-gram generation are solid).)

Here, n denotes the number of ‘nodes’ in my Mind-Map. In my trial-and-errors, 50 is usually a good enough number. Too less means too little information, and too much would mean noisy mind-maps. You can obviously play around with different choices of n. I use the co-occurrence based technique described in this paper to list out the top n words in a document. Heres the Python code for it:

def _get_param_matrices(vocabulary, sentence_terms):
    1. Top 300(or lesser, if vocab is short) most frequent terms(list)
    2. co-occurence matrix wrt the most frequent terms(dict)
    3. Dict containing Pg of most-frequent terms(dict)
    4. nw(no of terms affected) of each term(dict)

    #Figure out top n terms with respect to mere occurences
    n = min(300, len(vocabulary))
    topterms = list(vocabulary.keys())
    topterms.sort(key = lambda x: vocabulary[x], reverse = True)
    topterms = topterms[:n]

    #nw maps term to the number of terms it 'affects'
    #(sum of number of terms in all sentences it
    #appears in)
    nw = {}
    #Co-occurence values are wrt top terms only
    co_occur = {}
    #Initially, co-occurence matrix is empty
    for x in vocabulary:
        co_occur[x] = [0 for i in range(len(topterms))]

    #Iterate over list of all sentences' vocabulary dictionaries
    #Build the co-occurence matrix
    for sentence in sentence_terms:
        total_terms = sum(list(sentence.values()))
        #This list contains the indices of all terms from topterms,
        #that are present in this sentence
        top_indices = []
        #Populate top_indices
        top_indices = [topterms.index(x) for x in sentence
                       if x in topterms]
        #Update nw dict, and co-occurence matrix
        for term in sentence:
            nw[term] = nw.get(term, 0) + total_terms
            for index in top_indices:
                co_occur[term][index] += (sentence[term] *

    #Pg is just nw[term]/total vocabulary of text
    Pg = {}
    N = sum(list(vocabulary.values()))
    for x in topterms:
        Pg[x] = float(nw[x])/N

    return topterms, co_occur, Pg, nw

def get_top_n_terms(vocabulary, sentence_terms, n=50):
    Returns the top 'n' terms from a block of text, in the form of a list,
    from most important to least.

    'vocabulary' should be a dict mapping each term to the number
    of its occurences in the entire text.
    'sentence_terms' should be an iterable of dicts, each denoting the
    vocabulary of the corresponding sentence.

    #First compute the matrices
    topterms, co_occur, Pg, nw = _get_param_matrices(vocabulary,

    #This dict will map each term to its weightage with respect to the
    result = {}

    N = sum(list(vocabulary.values()))
    #Iterates over all terms in vocabulary
    for term in co_occur:
        term = str(term)
        org_term = str(term)
        for x in Pg:
            #expected_cooccur is the expected cooccurence of term with this
            #term, based on nw value of this and Pg value of the other
            expected_cooccur = nw[term] * Pg[x]
            #Result measures the difference(in no of terms) of expected
            #cooccurence and  actual cooccurence
            result[org_term] = ((co_occur[term][topterms.index(x)] -
                                 expected_cooccur)**2/ float(expected_cooccur))

    terms = list(result.keys())
    terms.sort(key=lambda x: result[x],

    return terms[:n]

The get_top_n_terms function does the job, and I guess the docstrings and in-line comments explain how (combined with the paper, of course). If you have the patience and time, you can infact just see the entire vocabulary of your Word2Vec model and pick out those terms that you want to see in your Mind-Map. This is likely to give you the best results (but with a lot of efforts).

Step 2: Deciding the Root

The Root would be that term out of your nodes, which denotes the central idea behind your entire Mind-Map. Since the number of nodes is pretty small compared to the vocabulary, its best to pick this one out manually. OR, you could use that term which has the highest occurrence in the vocabulary, among the selected nodes. This step may require some trial and error (But then what part of data science doesn’t?).

Step 3: Generating the graph (Mind-Map)

This is of course the most crucial step, and the one I spent the most time on. First off, let me define what I call the contextual vector of a term.

Say the root of the Mind Map is ‘computer’. It is linked to the term ‘hardware’. ‘hardware’ is in turn linked to ‘keyboard’. The Word2Vec vector of ‘keyboard’ would be obtained as model[keyboard] in the Python/Gensim environment. Lets denote it with v_{keyboard}.

Now consider yourself in the position of someone building a Mind Map. When you think of ‘keyboard’, given the structure of what you have come up with so far, you will be thinking of it in the context of ‘computer’ and ‘hardware’. Thats why you probably won’t link ‘keyboard’ to ‘music’ (atleast not directly). This basically shows that the contextual vector for ‘keyboard’ (lets call it v'_{keyboard}) must be biased in its direction (since we use cosine similarity with Word2Vec models, only directions matter) towards v_{computer} and v_{hardware}. Moreover, intuitively speaking, the influence of v_{hardware} on v'_{keyboard} should be greater than that of v_{computer} – in essence, the influence of the context of a parent reduces as you go further and further away from it. To take this into account, I use what I call the contextual decay factor \alpha. Expressing it mathematically,

v'_{computer} = v_{computer}

v'_{hardware} = (1 - \alpha) v_{hardware} + \alpha v'_{computer}

v'_{keyboard} = (1 - \alpha) v_{keyboard} + \alpha v'_{hardware}

And so on…

Finally, to generate the actual Mind-Map, heres the algorithm I use (I hope the inline comments are enough to let you know what I have done):

from scipy.spatial.distance import cosine
from networkx import Graph

def build_mind_map(model, stemmer, root, nodes, alpha=0.2):
    Returns the Mind-Map in the form of a NetworkX Graph instance.

    'model' should be an instance of gensim.models.Word2Vec
    'nodes' should be a list of terms, included in the vocabulary of
    'root' should be the node that is to be used as the root of the Mind
    Map graph.
    'stemmer' should be an instance of StemmingHelper.

    #This will be the Mind-Map
    g = Graph()

    #Ensure that the every node is in the vocabulary of the Word2Vec
    #model, and that the root itself is included in the given nodes
    for node in nodes:
        if node not in model.vocab:
            raise ValueError(node + " not in model's vocabulary")
    if root not in nodes:
        raise ValueError("root not in nodes")

    ##Containers for algorithm run
    #Initially, all nodes are unvisited
    unvisited_nodes = set(nodes)
    #Initially, no nodes are visited
    visited_nodes = set([])
    #The following will map visited node to its contextual vector
    visited_node_vectors = {}
    #Thw following will map unvisited nodes to (closest_distance, parent)
    #parent will obviously be a visited node
    node_distances = {}

    #Initialization with respect to root
    current_node = root
    visited_node_vectors[root] = model[root]

    #Build the Mind-Map in n-1 iterations
    for i in range(1, len(nodes)):
        #For every unvisited node 'x'
        for x in unvisited_nodes:
            #Compute contextual distance between current node and x
            dist_from_current = cosine(visited_node_vectors[current_node],
            #Get the least contextual distance to x found until now
            distance = node_distances.get(x, (100, ''))
            #If current node provides a shorter path to x, update x's
            #distance and parent information
            if distance[0] > dist_from_current:
                node_distances[x] = (dist_from_current, current_node)

        #Choose next 'current' as that unvisited node, which has the
        #lowest contextual distance from any of the visited nodes
        next_node = min(unvisited_nodes,
                        key=lambda x: node_distances[x][0])

        ##Update all containers
        parent = node_distances[next_node][1]
        del node_distances[next_node]
        next_node_vect = ((1 - alpha)*model[next_node] +
        visited_node_vectors[next_node] = next_node_vect

        #Add the link between newly selected node and its parent(from the
        #visited nodes) to the NetworkX Graph instance

        #The new node becomes the current node for the next iteration
        current_node = next_node

    return g

Notes: I use NetworkX’s simple Graph-building infrastructure to do the core job of maintaining the Mind-Map (makes it easier later for visualization too). To compute cosine distance, I use SciPy. Moreover, on lines 74 and 75, I use the StemmingHelper class from my last post to include the stemmed words in their original form in the actual mind-map (instead of their stemmed versions). You can pass the StemmingHelper class directly as the parameter stemmer. On the other hand, if you aren’t using stemming at all, just remove those parts of the code on lines 4, 74, and 75.

If you look at the algorithm, you will realize that its somewhat like Dijkstra’s algorithm for single-source shortest paths, but in a different context.

Example Outputs

Now for the results. (I used PyGraphViz for simple, quick-and-dirty visualization)

Heres the Mind-Map that was generated for the Wikipedia article on Machine Learning:


One on Artificial Intellgence:

aiAnd finally, one on Psychology:


The results are similar on all the other topics I tried out. A few things I noticed:

i. I should try and involve bi-grams and trigrams too. I am pretty sure it will make the Word2Vec model itself way stronger, and thus improve the interpretation of terms with respect to the document.

ii. There are some unnecessary terms in the Mind Maps, but given the short length of the texts (compared to most text mining tasks), the Keyword extraction algorithm in the paper I mentioned before, seems really good.

iii. Maybe one could use this for brainstorming. Like you start out with a term(node) of your choice. Then, the framework suggests you terms to connect it to. Once you select one of them, you get further recommendations for it based on the context, etc. – Something like a Mind-Map helper.

Anyways, this was a long post, and thanks a lot if you stuck to the end :-). Cheers!

Generating a Word2Vec model from a block of Text using Gensim (Python)


Word2Vec is a semantic learning framework that uses a shallow neural network to learn the representations of words/phrases in a particular text. Simply put, its an algorithm that takes in all the terms (with repetitions) in a particular document, divided into sentences, and outputs a vectorial form of each. The ‘advantage’ word2vec offers is in its utilization of a neural model in understanding the semantic meaning behind those terms. For example, a document may employ the words ‘dog’ and ‘canine’ to mean the same thing, but never use them together in a sentence. Ideally, Word2Vec would be able to learn the context and place them together in its semantic space. Most applications of Word2Vec using cosine similarity to quantify closeness. This Quora question (or rather its answers) does a good job of explaining the intuition behind it.

You would need to take the following steps to develop a Word2Vec model from a block of text (Usually, documents that are extensive and yet stick to the topic of interest with minimum ambiguity do well):

[I use Gensim’s Word2Vec API in Python to form Word2Vec models of Wikipedia articles.]

1. Obtain the text (obviously)

To obtain the Wikipedia articles, I use the Python wikipedia library. Once installed from the link, here’s how you could use it obtain all the text from an aritcle-

#'title' denotes the exact title of the article to be fetched
title = "Machine learning"
from wikipedia import page
wikipage = page(title)

You could then use wikipage.context to access the entire textual context in the form of a String. Now, incase you don’t have the exact title and want to do a search, you would do:

from wikipedia import search, page
titles = search('machine learning')
wikipage = page(titles[0])

[Tip: Store the content into a file and access it from there. This would provide you a reference later, if needed.]

2. Preprocess the text

In the context of Python, you would require an iterable that yields one iterable for each sentence in the text. The inner iterable would contain the terms in the particular sentence. A ‘term’ could be individual words like ‘machine’, or phrases(n-grams) like ‘machine learning’, or a combination of both. Coming up with appropriate bigrams/trigrams is a tricky task on its own, so I just stick to unigrams.

First of all, I remove all special characters and short lines from the article, to eliminate noise. Then, I use Porter Stemming on my unigrams, using a ‘wrapper’ around Gensim’s stemming API.

from gensim.parsing import PorterStemmer
global_stemmer = PorterStemmer()

class StemmingHelper(object):
    Class to aid the stemming process - from word to stemmed form,
    and vice versa.
    The 'original' form of a stemmed word will be returned as the
    form in which its been used the most number of times in the text.

    #This reverse lookup will remember the original forms of the stemmed
    word_lookup = {}

    def stem(cls, word):
        Stems a word and updates the reverse lookup.

        #Stem the word
        stemmed = global_stemmer.stem(word)

        #Update the word lookup
        if stemmed not in cls.word_lookup:
            cls.word_lookup[stemmed] = {}
        cls.word_lookup[stemmed][word] = (
            cls.word_lookup[stemmed].get(word, 0) + 1)

        return stemmed

    def original_form(cls, word):
        Returns original form of a word given the stemmed version,
        as stored in the word lookup.

        if word in cls.word_lookup:
            return max(cls.word_lookup[word].keys(),
                       key=lambda x: cls.word_lookup[word][x])
            return word

Refer to the code and docstrings to understand how it works. (Its pretty simple anyways). It can be used as follows-

>>> StemmingHelper.stem('learning')
>>> StemmingHelper.original_form('learn')

Pre-stemming, you could also use a list of stopwords to eliminate terms that occur frequently in the English language, but don’t carry much semantic meaning.

After your pre-processing, lets assume you come up with an iterable called sentences from your source of text.

3. Figure out the values for your numerical parameters

Gensim’s Word2Vec API requires some parameters for initialization. Ofcourse they do have default values, but you want to define some on your own:

i. size – Denotes the number of dimensions present in the vectorial forms. If you have read the document and have an idea of how many ‘topics’ it has, you can use that number. For sizeable blocks, people use 100-200. I use around 50 for the Wikipedia articles. Usually, you would want to repeat the initialization for different numbers of topics in a certain range, and pick the one that yields the best results (depending on your application – I will be using them to build Mind-Maps, and I usually have to try values from 20-100.). A good heuristic thats frequently used is the square-root of the length of the vocabulary, after pre-processing.

ii. min_count – Terms that occur less than min_count number of times are ignored in the calculations. This reduces noise in the semantic space. I use 2 for Wikipedia. Usually, the bigger and more extensive your text, the higher this number can be.

iii. window – Only terms hat occur within a window-neighbourhood of a term, in a sentence, are associated with it during training. The usual value is 4. Unless your text contains big sentences, leave it at that.

iv. sg – This defines the algorithm. If equal to 1, the skip-gram technique is used. Else, the CBoW method is employed. (Look at the aforementioned Quora answers). I usually use the default(1).

4. Initialize the model and use it

The model can be generated using Gensim’s API, as follows:

from gensim.models import Word2Vec
min_count = 2
size = 50
window = 4

model = Word2Vec(sentences, min_count=min_count, size=size, window=window)

Now that you have the model initialized, you can access all the terms in its vocabulary, using something like list(model.vocab.keys()). To get the vectorial representation of a particular term, use model[term]. If you have used my stemming wrapper, you could find the appropriate original form of the stemmed terms using StemmingHelper.original_form(term). Heres an example, from the Wiki article on Machine learning:

>>> vocab = list(model.vocab.keys())
>>> vocab[:10]
[u'represent', u'concept', u'founder', u'focus', u'invent', u'signific', u'abil', u'implement', u'benevol', u'hierarch']
>>> 'learn' in model.vocab
>>> model['learn']
array([  1.23792759e-03,   5.49776992e-03,   2.18261080e-03,
         8.37465748e-03,  -6.10323064e-03,  -6.94877980e-03,
         6.29429379e-03,  -7.06598908e-03,  -7.16267806e-03,
        -2.78065586e-03,   7.40372669e-03,   9.68673080e-03,
        -4.75220988e-03,  -8.34807567e-03,   5.25208283e-03,
         8.43616109e-03,  -1.07231298e-02,  -3.88528360e-03,
        -9.20894090e-03,   4.17305576e-03,   1.90116244e-03,
        -1.92442467e-03,   2.74807960e-03,  -1.01113841e-02,
        -3.71694425e-03,  -6.60350174e-03,  -5.90716442e-03,
         3.90679482e-03,  -5.32188127e-03,   5.63300075e-03,
        -5.52612450e-03,  -5.57334488e-03,  -8.51202477e-03,
        -8.78736563e-03,   6.41061319e-03,   6.64879987e-03,
        -3.55080629e-05,   4.81080823e-03,  -7.11903954e-03,
         9.83678619e-04,   1.60697231e-03,   7.42980337e-04,
        -2.12235347e-04,  -8.05167668e-03,   4.08948492e-03,
        -5.48054813e-04,   8.55423324e-03,  -7.08682090e-03,
         1.57684216e-03,   6.79725129e-03], dtype=float32)
>>> StemmingHelper.original_form('learn')
>>> StemmingHelper.original_form('hierarch')

As you might have guessed, the vectors are NumPy arrays, and support all their functionality. Now, to compute the cosine similarity between two terms, use the similarity method. Cosine similarity is generally bounded by [-1, 1]. The corresponding ‘distance’ can be measured as 1-similarity. To figure out the terms most similar to a particular one, you can use the most_similar method.

>>> model.most_similar(StemmingHelper.stem('classification'))
[(u'spam', 0.25190210342407227), (u'metric', 0.22569453716278076), (u'supervis', 0.19861873984336853), (u'decis', 0.18607790768146515), (u'inform', 0.17607420682907104), (u'artifici', 0.16593246161937714), (u'previous', 0.16366994380950928), (u'train', 0.15940310060977936), (u'network', 0.14765430986881256), (u'term', 0.14321796596050262)]
>>> model.similarity(StemmingHelper.stem('classification'), 'supervis')
>>> model.similarity('unsupervis', 'supervis')

There’s a ton of other functionality that’s supported by the class, so you should have a look at the API I gave a link to. Happy topic modelling 🙂