A (small) introduction to Boosting

What is Boosting?

Boosting is a machine learning meta-algorithm that aims to iteratively build an ensemble of weak learners, in an attempt to generate a strong overall model.

Lets look at the highlighted parts one-by-one:

1. Weak Learners: A ‘weak learner’ is any ML algorithm (for regression/classification) that provides an accuracy slightly better than random guessing. For example, consider a problem of binary classification with approximately 50% of samples belonging to each class. Random guessing in this case would yield an accuracy of around 50%. So a weak learner would be any algorithm, however simple, that slightly improves this score – say 51-55% or more. Usually, weak learners are pretty basic in nature. Some examples are Decision Stumps or smaller Decision Trees. In most cases, either one of these two is used as a weak learner in Boosting frameworks.

2. Ensemble: The overall model built by Boosting is a weighted sum of all of the weak learners. The weights and training given to each ensures that the overall model yields a pretty high accuracy (sometimes state-of-the-art).

3. Iteratively build: Many ensemble methods such as bagging/random forests are capable of training their components in parallel. This is because the training of any of those weak learners does not depend on the training of the others. That is not the case with Boosting. At each step, Boosting tries to evaluate the shortcomings of the overall model built till the previous step, and then generates a weak learner to battle these shortcomings. This weak learner is then added to the total model. Therefore, the training must necessarily proceed in a serial/iterative manner.

4. Meta-algorithm: Since Boosting isn’t necessarily an ML algorithm by itself, but rather uses other (basic) algorithms to build a stronger one, it is said to be a ‘meta’ algorithm.

How does Boosting work?


Usually, a Boosting framework for regression works as follows:

(The overall model at step ‘i’ is denoted by F_i.)

1. Start with a simple model F_0, usually predicting a common (average) value for all instances.

2. For i from 1 to M do:

2.a) Evaluate the shortcomings of F_{i - 1}, in terms of a value r_j for each element x_j in the training set.
2.b) Generate a new weak learner h_i(x) based on the r_js.
2.c) Compute the corresponding weight \gamma_i.
2.d) Update the current model as follows:

F_i = F_{i - 1} + \nu \gamma_i h_i(x)

where \nu denotes the learning rate of the framework.

3. Return F_M as the overall output.

Each of the iterations in Boosting essentially tries to ‘improve’ the current model by introducing another learner into the ensemble. Having such an ensemble not only reduces the bias (which is generally pretty high for weak learners), but also the variance (since multiple learners contribute to the overall output, each with their own unique training). There are various versions of Boosting, and all of them primarily differ in the way they perform steps 2.a, 2.b. and 2.c.

For example, Gradient Boosting computes r_j as the gradient of a Loss function (whose expression involves target output y and model output till previous step, F_{i - 1}) with respect to F_{i - 1} at data point x_j. The ith weak learner h_i is then trained to predict r_j from x_j. The weight \gamma_i is then optimized so as to minimize the total Loss value with respect to F_i.

AdaBoost, short for Adaptive Boosting, works by assigning r_j as a ‘weight’ to x_j based on how miscalculated the model output is according to F_{i - 1}. h_i and \gamma_i are then trained to optimize the weighted sum of errors with respect to the output from h_i. AdaBoost is considered to be one of the best out-of-the-box classifiers today. Mathematically speaking, AdaBoost is a special case of Gradient Boosting.

The weak-learners used in Boosting are indeed ‘weak’. The most commonly used ones are decision trees of a fixed number of leaf nodes. Using complex methods such as SVMs in Boosting usually leads to overfitting. However, the total number of weak learners M should also be controlled, mostly based on experimentation. Having a low M will cause underfitting, but a very high value could lead to overfitting.

To improve the bias/variance characteristics of Boosting, bagging is generally employed. What this essentially does is train each weak learner on a subset of the overall dataset (rather than the whole training set). This causes a good decrease in variance and tends to reduce overfitting.

Classification applications would work slightly different from regression, of course. The most commonly used method is to build a boosted model for each class. This model would predict the probability of a data point belonging to that class. During training, the inputs belonging to the corresponding class will be given a target of 1, with the others getting a target of 0. The outputs from all the models could then be put into a softmax function for normalization.

What are the drawbacks of Boosting?

The most obvious drawback of boosting is the computational complexity. Performing so many iterations, and generating a new model at each, requires a lot of computations and time (and also space). The fact that the ensemble has to be built iteratively doesn’t help matters. However, using simple weak learners (instead of big decision trees) helps to mitigate this problem.

The biggest criticism for Boosting comes from its sensitivity to noisy data. Think of it this way. At each iteration, Boosting tries to improve the output for data points that haven’t been predicted well till now. If your dataset happens to have some (or a lot) of misclassified/outlier points, the meta-algorithm will try very hard to fit subsequent weak learners to these noisy samples. As a result, overfitting is likely to occur. The exponential loss function used in AdaBoost is particularly vulnerable to this issue (since the error from an outlier will get an exponentially-weighed importance for future learners).

Implementation in Scikit-Learn

  1. AdaBoost implementation
  2. Gradient Tree Boosting implementation

Simple Turing Machine using TensorFlow

A post after many days! This one describes a simple code I wrote to simulate a Deterministic Turing Machine using TensorFlow. Its pretty basic with respect to the design, using TensorFlow only for the state change computations. The movement on the tape, and handling of halts is taken care of by old-school Python. This is not intended to be a practical usage of TensorFlow, its just something I did and am writing about here :-).

To demonstrate the code, I will use the 3-state Busy Beaver example from the Wiki page.

Encoding the Problem

Each of the states are encoded as real numbers (integers in this case, but for whatever reasons you could use floats too). So ‘A’, ‘B’, ‘C’ and ‘HALT’ become 0, 1, 2 and 3 respectively. Accordingly, the initial and final states are defined as follows:

initial_state = 0
final_states = [3]

Since there can be multiple final states generally, final_states is defined as a list. As with the states, the potential symbols are numbers too. The blank symbol, as per the Busy Beaver problem definition, is 0.

blank_input = 0

Now we come to the transition function. It is defined using two 2-D NumPy matrices, one for the inputs and the other for the corresponding outputs.

import numpy as np

transition_input = np.array([[0, 0],
                             [0, 1],
                             [1, 0],
                             [1, 1],
                             [2, 0],
                             [2, 1]])

transition_output = np.array([[1, 1, 1],
                              [2, 1, -1],
                              [0, 1, -1],
                              [1, 1, 1],
                              [1, 1, -1],
                              [3, 1, 1]])

Consider the transition_input[2] and transition_output[2]. The input denotes [current state, current symbol]. The output denotes [next state, next symbol, movement]. The movement value can either be 1(move pointer right, or increment index value) or -1(move pointer left, or decrement index value). Accordingly, transition_input[2] requires the current state to be 1 and current symbol being read to be 0. If these conditions are met, then the next state and next symbol are 0 and 1 respectively, with the pointer on the tape moving one step to the left.

The Turing Machine Code

The actual Turing Machine code follows. Its well documented as always, so you shouldn’t have a problem following whats going on if you are well-acquainted with Turing Machines and TensorFlow.

import tensorflow as tf

class TuringMachine(object):
    Implementation of a basic Turing Machine using TensorFlow.

    def __init__(self, initial_state, final_states, blank_input,
                 transition_input, transition_output):
        Initializes the TensorFlow Graph required for running the Turing

        States and symbols should both be encoded as real numbers prior to
        initialization. Accordingly, 'initial_state' should be a float
        denoting the starting state. 'final_states' should be an iterable
        of numbers, denoting all the states the Machine can halt at.
        'blank_input' should be a float denoting the blank input for the
        'transition_input' and 'transition_output' will together denote
        the transition function followed by the Machine. Both should be
        2-D NumPy matrices. Each inner vector in 'transition_input' will
        contain [current state, current symbol], while each inner vector
        in 'transition_output' will have [next state, next symbol, move].
        "move" can either be 1: to increase pointer index by 1 (move
        right), or -1: decrease pointer index by 1 (move left).

        #Store the necessary information as attributes
        self._initial_state = float(initial_state)
        self._final_states = set(final_states)
        self._blank_input = blank_input

        #Initialize Graph
        self._graph = tf.Graph()

        n_transitions = len(transition_input)
        if len(transition_input) != len(transition_output):
            raise ValueError("Number of transition inputs and outputs " +
                             "not same")

        #Initialize all components in the Graph
        with self._graph.as_default():

            #Placeholder for current symbol thats being read from the
            self._current_symbol = tf.placeholder("float")

            #Variable to hold current state
            self._current_state = tf.Variable(self._initial_state)

            #Constant Ops for transitions
            transition_in = tf.constant(transition_input)
            transition_out = tf.constant(transition_output)

            #Generate Ops to compute transition
            concat_in = tf.pack([self._current_state,
            packed_in = tf.pack([concat_in for i in
            transition_diff = tf.reduce_sum(
                tf.pow(tf.sub(packed_in, transition_input), 2), 1)
            #This will return a 0 if theres an accurate match for the
            #transition condition (current state, current symbol). If not,
            #this will return a non-zero value.
            self._transition_match = tf.reduce_min(transition_diff, 0)
            #This will correspond to the index of the match
            match_index = tf.cast(tf.argmin(transition_diff, 0), "int32")
            #Pick out the transition output
            self._transition = tf.reshape(
                tf.slice(transition_out,  tf.pack([match_index, 0]),
                         tf.pack([1, 3])), [3])
            #This will change state Variable
            self._state_change = tf.assign(
                self._current_state, tf.reshape(
                    tf.cast(tf.slice(self._transition, tf.constant([0]),
                                     tf.constant([1])), "float32"), []))

            #This will reset the state Variable
            self._state_reset = tf.assign(self._current_state,

            #Initialize Session
            self._sess = tf.Session()
            #Initialize Variables
            init_op = tf.initialize_all_variables()

    def run(self, input_tape, start_index, max_iter=1000):
        Runs the Turing Machine with the given input.
        'input_tape' must be an index-able object, with 'start_index'
        denoting the index number to start computation at.

        Obviously, beware of inputs that may cause the machine to run
        indefinitely. To control this, set 'max_iter' to some number of
        iterations you would want to allow.

        Will return final state and final index. If Machine halts at a
        non-final state, or goes beyond max_iter, i.e. the Machine does
        not accept the input, will return None, None.
        The input tape will be modified in-place.

        #First reset the state

        #Run the Machine
        index = start_index
        for i in range(max_iter):
            #Figure out input to be given
            if (index > -1 and len(input_tape) > index):
                machine_input = input_tape[index]
                machine_input = self._blank_input

            #Run one step of the Machine
            _, match_value, transition = self._sess.run(
                [self._state_change, self._transition_match,
                feed_dict = {self._current_symbol: machine_input})

            #If match_value is not zero, return None, None
            if match_value != 0:
                return None, None

            #First modify tape contents at current index as per transition
            if (index > -1 and len(input_tape) > index):
                #Index is within input tape limits
                input_tape[index] = transition[1]
            elif index == len(input_tape):
                #Index has gone beyond the range of the input tape,
                #to the right
                #Index has gone beyond the range of the input tape,
                #to the left
                input_tape.insert(0, transition[1])
                index = 0

            #Change index as per transition
            index = index + transition[2]

            #If its a final state, return state, index
            if transition[0] in self._final_states:
                return transition[0], index

        return None, None

A few things:

1. If the tape ‘pointer’ goes beyond the limits of input_tape, the blank input is automatically given to the machine (lines 114-118). Moreover,  depending on which end of the tape the pointer has gone across, the output value from the machine is inserted into the list (lines 130-142).

2. Since there is no way to check if the Machine will stop, the run method lets the user provide a maximum number of iterations. If the machine keeps running beyond the limit, (None, None) is returned.

3. The input_tape list is modified in-place.

4. If you want to track the transitions the Turing Machine follows, add a ‘print’ statement (or some kind of logging) at line 125.

Running the Turing Machine

First we define the contents of the input tape, and the index to start at.

>>> input_tape = [0, 1, 1, 1, 1, 1]
>>> start_index = 0

Then, we initialize a TuringMachine instance with the algorithm as encoded above:

>>> machine = TuringMachine(initial_state, final_states, blank_input,
                        transition_input, transition_output)

Finally, we ‘run’ the machine, and check if the input tape has been modified correctly:

>>> machine.run(input_tape, start_index)
(3, 5)
>>> input_tape
[1, 1, 1, 1, 1, 1, 1]
>>> machine.run(input_tape, start_index)
(3, 9)
>>> input_tape
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

Thats it! This is my first implementation of a Turing Machine, so do leave a comment if there are any errors/improvements possible. Cheers!

Predicting Trigonometric Waves few steps ahead with LSTMs in TensorFlow

I have recently been revisiting my study of Deep Learning, and I thought of doing some experiments with Wave prediction using LSTMs. This is nothing new, just more of a log of some tinkering done using TensorFlow.

The Problem

The basic input to the model is a 2-D vector – each number corresponding to the value attained by the corresponding wave. Each wave in turn is: (a constant + a sine wave + a cosine wave). The waves themselves have different magnitudes, initial phases and frequencies. The goal is to predict the values that will be attained a certain (I chose 23) steps ahead on the curve.

So first off, heres the wave-generation code:

##Producing Training/Testing inputs+output
from numpy import array, sin, cos, pi
from random import random

#Random initial angles
angle1 = random()
angle2 = random()

#The total 2*pi cycle would be divided into 'frequency'
#number of steps
frequency1 = 300
frequency2 = 200
#This defines how many steps ahead we are trying to predict
lag = 23

def get_sample():
    Returns a [[sin value, cos value]] input.
    global angle1, angle2
    angle1 += 2*pi/float(frequency1)
    angle2 += 2*pi/float(frequency2)
    angle1 %= 2*pi
    angle2 %= 2*pi
    return array([array([
        5 + 5*sin(angle1) + 10*cos(angle2),
        7 + 7*sin(angle2) + 14*cos(angle1)])])

sliding_window = []

for i in range(lag - 1):

def get_pair():
    Returns an (current, later) pair, where 'later' is 'lag'
    steps ahead of the 'current' on the wave(s) as defined by the

    global sliding_window
    input_value = sliding_window[0]
    output_value = sliding_window[-1]
    sliding_window = sliding_window[1:]
    return input_value, output_value

Essentially, you just need to call get_pair to get an ‘input, output’ pair – the output being 23 time intervals ahead on the curve. Each have the NumPy dimensionality of [1, 2]. The first value ‘1’ means that the batch size is 1 – we will feed one input at a time while training/testing.

Now, I don’t pass the input directly into the LSTM. I try to improve the LSTM’s understanding of the input, by providing its first and second derivatives as well. So, if the input at time t is x(t), the derivative is x'(t) = (x(t) – x(t-1)). Following the analogy, x”(t) = (x'(t) – x'(t-1)). Here’s the code for that:

#Input Params
input_dim = 2

#To maintain state
last_value = array([0 for i in range(input_dim)])
last_derivative = array([0 for i in range(input_dim)])

def get_total_input_output():
    Returns the overall Input and Output as required by the model.
    The input is a concatenation of the wave values, their first and
    second derivatives.
    global last_value, last_derivative
    raw_i, raw_o = get_pair()
    raw_i = raw_i[0]
    l1 = list(raw_i)
    derivative = raw_i - last_value
    l2 = list(derivative)
    last_value = raw_i
    l3 = list(derivative - last_derivative)
    last_derivative = derivative
    return array([l1 + l2 + l3]), raw_o

So the overall input to the model becomes a concatenated version of x(t), x'(t), x”(t). The obvious question to ask would be- Why not do this in the TensorFlow Graph itself? I did try it, and for some reason (which I don’t understand yet), there seems to seep in some noise into the Variables that act as memory units to maintain state.

But anyways, here’s the code for that too:

import tensorflow as tf
from tensorflow.models.rnn.rnn import *

#Input Params
input_dim = 2

##The Input Layer as a Placeholder
#Since we will provide data sequentially, the 'batch size'
#is 1.
input_layer = tf.placeholder(tf.float32, [1, input_dim])

##First Order Derivative Layer
#This will store the last recorded value
last_value1 = tf.Variable(tf.zeros([1, input_dim]))
#Subtract last value from current
sub_value1 = tf.sub(input_layer, last_value1)
#Update last recorded value
last_assign_op1 = last_value1.assign(input_layer)

##Second Order Derivative Layer
#This will store the last recorded derivative
last_value2 = tf.Variable(tf.zeros([1, input_dim]))
#Subtract last value from current
sub_value2 = tf.sub(sub_value1, last_value2)
#Update last recorded value
last_assign_op2 = last_value2.assign(sub_value1)

##Overall input to the LSTM
#x and its first and second order derivatives as outputs of
#earlier layers
zero_order = last_assign_op1
first_order = last_assign_op2
second_order = sub_value2
total_input = tf.concat(1, [zero_order, first_order, second_order])

If you have an idea of what might be going wrong, do leave a comment! In any case, the core model follows.


The Model

So heres the the TensorFlow model:

1) The Imports:

import tensorflow as tf
from tensorflow.models.rnn.rnn import *


2) Our input layer, as always, will be a Placeholder instance with the appropriate type and dimensions:

#Input Params
input_dim = 2

##The Input Layer as a Placeholder
#Since we will provide data sequentially, the 'batch size'
#is 1.
input_layer = tf.placeholder(tf.float32, [1, input_dim*3])


3) We then define out LSTM layer. If you are new to Recurrent Neural Networks or LSTMs, here are two excellent resources:

  1. This blog post by Christopher Olah
  2. This deeplearning.net post. It defines the math behind the LSTM cell pretty succinctly.

If you like to see implementation-level details too, then heres the relevant portion of the TensorFlow source for you.

Now the LSTM layer:

##The LSTM Layer-1
#The LSTM Cell initialization
lstm_layer1 = rnn_cell.BasicLSTMCell(input_dim*3)
#The LSTM state as a Variable initialized to zeroes
lstm_state1 = tf.Variable(tf.zeros([1, lstm_layer1.state_size]))
#Connect the input layer and initial LSTM state to the LSTM cell
lstm_output1, lstm_state_output1 = lstm_layer1(input_layer, lstm_state1,
#The LSTM state will get updated
lstm_update_op1 = lstm_state1.assign(lstm_state_output1)

We only use 1 LSTM layer. Providing a scope to the LSTM layer call (on line 8) helps in avoiding variable-scope conflicts if you have multiple LSTM layers.

The LSTM layer is followed by a simple linear regression layer, whose output becomes the final output.

##The Regression-Output Layer1
#The Weights and Biases matrices first
output_W1 = tf.Variable(tf.truncated_normal([input_dim*3, input_dim]))
output_b1 = tf.Variable(tf.zeros([input_dim]))
#Compute the output
final_output = tf.matmul(lstm_output1, output_W1) + output_b1


We have finished defining the model itself. But now, we need to initialize the training components. These help fine-tune the parameters/state of the model to make it ready for deployment. We won’t be using these components post training (ideally).


4) First, a Placeholder for the correct output associated with the input:

##Input for correct output (for training)
correct_output = tf.placeholder(tf.float32, [1, input_dim])

Then, the error will be computed using the LSTM output and the correct output as the Sum-of-Squares loss.

##Calculate the Sum-of-Squares Error
error = tf.pow(tf.sub(final_output, correct_output), 2)

Finally, we initialize an Optimizer to adjust the weights for the LSTM layer. I tried Gradient Descent, RMSProp as well as Adam Optimization. Adam works best for this model. Gradient Descent works really bad on LSTMs for some reason (that I can’t grasp right now). If you want to read more about Adam-Optimization, read this paper. I decided on the learning rate of 0.0006 after a lot of trial-and-error, and it seems to work best for the number of iterations I use (100k).

##The Optimizer
#Adam works best
train_step = tf.train.AdamOptimizer(0.0006).minimize(error)


5) Finally, we initialize the Session and all required Variables as always.

sess = tf.Session()
#Initialize all Variables

The Training

Here’s the rudimentary code I used for training the model:


actual_output1 = []
actual_output2 = []
network_output1 = []
network_output2 = []
x_axis = []

for i in range(80000):
    input_v, output_v = get_total_input_output()
    _, _, network_output = sess.run([lstm_update_op1,
                                    feed_dict = {
                                        input_layer: input_v,
                                        correct_output: output_v})


import matplotlib.pyplot as plt
plt.plot(x_axis, network_output1, 'r-', x_axis, actual_output1, 'b-')
plt.plot(x_axis, network_output2, 'r-', x_axis, actual_output2, 'b-')

Training takes almost a minute on my Intel i5 machine.

Consider the first wave. Initially, the network output is far from the correct one(The red one is the LSTM output):


But by the end, it fits pretty well:


Similar trends are seen for the second wave:



In practical scenarios, the state at which you end training would rarely be the state at which you deploy. Therefore, prior to testing, I ‘fastforward’ both the waves first. Then, I flush the contents of the LSTM cell (mind you, the learned matrix parameters for the individual functions don’t change).


for i in range(200):

#Flush LSTM state
sess.run(lstm_state1.assign(tf.zeros([1, lstm_layer1.state_size])))

And here’s the rest of the testing code:

actual_output1 = []
actual_output2 = []
network_output1 = []
network_output2 = []
x_axis = []

for i in range(1000):
    input_v, output_v = get_total_input_output()
    _, network_output = sess.run([lstm_update_op1,
                                 feed_dict = {
                                     input_layer: input_v,
                                     correct_output: output_v})


import matplotlib.pyplot as plt
plt.plot(x_axis, network_output1, 'r-', x_axis, actual_output1, 'b-')
plt.plot(x_axis, network_output2, 'r-', x_axis, actual_output2, 'b-')

Its pretty similar to the training one, except for one small difference: I don’t run the training op anymore. Therefore, those components of the Graph don’t work at all.

Here’s the correct output with the model’s output for the first wave:


And the second wave:w2_test

Thats all for now! I am not a deep learning expert, and I still experimenting with RNNs, so do leave comments/suggestions if you have any! Cheers!

Nelder-Mead Optimization


The Nelder-Mead method is a heuristic optimization technique, just like Genetic Algorithms or Particle Swarms. Optimization in this context refers to the problem of finding point(s) with the optimal value of an objective function in a search space. Optimal could mean a maximum or minimum value.

Whenever possible, we try to solve an optimization problem with an analytical method. For example, if you have the exact definition of f(x) as a differentiable function of x, you could use simple derivatives to compute the value of x at which f attains the optimal value. However, in many real-world problems, we don’t have such a crystal-clear notion of the function we are optimizing. A simple example would be the value of a stock market portfolio as a function of the stocks (and their weightages) in the portfolio, and time. Here, you have no way to define how the value of the overall portfolio is going to behave, and therefore analytical methods don’t make sense.

Heuristic optimization techniques try to remedy this by having a degree of randomness to them. This means that the same algorithm run with the same problem and the same initial conditions could yield different results with each run. Heuristic methods explore the search space stochastically, and keep a note of the ‘best’/most optimal points they come across on the way. Of course, the behaviour is not totally random, but guided by a set of rules.

The most important thing to remember is that heuristic methods never guarantee that they will find the most optimal solution. They give as an output the best point they came across during their search – which might have been a local optimum for all you know. But in practical scenarios, we usually concern ourselves with finding good solutions anyways- we don’t need them to be the best. In fact, finding the best solution with absolute certainty might be impossible/impractical in most cases (think of the stock market portfolio example again).

The Nelder-Mead method uses a geometrical shape called a simplex as its ‘vehicle’ of sorts to search the domain. This is why the technique is also called the Simplex search method. In layman’s terms, a simplex is the n-dimensional version of a ‘triangle’.

For 1 dimension, its a line.


For 2 dimensions, its a triangle.


For 3 dimensions, its a tetrahedron.


And so on…

The shape doesn’t have to be symmetrical/equilateral in any way. Note that an n-dimensional simplex has (n+1) vertexes.

The Algorithm

Nelder-Mead starts off with a randomly-generated simplex (We will discuss how, later). At every iteration, it proceeds to reshape/move this simplex, one vertex at a time, towards an optimal region in the search space. During each step, it basically tries out one or a few modifications to the current simplex, and chooses one that shifts it towards a ‘better’ region of the domain. In an ideal case, the last few iterations of this algorithm would involve the simplex shrinking inwards towards the best point inside it. At the end, the vertex of the simplex that yields that most optimal objective value, is returned.

The following video of a 2-D Nelder-Mead optimization should give you an intuitive understanding of what happens:

How does it work?

Lets assume we are dealing with an n-dimensional space. The current simplex consists of the following n + 1 points: x_1, x_2, …, x_{(n + 1)}. The function we are trying to optimize(maximize) is f(x). The algorithm takes the following steps during every iteration:

I. Ordering

All the points are ordered/sorted, such that the value of f for the first point is the lowest, and that for the last point is the highest. Let the indices of the first(worst), second(second worst) and last(best) points be h, s, l respectively.

II. Computing the Centroid

Consider all points except the worst (x_h). Compute their centroid(mean) as:

c = \frac{1}{n} \sum_{i \neq h} x_i

III. Transformation

This is the most important part of the whole process – transforming the simplex. Transformation takes place in one of the following ways:

i. Reflection

This is the first type of transformation that is tried out. You compute the ‘reflected’ point as:

x_r = c + \alpha (c - x_h)

\alpha is called the reflection parameter, and is usually equal to 1. x_r is essentially a point on the line joining c and x_h, but located away from x_h. This is done in an attempt to move the simplex in a direction thats away from the sub-optimal region around x_h. For a 2-D problem, heres what reflection looks like:

NelderMead_1Now, if f(x_s) < f(x_r) \leq f(x_l) – that is, that x_r is better than the second-worst point, but not better than the current best, we just replace x_h with x_r in the simplex, and move on to the next iteration.

ii. Expansion

If our luck is great and the reflected point x_r happens to be better than the current best (f(x_r) > f(x_l)), we try to explore the possibility further. In other words, we move a little bit more in the direction of x_r from c, in the hope of finding an even better solution. The expanded point is defined as:

x_e = c + \gamma (x_r - c)

\gamma is called the expansion parameter, and its value in most implementations is 2. Heres what 2-D Expansion looks like:


We then replace x_h with the better of the two points: x_e and x_r in the simplex. This is called ‘greedy optimization‘ – we have replaced the worst point with the better of the two options we had. Theres another variant called ‘greedy expansion‘, which replaces x_h with x_e, as long as f(x_e) > f(x_l). This is done irrespective of whether x_e is better than x_r. The intuition for this, comes from the fact that Expansion always leads to a bigger simplex, which means more exploration of the search space.

iii. Contraction

Suppose our reflection point was worst than x_s (i.e. the second worst point). In that case, we need to realise that the direction defined by x_r may not be the ideal one for moving. Therefore, we end up contracting our simplex.

The contraction point x_c is defined as:

x_c = c + \beta (x_h - c)

\beta is called the contraction parameter, and is usually equal to 0.5. In essence, we try to go against our exploration that we tried with x_r, instead contracting the simplex ‘inwards’. A 2-D contraction looks like this:

NelderMead_3If f(x_c) > f(x_h), which means that the contracted point is better than the current-worst, then we replace x_h with x_c in the simplex. However, if this is not the case, then we go to our last resort: the Shrink contraction.

iv. Shrink contraction

In this case, we redefine the entire simplex. We only keep the best point (x_l), and define all others with respect to it and the previous points. The jth new point, will now be defined as:

x_j = x_l + \delta (x_j - x_l)

\delta is called the shrinkage parameter, and is equal to 0.5 in most implementations. What we are essentially doing with the above definition, is moving each point in the simplex towards the current best point, in the hope of converging onto the best neighbourhood. A 2-D Shrinking transformation looks like:


It is important to note that this transformation is the most expensive of all the transformations, since we have to replace multiple (all but one) points in the simplex. However, it has been found (after multiple experiments), that the shrink transformation rarely happens in practice. As a result, many implementations of Nelder-Mead simply skip this step, doing the Contraction instead.


Termination is usually reached when:

1. A given number of iterations is reached

2. The simplex reaches some minimum ‘size’ limit.

3. The current best solution reaches some favorable limit.

The Initial Simplex

Like in the case of most heuristic techniques, theres no ‘best’ way to do the random initialization for Simplex search. However, heres one that supposedly works well (it is used in the MATLAB implementation of Nelder-Mead):

The initial point, x1, is provided by the user if he has some idea about a possible good solution. If not, its taken to be some random point. This is called the ‘guess‘ vertex. Now, remember we have n dimensions, and we need (n + 1) points – out of which 1 has already been decided. The other end points are initialized with respect to x_1, at a small distance along the direction of each dimension.

The (i + 1)th point x_{(i + 1)}, as defined using the ith dimension unit vector u_i, is given by:

x_{(i + 1)} = x_1 + h(x_1, i) * u_i

h(x_1, i) is defined as:

h(x_1, i) = 0.05 if the coefficient of u_i in the definition of x_1 is non-zero.

h(x_1, i) = 0.00025 if the coefficient of u_i in the definition of x_1 is zero.

Other Points

1. Nelder-Mead is generally used for parameter selection in Machine Learning. In essence, data scientists use the Simplex Method to figure out the optimum parameters for their models. This is mainly because this method tends to optimize the objective function fairly fast, and in an efficient way (especially in implementations where Shrinking is not used).

2. Even though Nelder-Mead tends to optimize the objective fairly fast (with few iterations), it tends to get stuck in local optima. In such cases, it may remain stuck in a sub-optimal neighbourhood for a long time, and the only solution to this is to restart the algorithm. Why this happens is fairly easy to understand. Once the simplex enters a region of local optimum, it will only keep contracting/shrinking, instead of randomly exploring other(far-away) parts of the search space. As a result, its becomes impossible for the algorithm to explore any further.

3. Scipy implementation of Nelder-Mead.


  1. Scholarpedia article
  2. Wikipedia article
  3. Jeff Heaton’s book
  4. This StackOverflow answer

Redis + Celery: Reactive Computing in Django for IoT applications

To quote Wikipedia, Reactive Programming is a programming paradigm oriented around data flows and the propagation of change.

Consider the simple equation a = b + c. This relationship defines a flow of data, in that it describes how the data for a is computed given the data for b and c. Suppose the values of b and c at time t=0 are 4 and 5 respectively. Therefore, a_{t=0} = 9. Now, lets say that the value of b changes at t=1 so that b_{t=1} = 10. By the definition of a mentioned before, it follows that a_{t=1}=15. Thus, the change in the value of b got propagated, causing the value of its ‘parent’ a to change. I use the word parent because if we were to draw the relationship between a, b and c as a tree, a be the parent of b and c (like in a data flow graph). A common example of reactive programming is seen in spreadsheet software like MS-Excel, where the change in the value of a particular cell causes a change in the values of cells that depend on it by some pre-defined equation.

RxJS and Meteor are two javascript-based frameworks that implement reactive programming for web-based applications.

Reactivity, in general, is a paradigm that’s difficult to find in Python-based systems, and that includes Django. This post describes a method to perform mathematical computations (not reactive programming in general, just mathematical calculations) in a reactive manner in Django-based systems. I will soon describe the exact problem solved by my code, with suggestions on how to modify it to suit your needs.


Heres the prerequisites, apart from Django:

i. We will use Celery to perform tasks asynchronously.

ii. We will use Redis as for temporary data storage, and synchronization.

This is a good place to look if you want to set up both.

iii. We will use python-redis-lock to use Redis for locking mechanisms. Locking ensures that one data update/read doesn’t get messed up due to another update that’s running in parallel.

The ‘Problem Statement’

Consider the two equations:

A = C + D + E …(1)

B = A * F …(2)

We will refer to A, B, C, D, E and F as computational nodes. C, D, E and F are input values that we acquire from certain sources of data (whatever they might be). They are essentially the independent variables in this context. A‘s value is derived C, D and E (making it their ‘parent’ node). You can also say that C, D and E are the arguments for A. B in turn has two dependencies, A and another independent variable F.

You don’t know when you will receive the values of the independent variables. You also don’t know what order they will arrive in. Some of them may come together, some very frequently, etc. etc. Lets say that your application dictates that the value of any node remains valid for about a minute its received or computed. Whenever you have had the latest values of C, D and E come within the past 60 seconds, you would want to compute the value of A. The same logic applies to B.

Once the latest value of A is computed, you would like the node to ‘forget’ the values of its arguments(children). Therefore, whenever comes the next time that we have fresh values of C, D and E together, A will be calculated again. The same logic once again applies to B. The code can be easily modified (and I will mention how) so that this restriction is lifted. In that case, A would be recomputed whenever the value of any of C, D and E changes, provided that the other values are still valid.

Whenever you receive/compute the value of a computational node (whether independent or derived), you would want to store its value – along with the appropriate timestamp – in your database.

Now lets see how this data-flow logic can be implemented in a flexible, efficient manner.

The Implementation

The whole functionality will be enclosed in a Django App called ‘reactive_compute’. It will have the following files:

i. models.py

ii. admin.py

iii. db.py

iv. functions.py

v. functions.py

and obviously

vi. __init__.py

First of all, we will need to define the derived nodes A and B, in terms of their Function and Number of Arguments. For example, A‘s function can be described as an Addition, with the number of arguments being 3. B on the other hand is a Multiplication, with 2 arguments. To remember this, we will first need to define the database model in models.py. You don’t store this information in Redis, since its ‘permanent’.

The model:

from django.db import models

class FunctionType(models.Model):
    Stores information about the type of function/computation
    associated with a node.
    The function name should have a corresponding mapping in

    node = models.CharField('Node', max_length=25)
    functionname = models.CharField('Function Name', max_length=25)
    noofargs = models.IntegerField('Number of Arguments')

As you can see, functionname is a String that will act as an identifier for the Function. We will have “ADD” for Addition, and “MUL” for Multiplication. But there must be someplace to implement those functions. Thats where the functions.py files comes in.

##File with all computational functions

def add(args):
    Addition Function.
    return sum(args)

def sub(args):
    Subtraction Function.
    return (args[0] - args[1])

def mul(args):
    Multiplication function.
    prod = 1
    for arg in args:
        prod *= arg
    return prod

class Functions(object):
    Class to store mapping between function name and function

    mapping = {
        'ADD': add,
        'SUB': sub,
        'MUL': mul

Adding the mapping as a dictionary to the Functions class allows easy importing. Each of the functions takes in the arguments(in appropriate order) as a list.

The following screenshot shows the appropriate database entries made for A and B in the Django Admin.

Screenshot from 2016-01-04 22:51:04

Now that we have defined the ‘nodes’ we need to direct the appropriate args to them. This is done by creating another model in models.py, for storing the data-flow information as tree ‘edges’. Every child is an argument for the parent.

class Dependency(models.Model):
    Models a Parent-Child relationship between computational
    nodes. Also defines the order in which the children are to be
    arranged, to get the Parent's value.

    parent = models.CharField('Parent Node', max_length=25)
    child = models.CharField('Child Node', max_length=25)
    argno = models.IntegerField('Argument Number')


argno doesn’t really matter for the operations of multiplication and addition, but for others(like subtraction), it might.

Heres a Django-Admin screenshot with the required entries for A and B.

Screenshot from 2016-01-04 22:52:05

As I mentioned earlier, we would like to store the values of the nodes, whenever they are updated. So heres a model for that:

class DataLog(models.Model):
    Stores information about the type of function/computation
    associated with nodes.

    node = models.CharField('Node', max_length=25)
    timestamp = models.DateTimeField('Time Stamp')
    value = models.FloatField('Value', null=True)

And heres the contents of the db.py file, that define a function to store the latest value of a node into the database:

from reactive_compute.models import DataLog
from django.utils import timezone

def save_node_value(node, value):
    Saves the latest computed value of the given node.
    data_entry = DataLog(node=str(node),

To present a complete example, the following are the contents of the admin.py file.

from django.contrib import admin
from reactive_compute.models import *

class DependencyAdmin(admin.ModelAdmin):
    list_display = ('parent', 'child', 'argno')
    list_filter = ['parent', 'child']
    search_fields = ['parent', 'child']

admin.site.register(Dependency, DependencyAdmin)

class FunctionTypeAdmin(admin.ModelAdmin):
    list_display = ('node', 'functionname', 'noofargs')
    list_filter = ['functionname']
    search_fields = ['node', 'functionname']

admin.site.register(FunctionType, FunctionTypeAdmin)

class DataLogAdmin(admin.ModelAdmin):
    list_display = ('node', 'timestamp', 'value')
    list_filter = ['node', 'timestamp']
    search_fields = ['node']

admin.site.register(DataLog, DataLogAdmin)

And finally, we come to the crux of the implementation – the tasks.py file. If you read the resource I mentioned for Celery installation, you know that the Celery task is defined in this file. I have heavily commented it, so do go through the in-line docs.

from __future__ import absolute_import
from celery import shared_task
from reactive_compute.models import Dependency, FunctionType
from reactive_compute.db import save_node_value
from reactive_compute.functions import Functions
from redis import StrictRedis
import redis_lock

def compute_nodes(nodes, timeout=60):
    Computes values for all computational nodes as defined in the
    FunctionType model. Recursive calls are made based on information
    present in respective Dependency entries.

    'nodes' should be a dict mapping name of computational node, to:
    A floating point number as input, OR
    None, if the value is to be (possibly) computed.
    'timeout' defines the time interval for which the values of the
    nodes mentioned in 'nodes' will be valid. Default is a minute.

    #First handle the boundary case
    if len(nodes) == 0:
        return None

    #Get a connection to Redis
    conn = StrictRedis()

    #This will contain all the parent nodes that we will try to compute
    #recursively based on the args currently provided.
    parents = set([])

    #Default initialization for Celery
    value = None

    #Iterate over all nodes
    for node in nodes:
        ##First obtain the value of the node
        if nodes[node] is not None:
            #Value has been provided as input
                #Ensure the given value can be parsed/converted to
                #a float.
                value = float(nodes[node])
                #If representing the value as a float fails,
                #skip this one.
            #Value has to be computed.

            #First acquire lock for the particular node.
            #This ensures that the inputs needed for computing
            #the current node don't get modified midway(unless
            #one expires, in which case the computation may or may
            #not go through).
            lock = redis_lock.RedisLock(conn, node + '_lock')
            if lock.acquire():
                    #This will indicate if all args are present for
                    #computation of the result
                    all_args_flag = True
                    #This will store all the required arguments in order
                    args = []
                    #Get the pertinent FunctionType instance
                    func_info = FunctionType.objects.get(node=node)
                    #Iterate over all arguments
                    for i in range(func_info.noofargs):
                        #Get Redis value
                        v = conn.get(node + '_' + `i`)
                        if v is None or v == 'None':
                            #If value not present stop iterations
                            all_args_flag = False
                    #If any arg was absent, abort processing of this node
                    if not all_args_flag:
                    #Compute the value, since all args are present
                    value = Functions.mapping[func_info.functionname](
                    #Delete info about current args
                    for i in range(func_info.noofargs):
                        conn.delete(node + '_' + `i`)
                    #Release lock

        ##Now that the value has been obtained, update the args info
        ##for all parent nodes
        parent_objs = Dependency.objects.filter(child=node)
        for parent_obj in parent_objs:
            #Get lock for parent
            lock = redis_lock.RedisLock(conn, parent_obj.parent + '_lock')
            if lock.acquire():
                    #Set value
                    conn.set(parent_obj.parent + '_' + `parent_obj.argno`,
                    #Set expiry time
                    conn.expire(parent_obj.parent + '_' + `parent_obj.argno`,
                    #Release lock
            #Add this parent to the set of parents to process

        #Save value in database as needed
        save_node_value(node, value)

    #Make the recursive call on parent nodes
    compute_nodes.delay(dict((parent, None) for parent in parents),

Given below is an example of using the above task in a simple Django view. It accepts the complete input as a stringified JSON object (like “\{"C": 4, "D": 5\}“), and makes a call to the Celery task.

from django.http import HttpResponse
from django.views.decorators.csrf import csrf_exempt
from reactive_compute.tasks import compute_nodes

def input_values(request):
    #Get JSON-ed values dictionary
    value_dict = eval(request.GET.get('input'))
    #Make Celery call

    return HttpResponse("OK")

Lets run the server and see how things work.
Heres that the data-logs look like given the input “\{"C": 4, "D": 5\}“.

Screenshot from 2016-01-04 22:55:12

Within a minute, we provide “\{"E": 6\}“.

Screenshot from 2016-01-04 22:55:29

Notice how A gets computed as well. Now let us provide all inputs, with new values: “\{"C": 1, "D": 1, "E": 1, "F": 3\}“.

Screenshot from 2016-01-04 22:56:08

As you must have observed, B is computed with the latest value of A, which is 3.

One unusual flexibility that this codes provide, is that we can update the values of nodes that are ‘usually’ derived, manually. Heres what happens if we provide “\{"A": 4, "F": 5\}“.

Screenshot from 2016-01-04 22:56:33

You don’t need to Copy-Paste this code anywhere, its all present in my github repo. Feel free to fork it and send a PR if you can modify it in some useful way!


1. The way its implemented, you have complete control over how each computational node works. I have used simple mathematical operations, but you could have your own metrics/functionality written in. As long as you define the arguments and order them right, things will work well.

2. I have worked with floating-point numbers, but your arguments could as well be vectors! Then you might have to run eval over your Redis values, instead of a simple float() conversion. But since every call is made as a Celery task, your runtime will still be optimized.

3. Earlier, I mentioned that you could remove the restriction of deleting previous values of arguments (as long as they are valid). You could do this pretty easily by removing lines 85-87 in the tasks.py code. If you don’t want your node values to have a sense of time-based validity at all, you could just remove lines 106 and 107 from tasks.py.

4. If you want different values to be stored in different ways into the database, you could add another parameter to the FunctionType model, that specified a String that denotes how information about that particular node is added to the database. Ideally, this database storing would also be a Celery task.

5. You could also store the Data-Flow diagram better using a No-SQL database such as MongoDB.

Thats all for now :-). Cheers!


Fuzzy Speciation in Genetic Algorithms using k-d Trees

You would need to know the basics of Genetic Algorithms to understand this post. AI Junkie’s tutorial is a good place to get to upto speed. You would also need to know what k-D Trees are, for which the Wiki article is a good reference.


What is speciation?

Speciation, in simple terms, is an evolutionary process by which new biological species arise. This concept is used in the implementation of evolutionary algorithms (such as GAs) as a form of optimization. To understand speciation, its necessary to understand what we mean by ‘species’.

A species is a group of organisms that – 1) possess some common prominent traits, and 2) can breed/mate with one-another. For example, consider cats and leopards. They share many characteristics, and hence exist in the same ‘family’ of animals. However, since they can’t really mate to provide a cat+leopard offspring, they don’t belong to the same species. The part 2) of the above definition may not be applicable in all cases (such as asexual plants). Nevertheless, it is what is important to us in the context of evolutionary algorithms.

Basically, a genetic algorithm (that implements speciation) will divide the population of candidates into a number of species. Essentially, each species will be a group of solutions that are allowed to crossover with each other. Candidates belonging to different species rarely mate.

Why do we need speciation?

A Genetic Algorithm chiefly deals with two genetic operations: Crossover and Mutation. Mutation works on a single candidate solution by perturbing it in some small way. Crossover, on the other hand, takes two candidate ‘parents’ (selected based on their fitness) and produces offspring(s) that share parts of their genome with the parents. This seems good enough, but it has some serious pitfalls.

Many times, crossover has a pretty high rate of failure (that is, it produces very inferior solutions). The reason for this, in simple terms, is: though both parents may individually have very high levels of fitness, the result of combining random parts of their genome together can be sub-optimal. This tends to happen when the parents come from distant parts of the search space, causing the offspring to lie in some entirely new region that has no optima in the vicinity. For example, imagine you are trying to evolve the fastest mammal. During a crossover, you select a deer (a fast mammal), and a cheetah(the fastest indeed). In this case, even if you could breed their offspring, it may have a random mixture of the anatomical features of its parents thats disastrous from the point of view of speed. In other words, mating fit parents does not guarantee a fit offspring, if the properties of the parents are vastly different.

In cases like these, speciation comes to the rescue. By restricting mating between greatly dissimilar candidates, evolution of sub-par candidates is minimized.

How is speciation implemented?

Usually, the two most common ways of implementing speciation in evolutionary algorithms are: Threshold Speciation and K-Means/Medoids clustering. Both are explained pretty well in Jeff Heaton’s book on Nature-Inspired algorithms in Artificial Intelligence.

The key feature of speciation is the fact that it prevents mating among candidates with highly differing characteristics. This can be thought of as promoting breeding between candidates that lie close to each other in the search-space. To achieve the same effect, I have been using k-d trees to perform nearest-neighbour searches to provide a form of speciation to my implementation of GAs. Its pretty simple(and intuitive) in terms of its approach, doesn’t degrade performance by too much and has been providing good results so far. So, here’s a basic outline of it for you all.

Implementation using k-D Trees

I will outline the algorithm first, and point out the nuances later.

First off, you must have a way of computing the ‘distance’/dissimilarity between candidate solutions. If your genome has a fixed length, you can use Euclidean distance.

Extra attributes required: 1. Starting Temperature (T), and 2. Neighbourhood Value(V). T must be in the range (0. 1). The higher this value, the greater is the tendency to neglect speciation (especially in the earlier iterations). The Neighbourhood Value V denotes the number of nearest candidates that will be considered in searches.


I) The following selection process is repeated in each iteration until the new population size does not reach the required value:

Initial Values: Temperature t = T

i. Pick a random candidate parent based on fitness (using something like Roulette-Wheel selection, or Tournament selection)

ii.a. Pick a random number in (0, 1). If it is lesser than the current value of t, pick another parent randomly using the same selection process used in step i. If not, goto step ii.b.

ii.b. Using the k-D Tree trained over candidates from the current iteration, find the V nearest candidates to the parent selected in step i. Choose one randomly (using the process employed in step i), based on fitness.

iii. Use crossover with parents produced in steps i and ii. Mutate as/when appropriate. Add offspring to the pool of candidates of the next generation.

II) Once the new population grows to the required size, a new k-D Tree is generated using the set of all new candidates for the next iteration.

III) Decrease t for the next iteration.


1) First of all, you need to have a way to compute dissimilarity between candidate solutions (apart from their fitness). As mentioned, if your genome size is constant, you could always use good-old Euclidean distance. If not, you might have to be creative. If you are using Scikit-Learn’s implementation of k-D Trees like me, you could check out the metrics it supports by doing kd_tree.valid_metrics over an instance.

2) Because the selection of the first parent in step i. depends only by the global measure for fitness, parts of the search space that generate good solutions will have higher numbers of candidates in subsequent iterations.

3) Unlike using K-Medoids for speciation, you don’t need to have a strict definition of the species that a given candidate belongs to. Nearest-neighbour searching ensures that a candidate usually mates only with other candidates that are similar to it.

4) As you might have understood, the Temperature parameter only adds some amount of exploratory tendencies to the algorithm. The higher the temperature, the greater is the tendency to ignore speciation. The decrease in t could most likely be done linearly from the user-provided starting point T, to a pre-defined minimum value(like 0.05), across iterations.

Effects on Performance of the GA

As mentioned in the discussion here, it doesn’t make much sense to analytically define the time-complexity of GAs. However, let me give you a rough idea of how things will change. Assume the total size of the population is n. First off, Roulette-Wheel selection (done in step i) takes O(n) time if done in a naive way (going through all candidates in some order one by one). If you do it using the stochastic acceptance method, it takes O(1) time. Usually, this selection will be done twice during every crossover. In case of the algorithm outlined above, for approximately (1-t) fraction of the second crossovers, the Roulette-wheel selection will be replaced by an O(log(n)) search using the k-D Tree and a much smaller (almost O(1)) Roulette Wheel selection amongst the nearest V candidates. Hence, unless your population size is very, very large, this won’t affect time that drastically.

The only significant increase in time will be seen as a result of the k-D Tree learning at the end of each iteration, which would add an O(nlog(n)) time to the time required for generating the next-gen pool of candidates (which is O(n^2) for naive implementations anyways). Since O(n^2 + nlog n) is equivalent to O(n^2), the performance decrease isn’t that significant, provided the size of your population isn’t really high. Even if you were using a highly efficient selection procedure during crossovers, for reasonable population sizes, your total run-time would only be multiplied by a factor less than 2 (mainly because crossover and mutation by themselves take some time too).

Thats all for now! Keep reading and cheers 🙂

Using Circular Self-Organizing Maps to solve the Symmetric TSP


1. Knowing what the Traveling Salesman Problem (TSP) is. The wiki article is a good place to start. The symmetrical form of the problem is where the distance from one city to another is the same in both directions.


2. A decent understanding of what Kohonen/Self-Organizing Maps are. This blog post will point you in the right direction to learn more.

After I wrote my former blog post on implementing SOMs with TensorFlow, someone on Reddit mentioned that there are ways to solve TSPs using them. When I did some research, it appeared that there have been a few attempts to do so. Heres one, and heres another. So in my spare time, I decided to try and have a crack at it myself.

I will describe my approach (along with the code) in a short while, but let me give a disclaimer first. This post in no way claims to have found the ‘best’ way to solve TSPs (ACOs are pretty much the state of the art). It just describes my attempt, and the results I obtained (which were interesting, with respect to the methodology implemented).

I obviously make a few assumptions, and here they are:

1. Every city has a corresponding vector available, denoting its coordinates/location in a 2D/3D space. This source presents many TSP benchmark data-sets in this format. I also tried assigning every city a vector where every dimension is its distance from all the cities(including itself). This gives pretty sub-par results, so I decided not to pursue that. Maybe you could use a dimensionality reduction technique like PCA on these ‘distance’ vectors and get good results.

2. Because of the above point, my method is pretty much restricted to solving the symmetric form of a TSP (atleast the way I have done it).


The Method

Suppose the total number of neurons is N. The total number of cities in question is C. After some trial and error, I set N = 2C. It can be higher/lower as well. The lower the number, the less optimal is the understanding of the solution space. And as you go higher, after a certain point, there seems to be no improvement in the output (but your execution time increases many-fold).

The individual neurons are provided locations around a circle, as follows:

l_i = ( cos(\frac{2i\pi}{N}), sin(\frac{2i\pi}{N}) )

where l_i is the location vector for the i th neuron in a 2-D space. i goes from 0 to N - 1. It is important to not confuse this space with the space in which the cities lie (even though both are 2-dimensional). Lets call one the Neuron-Space, and the other, the City-Space.

Because of the circular nature of this SOM, we inherently have a closed loop-structure to our framework (as is needed by the TSP solution), where neurons in proximity of each other in the Neuron-Space lie near-by in the City-Space. Also, do note how every neuron essentially corresponds to an angle ( \frac{2i\pi}{N}) around the circle. This point would be useful soon.

The initialization of the neuron vectors in City-Space is done by using the mean and standard deviation of the dimensions of the city vectors in City-Space. This is because I observed that the scale of all the dimensions in the city-vectors is pretty much the same. If it isn’t, you might have to resort to building a mean vector and covariance matrix. For training, I use \alpha = 0.3 and \sigma = 0.3 as parameters for the SOM. Putting a constant value of \sigma ensures that the same proportion of neurons around the Best-Matching Unit get affected everytime, irrespective of the total number of neurons being used.

Once the SOM trained, we can say we have the rough ‘skeleton’ for a good route in the City-Space. But how do you convert this skeleton to the appropriate route? By using those angles I just talked about. Consider a city c. Lets say the k nearest neurons to it, are n_1, n_2, ..., n_k.

So, we can construct a vector for every city in Neuron-Space, using this simple formula:

l_c = \sum_k \frac{1}{d_{c, n_k}} l_c

where d_{c, n_k} is the distance between city c and neuron n_k in City-Space. Using this vector and some pretty simple (inverse) trigonometry, you can assign the city an angle around the circle on which the SOM-neurons themselves lie in Neuron-Space. Now all you need to do to come up with an approximately good solution, is just sort the cities with respect to the angles they are assigned. Voila!

Ofcourse, the solution that this algorithm gives you would most likely be in the neighbourhood of a slightly better one, so I use a standard TSP local search heuristic like 3-opt over the solution obtained from my algorithm. What 3-opt essentially does, is change around the solution obtained from some other algorithm over so slightly in some pre-defined ways, in the hope of hitting on a better solution. 3-opt is better than 2-opt for this algorithm. I dint try higher-order k-opt searches.

One interesting possibility about this algorithm, is the fact that it could provide great solutions to solving TSPs for different target cities, without retraining the SOM– as long as they lie in the same City-Space. As long as the set of cities you try to solve for follows (atleast approximately) the trends shown by the dataset used for training, all you would need to do is compute their respective angles (which does not involve re-training), and just apply 3-opt. In short, you could use the same SOM, once trained, to solve for multiple subsets of cities.

The Implementation

This code builds on my Tensorflow-based SOM code. Its pretty rough around the edges, especially the 3-opt search. It was meant just as a proof-of-concept, so feel free to hack around if you like.

import tensorflow as tf
import numpy as np

def _distance(tsp_solution, distance_matrix):
    Total distance for a solution.
    total_distance = 0
    for i in range(len(tsp_solution)-1):
        total_distance += (distance_matrix[tsp_solution[i]][
            tsp_solution[i + 1]])
    total_distance += (distance_matrix[tsp_solution[0]][
    return total_distance

def _3_opt(route, distance_matrix):
    Uses 3-opt to find a better solution in the neighbourhood.

    current_dist = _distance(route, distance_matrix)
    #Find three connections to break
    for i in range(len(route) - 1):
        for j in range(i + 2, len(route) - 1):
            for k in range(j + 2, len(route) - 1):
                #Now reconnect the graph in every other way possible,
                #and see if any gives a better solution
                #Way 1
                temp = route[:]
                temp[j+1:k+1] = reversed(route[j+1:k+1])
                dist = _distance(temp, distance_matrix)
                if dist < current_dist:
                    route = temp
                    current_dist = dist
                #Way 2
                temp = route[:]
                temp[i+1:j+1] = reversed(route[i+1:j+1])
                dist = _distance(temp, distance_matrix)
                if dist < current_dist:
                    route = temp
                    current_dist = dist
                #Way 3
                temp = route[:]
                temp[i+1:j+1], temp[j+1:k+1] = (
                    reversed(route[i+1:j+1]), reversed(route[j+1:k+1]))
                dist = _distance(temp, distance_matrix)
                if dist < current_dist:
                    route = temp
                    current_dist = dist
                #Way 4
                temp = (route[:i+1] + route[j+1:k+1] +
                        route[i+1:j+1] + route[k+1:])
                dist = _distance(temp, distance_matrix)
                if dist < current_dist:
                    route = temp
                    current_dist = dist
                #Way 5
                temp = route[:i+1] + route[j+1:k+1]
                temp += reversed(route[i+1:j+1])
                temp += route[k+1:]
                dist = _distance(temp, distance_matrix)
                if dist < current_dist:
                    route = temp
                    current_dist = dist
                #Way 6
                temp = route[:i+1]
                temp += reversed(route[j+1:k+1])
                temp += route[i+1:j+1]
                temp += route[k+1:]
                dist = _distance(temp, distance_matrix)
                if dist < current_dist:
                    route = temp
                    current_dist = dist
                #Way 7
                temp = route[:i+1]
                temp += reversed(route[j+1:k+1])
                temp += reversed(route[i+1:j+1])
                temp += route[k+1:]
                dist = _distance(temp, distance_matrix)
                if dist < current_dist:
                    route = temp
                    current_dist = dist
    return route

class TSPSolver(object):
    Uses a Circular Self-Organizing Map with a Gaussian Neighbourhood
    function and linearly decreasing learning rate, to solve TSP.
    #To check if the SOM has been trained
    _trained = False
    def __init__(self, input_vects, n_iterations=200):
        Initializes all necessary components of the TensorFlow

        'input_vects' should be a list of 1-D NumPy vectors denoting
        location vectors for the cities.
        'n_iterations' is an integer denoting the number of iterations
        that would be run while training the SOM.
        #Assign required variables first
        self._n = len(input_vects)
        dim = len(input_vects[0])
        self._n_centroids = self._n*3
        alpha = 0.3
        sigma = 0.3
        self._n_iterations = abs(int(n_iterations))

        #For initialization of the centroid(neuron) vectors
        init_values = []
        for vect in input_vects:
        init_mean = np.mean(init_values)
        init_dev = np.sqrt(np.var(init_values))
        self._graph = tf.Graph()
        with self._graph.as_default():
            #Randomly initialized weightage vectors for all neurons,
            #stored together as a matrix Variable of size [self._n, dim]
            self._weightage_vects = tf.Variable(tf.random_normal(
                [self._n_centroids, dim], init_mean, init_dev))
            #Matrix of size [self._n, 2] for SOM grid locations
            #of neurons
            self._location_vects = tf.constant(np.array(
            #We need to assign them as attributes to self, since they
            #will be fed in during training
            #The training vector
            self._vect_input = tf.placeholder("float", [dim])
            #Iteration number
            self._iter_input = tf.placeholder("float")
            #Only the final, 'root' training op needs to be assigned as
            #an attribute to self, since all the rest will be executed
            #automatically during training
            #To compute the Best Matching Unit given a vector
            #Basically calculates the Euclidean distance between every
            #neuron's weightage vector and the input, and returns the
            #index of the neuron which gives the least value
            bmu_index = tf.argmin(tf.sqrt(tf.reduce_sum(
                tf.pow(tf.sub(self._weightage_vects, tf.pack(
                    [self._vect_input for i in range(
                        self._n_centroids)])), 2), 1)),
            #This will extract the location of the BMU based on the BMU's
            slice_input = tf.pad(tf.reshape(bmu_index, [1]),
                                 np.array([[0, 1]]))
            bmu_loc = tf.reshape(tf.slice(self._location_vects, slice_input,
                                          tf.constant(np.array([1, 2]))),
            #To compute the alpha and sigma values based on iteration
            learning_rate_op = tf.sub(1.0, tf.div(self._iter_input,
            _alpha_op = tf.mul(alpha, learning_rate_op)
            _sigma_op = tf.mul(sigma, learning_rate_op)
            #Construct the op that will generate a vector with learning
            #rates for all neurons, based on iteration number and location
            #wrt BMU.
            bmu_distance_squares = tf.reduce_sum(tf.pow(tf.sub(
                self._location_vects, tf.pack(
                    [bmu_loc for i in range(self._n_centroids)])), 2), 1)
            neighbourhood_func = tf.exp(tf.neg(tf.div(tf.cast(
                bmu_distance_squares, "float32"), tf.pow(_sigma_op, 2))))
            learning_rate_op = tf.mul(_alpha_op, neighbourhood_func)
            #Finally, the op that will use learning_rate_op to update
            #the weightage vectors of all neurons based on a particular
            learning_rate_multiplier = tf.pack([tf.tile(tf.slice(
                learning_rate_op, np.array([i]), np.array([1])), [dim])
                                               for i in range(
            weightage_delta = tf.mul(
                tf.sub(tf.pack([self._vect_input for i in range(
            new_weightages_op = tf.add(self._weightage_vects,
            self._training_op = tf.assign(self._weightage_vects,
            self._sess = tf.Session()
            init_op = tf.initialize_all_variables()

            #Train the Solver

    def solution(self):
        Returns the TSP solution order as a list, and the total
        Euclidean distance with respect to the input vectors.
        if not self._trained:
            raise ValueError("SOM not trained yet")
        return self._sol, self._sol_dist

    def _neuron_locations(cls, n_centroids):
        Yields one by one the 2-D locations of the individual neurons
        in the SOM.

        for i in range(n_centroids):
            yield np.array([np.cos(i*2*np.pi/float(n_centroids)),
    def train(self, input_vects):
        Trains the SOM.
        'input_vects' should be a list of 1-D NumPy arrays with
        dimensionality as provided during initialization of this SOM.
        The points in the solution are enumerated as they are ordered
        in 'input_vects'.
        Current weightage vectors for all neurons(initially random) are
        taken as starting conditions for training.
        #Training iterations
        for iter_no in range(self._n_iterations):
            #Train with each vector one by one
            for input_vect in input_vects:
                               feed_dict={self._vect_input: input_vect,
                                          self._iter_input: iter_no})
        self._input_vects = input_vects
        self._trained = True
    def _compute_solution(self):
        Computes the solution to the TSP with respect to the input
        #TODO: All nearest-neighbour searches could be speeded up
        #significantly with a KD-Tree/Ball Tree.

        #Get the centroid data 
        centroid_vects = list(self._sess.run(self._weightage_vects))
        centroid_locations = list(self._sess.run(self._location_vects))

        #Distance matrix mapping input point number to list of distances
        #from centroids
        distances = {}

        for point in range(self._n):
            distances[point] = []
            for centroid in range(self._n_centroids):
                    np.linalg.norm(centroid_vects[centroid] -

        #Distance matrix mapping input point number to list of distances
        #from other input points
        point_distances = {}

        for point in range(self._n):
            point_distances[point] = []
            for other_point in range(self._n):
                    np.linalg.norm(self._input_vects[other_point] -

        #Compute angle with respect to each city(point)
        point_angles = {}
        for point in range(self._n):
            total_vect = 0
            cents = [j for j in range(self._n_centroids)]
            cents.sort(key=lambda x: distances[point][x])
            for centroid in cents[:2]:
                total_vect += (1.0/(distances[point][centroid]) *
            total_vect = total_vect/np.linalg.norm(total_vect)
            if total_vect[0] > 0 and total_vect[1] > 0:
                angle = np.arcsin(total_vect[1])
            elif total_vect[0] < 0 and total_vect[1] > 0:
                angle = np.arccos(total_vect[0])
            elif total_vect[0] < 0 and total_vect[1] < 0:
                angle = np.pi - np.arcsin(total_vect[1])
                angle = 2*np.pi - np.arccos(total_vect[0])
            point_angles[point] = angle

        #Find the rough solution
        tsp_solution = [i for i in range(self._n)]
        tsp_solution.sort(key=lambda x: point_angles[x])

        tsp_solution = _3_opt(tsp_solution, point_distances)

        #Compute the total distance for the solution
        total_distance = _distance(tsp_solution, point_distances)

        self._sol = tsp_solution
        self._sol_dist = total_distance


Each of these problem datasets can be found at the source here. All the presented results are over 10 runs. The max runtime was experienced with krA100, around one minute.

Berlin-52: Best Error: 0.0%, Average Error: 4.1%.

krA100: Best Error: 1.6%, Average Error: 2.7%.

eli76: Best Error: 2%, Average Error: 4.4%.

These results aren’t ‘close’ to those provided by ACOs, but they are practically feasible, and show a good promise (Oh now I just sound like I am writing a research paper.) Plus, the algorithm isn’t a time-hogger either (judging by runtimes on my laptop).

Anyways. Thanks for reading! Do drop a comment if you think of an improvement over this work.

Backpropagation for dummies

Backpropagation is a method of training an Artificial Neural Network. If you are reading this post, you already have an idea of what an ANN is. However, lets take a look at the fundamental component of an ANN- the artificial neuron.


The figure shows the working of the ith neuron (lets call it N_i) in an ANN. Its output (also called its activation) is a_i. The jth neuron N_j provides one of the inputs to N_i.

How does N_i produce its own activation?

1. N_i has stored in itself, a weightage assigned to each of its inputs. Lets say that the weightage assigned by N_i to N_j is w_{i, j}. As the first step, N_i computes a weighted sum of all its inputs. Lets call it in_i. Therefore,

in_i = \sum_{k \in Inputs(N_i)} w_{i, k}a_k … (1)

where Inputs(N_i) corresponds to all neurons that provide an input to N_i.

2. This sum, called the input function, is then passed on to N_i‘s activation function. It is called so, because it computes the activation that will be provided by N_i to the units it in turn acts as an input to. Therefore,

a_i = g(in_i) … (2)

Usually, the activation functions of all neurons in an ANN are the same, so we just write g instead of g_i.

There are many options for g, such as linear functions, rectified linear functions, hyperbolic tan, etc. For the sake of this tutorial, we will use the sigmoid function (I’ll explain why later). So we have

g(x) = \frac{1}{1 + e^{-x}} … (3)


Thus, every neuron in an ANN – except the ones in the lowest-level input layer, and bias neurons – takes in the activation of all its inputs, and provides its own activation as an output.

Once a neural network has been designed, it is the job of the training process to ensure that the individual weightages w_{i, j} given by each neuron to each of its inputs is set just right, so that the whole network gives the right output. Backpropagation is one of the ways to optimize those weights.

Whenever we talk of optimization, it is necessary to properly define what we are optimizing and how we are doing it.

What are we trying to optimize?

The error function. This function defines how much the output of our model (in this case an ANN) defers from the required output. For the sake of this tutorial, we will use the Mean-Squared Error(MSE) function. For example, let N_o be one of the output neurons of our ANN. For a given training sample, let the expected output be t_o. However, N_o provides an output equal to its activation value, that is a_o. Therefore, the error as defined by MSE would be

E = \frac{1}{2} (t_o - a_o)^2 … (4)

Since a_o by its definition depends on the weightages given by N_o to its inputs, the ultimate valye of E depends on those weightages too.

It is important to remember that every training sample will result in a different error value. It is backpropagation’s goal to minimize (bring as close to zero as possible) the value of E for all training samples on an average. The way this is done, as mentioned before, is by fine-tuning the values of the weights w_{i, j}.

But how does backpropagation fine tune these weightage values? By using a technique called Gradient Descent.

What is Gradient Descent?

Consider a function as follows:

f(x) = 3x + 4

The gradient of the function f with respect to the variable x defines how much the value of f will change with a unit increase/decrease in the value of x. For the above example, the answer to that is 3. Basically, you are differentiating f with respect to x. Generally speaking, a function may be dependent on a variety of parameters, so you take a partial derivative when computing the gradient.

Consider you are on a hill like the one shown in the image below:

Gradient_descent_1The red regions correspond to places of higher function value, while the blue regions correspond to low values. You want to reach the lowest point possible. If you are standing on the red hill, and you can only see a small part of the terrain around you, what will you do? You will naturally take steps in the direction in which the terrain slopes down. With each step you take, you will scan your immediate neighbourhood (which is visible to you), and go in the direction that shows the steepest descent.

In mathematical terms, this ‘direction’ is nothing but the gradient of your height (the target function you are trying to minimize), with respect to your location (x, y) (which are the parameters you are fine tuning). The overall direction in which you move, will be dependent on the gradients with respect to both dimensions, i.e. \frac{\partial height}{\partial x} and \frac{\partial height}{\partial y} . That’s pretty much what gradient descent does.

Its called so, because you are descending from a higher value of the target function to a lower value, by following the direction that corresponds to the lowest gradient (and consequently the steepest decrease in function value).

One thing to note is that you would always want to move in the direction shown by the negation of the gradient. Consider the above example of f(x) = 3x + 4 . The gradient here is 3, which is positive. So, with every unit increase in x, f(x) would increase by 3. Therefore, to reduce f(x), you would decrease the value of x . The opposite is true in scenarios where you want to maximize the target function. In those cases, you move in the direction of the gradient itself.

Coming back to the context of ANN training, backpropagation reduces the error function E by taking help of the gradients with respect to individual weights, \frac{\partial E}{\partial w_{i, j}} .

How does backpropagation employ gradient descent?

First off, it is necessary to keep in mind that backpropagation tries to optimize the inter-neuron weights, and not the neurons themselves. Their functionality is pretty much decided when you decide the activation function.

Secondly, the error function value E depends on the current values of the weights and the training sample in question, apart from the definition of the activation function. So to compute the error with respect to a training sample, you first feed it as an input to the network. You will then receive the appropriate output(s) from the output neuron(s). This part is known as  ‘forward propagation‘, since you are going from the input layer to the output layer (through the hidden layers if any). If you have multiple output neurons, the error values corresponding to each will affect the weights appropriately.

Then, you use the network output(s) along with the expected value(s) to calculate the errors using Equation 4 given above. Once the error values are known, the weights are updated using the gradients. This updating occurs first for the out-most weights, and then for lower layers. The intuition behind this, is the fact that the errors with respect to the outmost neurons are known first, and you derive the errors of the inner neurons from them. Each layer’s errors are calculated from the errors of the neurons they provide input to. That is why the second phase is called ‘backward propagation‘, or back-propagation.

So as a first step, lets try to understand how we will calculate the gradient \frac{\partial E}{\partial w_{i, j}} for a given training sample. Remember that  w_{i, j} is the weightage given by N_i to N_j.

Using the chain rule in Leibniz’s form,

\frac{\partial E}{\partial w_{i, j}} = \frac{\partial E}{\partial in_i} \frac{\partial in_i}{\partial w_{i, j}} … (5)

From Equations 1 and 5, and using basic differentiation,

\frac{\partial E}{\partial w_{i, j}} = a_j \frac{\partial E}{\partial in_i} … (6)

From the basic definition of the activation function (and the appropriate current weight values), you already know the value of a_j. Now consider the second term in the above product. Using the chain rule again on it, we get

\frac{\partial E}{\partial in_i} = \frac{\partial E}{\partial a_i} \frac{\partial a_i}{\partial in_i} … (7)

This is where the whole advantage of using the sigmoid activation function comes into play. The derivative of a sigmoid function with respect to its input, corresponds to a nice expression. Given equation 3, you can easily verify that

\frac{dg}{dx} = g(x)(1 - g(x))  … (8)

From Equations 7 and 8,

\frac{\partial E}{\partial in_i} = a_i (1 - a_i) \frac{\partial E}{\partial a_i} … (9)

Putting the information in Equations 6 and 9 together,

\frac{\partial E}{\partial w_{i, j}} = a_j a_i (1 - a_i) \frac{\partial E}{\partial a_i} … (10)

Look at the last term. There are two possible cases for N_i (N_i cannot be an input neuron, since it is receiving input from another neuron) :

I. N_i is an output neuron. In this case, the second term on the RHS is pretty simple, from Equation 4-

\frac{\partial E}{\partial a_i} = - (t_i - a_i) … (11)

where t_i is the expected output from N_i (for the current training sample).

2. N_i is an inner neuron. Consider all neurons N_k, for whom N_i acts an an input. Since we are propagating backward, you can therefore assume that we know the values \frac{\partial E}{\partial a_k} for all of them. In fact, since the output of N_i affects the outputs of all these N_k s, the expression for  \frac{\partial E}{\partial a_i} will involve all of them.

Using simple chain rule,

\frac{\partial E}{\partial a_i} = \sum_{k} \frac{\partial in_k}{\partial a_i} \frac{\partial E}{\partial in_k} … (12)

From Equations 12, 1 and 9,

\frac{\partial E}{\partial a_i} = \sum_{k} w_{k, i} a_k (1 - a_k) \frac{\partial E}{\partial a_k} … (13)

Equations 10, 11 and 13 pretty much define the gradient values \frac{\partial E}{\partial w_{i, j}} for all weights in a given ANN. So how are the weights actually updated?

Using the following rule:

The change caused by a given training sample, at time instance ‘t’, for the weight w_{i, j} is given by

\Delta w_{i, j}^{t} = - \epsilon \frac{\partial E}{\partial w_{i, j}} + \alpha \Delta w_{i, j}^{t - 1}

Remember me mentioning that we always need to move in the direction of the gradient’s negation? Thats why the minus sign for the first term. The value \epsilon is called the learning rate. We should never be too hasty in changing the weight’s value too drastically based on the gradient with respect to one training sample. Therefore, the gradient value is multiplied by the learning rate (usually well below 0.5) before causing the required change in the weightage value. If the learning rate is too low, the algorithm takes too long to train. On the other hand, if its too high, the algorithm keeps ‘bouncing around’ in the search space. Imagine you are running towards point B from point A. But you run so fast that you cant stop in time at point B, but you actually end up far ahead of it. Then you again run in the opposite direction towards B, once again crossing it and stopping ahead because of your high speed. Thats pretty much what happens if your learning rate is too large.

A variant of backpropagation, called resilient propagation(RPROP) improves on this classic method by having custom deltas with respect to each weightage in a network. It only uses the gradients as references to determine which direction to move in. Therefore, you don’t have to provide the learning rate as a parameter to resilient propagation.

Now look at the second term in the RHS above. The value \alpha is called the momentum. Sometimes, gradient descent tends to get stuck in local optima in the search space. A local optimum is a point where the value is low compared to the region around it, but not all of the domain overall. The momentum term causes the current weight update to be dependent on the last weight update (and therefore not be drastically different), thus preventing getting stuck in local optima in most cases. If it is indeed a global optimum, the search will eventually come back to the required point as the momentum will go on reducing. Nesterov Momentum is another way of optimizing the use of momentum in gradient descent.

How frequently are the weights updated?

There are 3 answers to this:

1. Online Mode

In this mode, the weights are updated after every training sample. In other words, the forward propagation as well as the backward propagation phases are run for each training sample, every single time. This mode is appropriate if you are receiving your training data sequentially. However, if your training set is large, this mode of training can be pretty time-consuming. Moreover, it is not appropriate to update weights after every sample (in most cases), due to the possible presence of outliers.

2. Batch Mode

In this mode, the weights are updated after accumulating gradients with respect to all samples in the training set. In other words, the forward propagation is run with respect to all samples, and the backward phase is run just once with respect to the accumulated results of the entire batch. This is done for multiple iterations (obviously). The batch mode of training is thus more resistant to variance in the training dataset.

3. Stochastic Mode

The stochastic mode adds randomization to the mix. In the mini-batch method of stochastic learning, small sets of samples are chosen randomly from the dataset, and training is done in batch mode over these mini-batches. Stochastic mode with mini-batches provides a nice trade-off between the pros and cons of online and batch modes of training. Moreover, since you are randomly choosing the mini batches (instead of organizing your dataset into pre-defined subsets), you avoid getting stuck in local optima. This is mainly because you are providing weight updates with respect to a potentially different subset of training samples during every iteration.

EDIT 14/01/2016

Xavier’s initialization

Usually, the weights in a Neural Network are initialized to random values prior to training. This is okay if your network is shallow. But for deep learning applications, where the number of hidden layers may be 3 or more, a wrong selection of initial weights can cause convergence to sub-optimal models.

For example, assume that you are using a sigmoid activation function and your initial weights are pretty small (in the 0-1 range). If you manually calculate the activation values in a neural network across 1-2 layers, you will realize that almost all neurons in the topmost layer give activations around 0.5 (the mean value for the sigmoid function). This phenomenon is called ‘saturation’. Its described well in this paper by Glorot and Bengio.

The solution to this problem, is Xavier’s initialization.  This technique tries to optimize the generation of initial weights, by applying simple statistics over the number of input and output connections corresponding to a neuron. A good intuitive explanation of the method is given at Andy’s blog. As a note to my future self, here’s the TensorFlow implementation of Xavier’s algorithm.

Self-Organizing Maps with Google’s TensorFlow

[This post assumes that you know the basics of Google’s TensorFlow library. If you don’t, have a look at my earlier post to get started.]


A Self-Organizing Map, or SOM, falls under the rare domain of unsupervised learning in Neural Networks. Its essentially a grid of neurons, each denoting one cluster learned during training. Traditionally speaking, there is no concept of neuron ‘locations’ in ANNs. However, in an SOM, each neuron has a location, and neurons that lie close to each other represent clusters with similar properties. Each neuron has a weightage vector, which is equal to the centroid of its particular cluster.

AI-Junkie’s post does a great job of explaining how an SOM is trained, so I won’t re-invent the wheel.

The Code

Here’s my code for a 2-D version of an SOM. Its written with TensorFlow as its core training architecture: (Its heavily commented, so look at the inline docs if you want to hack/dig around)

import tensorflow as tf
import numpy as np

class SOM(object):
    2-D Self-Organizing Map with Gaussian Neighbourhood function
    and linearly decreasing learning rate.

    #To check if the SOM has been trained
    _trained = False

    def __init__(self, m, n, dim, n_iterations=100, alpha=None, sigma=None):
        Initializes all necessary components of the TensorFlow

        m X n are the dimensions of the SOM. 'n_iterations' should
        should be an integer denoting the number of iterations undergone
        while training.
        'dim' is the dimensionality of the training inputs.
        'alpha' is a number denoting the initial time(iteration no)-based
        learning rate. Default value is 0.3
        'sigma' is the the initial neighbourhood value, denoting
        the radius of influence of the BMU while training. By default, its
        taken to be half of max(m, n).

        #Assign required variables first
        self._m = m
        self._n = n
        if alpha is None:
            alpha = 0.3
            alpha = float(alpha)
        if sigma is None:
            sigma = max(m, n) / 2.0
            sigma = float(sigma)
        self._n_iterations = abs(int(n_iterations))

        self._graph = tf.Graph()

        with self._graph.as_default():


            #Randomly initialized weightage vectors for all neurons,
            #stored together as a matrix Variable of size [m*n, dim]
            self._weightage_vects = tf.Variable(tf.random_normal(
                [m*n, dim]))

            #Matrix of size [m*n, 2] for SOM grid locations
            #of neurons
            self._location_vects = tf.constant(np.array(
                list(self._neuron_locations(m, n))))

            #We need to assign them as attributes to self, since they
            #will be fed in during training

            #The training vector
            self._vect_input = tf.placeholder("float", [dim])
            #Iteration number
            self._iter_input = tf.placeholder("float")

            #Only the final, 'root' training op needs to be assigned as
            #an attribute to self, since all the rest will be executed
            #automatically during training

            #To compute the Best Matching Unit given a vector
            #Basically calculates the Euclidean distance between every
            #neuron's weightage vector and the input, and returns the
            #index of the neuron which gives the least value
            bmu_index = tf.argmin(tf.sqrt(tf.reduce_sum(
                tf.pow(tf.sub(self._weightage_vects, tf.pack(
                    [self._vect_input for i in range(m*n)])), 2), 1)),

            #This will extract the location of the BMU based on the BMU's
            slice_input = tf.pad(tf.reshape(bmu_index, [1]),
                                 np.array([[0, 1]]))
            bmu_loc = tf.reshape(tf.slice(self._location_vects, slice_input,
                                          tf.constant(np.array([1, 2]))),

            #To compute the alpha and sigma values based on iteration
            learning_rate_op = tf.sub(1.0, tf.div(self._iter_input,
            _alpha_op = tf.mul(alpha, learning_rate_op)
            _sigma_op = tf.mul(sigma, learning_rate_op)

            #Construct the op that will generate a vector with learning
            #rates for all neurons, based on iteration number and location
            #wrt BMU.
            bmu_distance_squares = tf.reduce_sum(tf.pow(tf.sub(
                self._location_vects, tf.pack(
                    [bmu_loc for i in range(m*n)])), 2), 1)
            neighbourhood_func = tf.exp(tf.neg(tf.div(tf.cast(
                bmu_distance_squares, "float32"), tf.pow(_sigma_op, 2))))
            learning_rate_op = tf.mul(_alpha_op, neighbourhood_func)

            #Finally, the op that will use learning_rate_op to update
            #the weightage vectors of all neurons based on a particular
            learning_rate_multiplier = tf.pack([tf.tile(tf.slice(
                learning_rate_op, np.array([i]), np.array([1])), [dim])
                                               for i in range(m*n)])
            weightage_delta = tf.mul(
                tf.sub(tf.pack([self._vect_input for i in range(m*n)]),
            new_weightages_op = tf.add(self._weightage_vects,
            self._training_op = tf.assign(self._weightage_vects,

            self._sess = tf.Session()

            init_op = tf.initialize_all_variables()

    def _neuron_locations(self, m, n):
        Yields one by one the 2-D locations of the individual neurons
        in the SOM.
        #Nested iterations over both dimensions
        #to generate all 2-D locations in the map
        for i in range(m):
            for j in range(n):
                yield np.array([i, j])

    def train(self, input_vects):
        Trains the SOM.
        'input_vects' should be an iterable of 1-D NumPy arrays with
        dimensionality as provided during initialization of this SOM.
        Current weightage vectors for all neurons(initially random) are
        taken as starting conditions for training.

        #Training iterations
        for iter_no in range(self._n_iterations):
            #Train with each vector one by one
            for input_vect in input_vects:
                               feed_dict={self._vect_input: input_vect,
                                          self._iter_input: iter_no})

        #Store a centroid grid for easy retrieval later on
        centroid_grid = [[] for i in range(self._m)]
        self._weightages = list(self._sess.run(self._weightage_vects))
        self._locations = list(self._sess.run(self._location_vects))
        for i, loc in enumerate(self._locations):
        self._centroid_grid = centroid_grid

        self._trained = True

    def get_centroids(self):
        Returns a list of 'm' lists, with each inner list containing
        the 'n' corresponding centroid locations as 1-D NumPy arrays.
        if not self._trained:
            raise ValueError("SOM not trained yet")
        return self._centroid_grid

    def map_vects(self, input_vects):
        Maps each input vector to the relevant neuron in the SOM
        'input_vects' should be an iterable of 1-D NumPy arrays with
        dimensionality as provided during initialization of this SOM.
        Returns a list of 1-D NumPy arrays containing (row, column)
        info for each input vector(in the same order), corresponding
        to mapped neuron.

        if not self._trained:
            raise ValueError("SOM not trained yet")

        to_return = []
        for vect in input_vects:
            min_index = min([i for i in range(len(self._weightages))],
                            key=lambda x: np.linalg.norm(vect-

        return to_return

A few points about the code:

1) Since my post on K-Means Clustering, I have gotten more comfortable with matrix operations in TensorFlow. You need to be comfortable with matrices if you want to work with TensorFlow (or any data flow infrastructure for that matter, even SciPy). You can code pretty much any logic or operational flow with TensorFlow, you just need to be able to build up complex functionality from basic components(ops), and structure the flow of data(tensors/variables) well.

2) It took quite a while for me to build the whole graph in such a way that the entire training functionality could be enclosed in a single op. This op is called during each iteration, for every vector, during training. Such an implementation is more in line with TensorFlow’s way of doing things, than my previous attempt with clustering.

3) I have used a 2-D grid for the SOM, you can use any geometry you wish. You would just have to modify the _neuron_locations method appropriately, and also the method that returns the centroid outputs. You could return a dict that maps neuron location to the corresponding cluster centroid.

4) To keep things simple, I haven’t provided for online training. You could do that by having bounds for the learning rate(s).

Sample Usage

I have used PyMVPA’s example of RGB colours to confirm that the code does work. PyMVPA provides functionality to train SOMs too (along with many other learning techniques).

Here’s how you would do it with my code:

#For plotting the images
from matplotlib import pyplot as plt

#Training inputs for RGBcolors
colors = np.array(
     [[0., 0., 0.],
      [0., 0., 1.],
      [0., 0., 0.5],
      [0.125, 0.529, 1.0],
      [0.33, 0.4, 0.67],
      [0.6, 0.5, 1.0],
      [0., 1., 0.],
      [1., 0., 0.],
      [0., 1., 1.],
      [1., 0., 1.],
      [1., 1., 0.],
      [1., 1., 1.],
      [.33, .33, .33],
      [.5, .5, .5],
      [.66, .66, .66]])
color_names = \
    ['black', 'blue', 'darkblue', 'skyblue',
     'greyblue', 'lilac', 'green', 'red',
     'cyan', 'violet', 'yellow', 'white',
     'darkgrey', 'mediumgrey', 'lightgrey']

#Train a 20x30 SOM with 400 iterations
som = SOM(20, 30, 3, 400)

#Get output grid
image_grid = som.get_centroids()

#Map colours to their closest neurons
mapped = som.map_vects(colors)

plt.title('Color SOM')
for i, m in enumerate(mapped):
    plt.text(m[1], m[0], color_names[i], ha='center', va='center',
             bbox=dict(facecolor='white', alpha=0.5, lw=0))

Here’s a sample of the output you would get (varies each time you train, but the color names should go to the correct locations in the image):


Recursively copying elements from one Graph to another in TensorFlow

In my previous post on Google’s TensorFlow, I had mentioned the idea of using the library for Genetic Programming applications. In my free time, I tried to map out how such an application would be run. The focus obviously wasn’t on building the smartest way to do GP, but rather on exploring if it was practically possible. One of the ideas that stuck out, was to have a Graph for each candidate solution – which would be populated based on cross-overed elements from the parent solutions’ Graphs. This would require a good API to copy computational elements from one Graph to another in TensorFlow. I tried digging around to see if such functionality was available, but couldn’t find any (atleast from a good exploring of the github repo).

I wrote some rudimentary code to accomplish this, and heres a basic outline of how it works:

Consider the example given on TensorFlow’s Get Started page.

import tensorflow as tf
import numpy as np

# Make 100 phony data points in NumPy.
x_data = np.float32(np.random.rand(2, 100)) # Random input
y_data = np.dot([0.100, 0.200], x_data) + 0.300

# Construct a linear model.
b = tf.Variable(tf.zeros([1]))
W = tf.Variable(tf.random_uniform([1, 2], -1.0, 1.0))
y = tf.matmul(W, x_data) + b

# Minimize the squared errors.
loss = tf.reduce_mean(tf.square(y - y_data))
optimizer = tf.train.GradientDescentOptimizer(0.5)
train = optimizer.minimize(loss)

# For initializing the variables.
init = tf.initialize_all_variables()

# Launch the graph
sess = tf.Session()

# Fit the plane.
for step in xrange(0, 201):
    if step % 20 == 0:
        print step, sess.run(W), sess.run(b)

# Learns best fit is W: [[0.100  0.200]], b: [0.300]

Lets say you want to copy the main training element train to another graph as defined here:

to_graph = tf.Graph()


1) You first decide on a namespace inside which all the copied elements will exist in to_graph. This is not really required if the Graph you are copying to is empty. But its important to remember that element names matter a lot in TensorFlow’s workings. Therefore, to avoid naming conflicts, its better to define such a namespace. So basically, what would be called “C” in to_graph‘s namespace, would now be called “N/C” where “N” is the namespace String.

Lets assume the namespace we define the copied elements in our example to, is “CopiedOps”.

namespace = "CopiedOps"


2) You then copy all the variables first, one by one, using a dedicated function copy_variable_to_graph. Each time, you supply the original instance, the target graph (to_graph in this case), the namespace, and an extra dictionary called copied_variables. This dictionary is useful while copying the computational nodes (Operation instances, we will call them ops) in the next step. Since Variable instances act as inputs for ops, we need a way to keep track of them for later.

I initially wanted to combine this initialization of variables with the function that copies the computational elements, but I found it really tricky to capture the appropriate Variable instances based on their connections to ops. Anyways, since variables are more like parameter-storing units whose values are needed frequently, its better to initialize them separately.

The Variable instances in the above example are b and W. Heres how you would do it with my code:

copied_variables = {}
b1 = copy_variable_to_graph(b, to_graph, namespace, copied_variables)
W1 = copy_variable_to_graph(W, to_graph, namespace, copied_variables)

Ofcourse, if your code has a lot of variables, you could just store them all in a list and run the above function over all of them with a common dictionary for copied variables.


3) You then recursively copy all the computational nodes (ops, Placeholders) to the other graph. Now heres the nice things about the method- You only need to do it for the topmost node computational node. All connected inputs and Tensors are automatically taken care of!

For the above example, the train object constructed on line 16 is the ‘topmost’ node. So copying the whole learner is as simple as:

train_copy = copy_to_graph(train, to_graph, copied_variables, namespace)

Thats it! The other instances like y, optimizer, loss are automatically replicated in to_graph.

Theres also a helper function in case you want to find the equivalent of an element from the original graph, in to_graph:

loss_copy = get_copied(loss, to_graph, copied_variables, namespace)


4) You can now run the new node in to_graph. Remember to initialize a new Session instance linked to the graph, and initialize all Variables. So heres how you would go about it:

with to_graph.as_default():
    init1 = tf.initialize_all_variables()
    sess1 = tf.Session()

    for step in xrange(0, 201):
        if step % 20 == 0:
            print step, sess1.run(W1), sess1.run(b1)

This provides an output similar to what you would get from the Get Started original example.

The Code

import tensorflow as tf
from tensorflow.python.framework import ops
from copy import deepcopy

def copy_variable_to_graph(org_instance, to_graph, namespace,
    Copies the Variable instance 'org_instance' into the graph
    'to_graph', under the given namespace.
    The dict 'copied_variables', if provided, will be updated with
    mapping the new variable's name to the instance.

    if not isinstance(org_instance, tf.Variable):
        raise TypeError(str(org_instance) + " is not a Variable")

    #The name of the new variable
    if namespace != '':
        new_name = (namespace + '/' +
        new_name = org_instance.name[:org_instance.name.index(':')]

    #Get the collections that the new instance needs to be added to.
    #The new collections will also be a part of the given namespace,
    #except the special ones required for variable initialization and
    collections = []
    for name, collection in org_instance.graph._collections.items():
        if org_instance in collection:
            if (name == ops.GraphKeys.VARIABLES or
                name == ops.GraphKeys.TRAINABLE_VARIABLES or
                namespace == ''):
                collections.append(namespace + '/' + name)

    #See if its trainable.
    trainable = (org_instance in org_instance.graph.get_collection(
    #Get the initial value
    with org_instance.graph.as_default():
        temp_session = tf.Session()
        init_value = temp_session.run(org_instance.initialized_value())

    #Initialize the new variable
    with to_graph.as_default():
        new_var = tf.Variable(init_value,

    #Add to the copied_variables dict
    copied_variables[new_var.name] = new_var

    return new_var

def copy_to_graph(org_instance, to_graph, copied_variables={}, namespace=""):
    Makes a copy of the Operation/Tensor instance 'org_instance'
    for the graph 'to_graph', recursively. Therefore, all required
    structures linked to org_instance will be automatically copied.
    'copied_variables' should be a dict mapping pertinent copied variable
    names to the copied instances.
    The new instances are automatically inserted into the given 'namespace'.
    If namespace='', it is inserted into the graph's global namespace.
    However, to avoid naming conflicts, its better to provide a namespace.
    If the instance(s) happens to be a part of collection(s), they are
    are added to the appropriate collections in to_graph as well.
    For example, for collection 'C' which the instance happens to be a
    part of, given a namespace 'N', the new instance will be a part of
    'N/C' in to_graph.

    Returns the corresponding instance with respect to to_graph.

    TODO: Order of insertion into collections is not preserved

    #The name of the new instance
    if namespace != '':
        new_name = namespace + '/' + org_instance.name
        new_name = org_instance.name

    #If a variable by the new name already exists, return the
    #correspondng tensor that will act as an input
    if new_name in copied_variables:
        return to_graph.get_tensor_by_name(

    #If an instance of the same name exists, return appropriately
        already_present = to_graph.as_graph_element(new_name,
        return already_present

    #Get the collections that the new instance needs to be added to.
    #The new collections will also be a part of the given namespace.
    collections = []
    for name, collection in org_instance.graph._collections.items():
        if org_instance in collection:
            if namespace == '':
                collections.append(namespace + '/' + name)
    #Take action based on the class of the instance

    if isinstance(org_instance, tf.python.framework.ops.Tensor):

        #If its a Tensor, it is one of the outputs of the underlying
        #op. Therefore, copy the op itself and return the appropriate
        op = org_instance.op
        new_op = copy_to_graph(op, to_graph, copied_variables, namespace)
        output_index = op.outputs.index(org_instance)
        new_tensor = new_op.outputs[output_index]
        #Add to collections if any
        for collection in collections:
            to_graph.add_to_collection(collection, new_tensor)

        return new_tensor

    elif isinstance(org_instance, tf.python.framework.ops.Operation):

        op = org_instance

        #If it has an original_op parameter, copy it
        if op._original_op is not None:
            new_original_op = copy_to_graph(op._original_op, to_graph,
                                            copied_variables, namespace)
            new_original_op = None

        #If it has control inputs, call this function recursively on each.
        new_control_inputs = [copy_to_graph(x, to_graph, copied_variables,
                              for x in op.control_inputs]

        #If it has inputs, call this function recursively on each.
        new_inputs = [copy_to_graph(x, to_graph, copied_variables,
                      for x in op.inputs]

        #Make a new node_def based on that of the original.
        #An instance of tensorflow.core.framework.graph_pb2.NodeDef, it
        #stores String-based info such as name, device and type of the op.
        #Unique to every Operation instance.
        new_node_def = deepcopy(op._node_def)
        #Change the name
        new_node_def.name = new_name

        #Copy the other inputs needed for initialization
        output_types = op._output_types[:]
        input_types = op._input_types[:]

        #Make a copy of the op_def too.
        #Its unique to every _type_ of Operation.
        op_def = deepcopy(op._op_def)

        #Initialize a new Operation instance
        new_op = tf.python.framework.ops.Operation(new_node_def,
        #Use Graph's hidden methods to add the op
        for device_function in reversed(to_graph._device_function_stack):

        return new_op

        raise TypeError("Could not copy instance: " + str(org_instance))

def get_copied(original, graph, copied_variables={}, namespace=""):
    Get a copy of the instance 'original', present in 'graph', under
    the given 'namespace'.
    'copied_variables' is a dict mapping pertinent variable names to the
    copy instances.

    #The name of the copied instance
    if namespace != '':
        new_name = namespace + '/' + original.name
        new_name = original.name

    #If a variable by the name already exists, return it
    if new_name in copied_variables:
        return copied_variables[new_name]

    return graph.as_graph_element(new_name, allow_tensor=True,

Working with feeding is pretty simple too:

>>> x = tf.placeholder("float")
>>> a = tf.constant(3, "float")
>>> y = tf.add(x, a)
>>> namespace = "CopiedOps"
>>> to_graph = tf.Graph()
>>> copied_variables = {}
>>> y1 = copy_to_graph(y, to_graph, namespace)
>>> x1 = get_copied(x, to_graph, namespace)
>>> with to_graph.as_default():
    sess = tf.Session()
    print sess.run(y1, feed_dict={x1: 5})


I guess thats all for my hacking around with TensorFlow for the week. If you intend to use this code, please note that it may not be perfect at doing what it says. I haven’t tried it out with all sorts of TensorFlow data structures as yet, so be open to getting an Exception or two that you may have to fix. Infact, do drop me a comment or mail so I can make this code as fool-proof as I can. Cheers!