# A (small) introduction to Boosting

#### What is Boosting?

Boosting is a machine learning meta-algorithm that aims to iteratively build an ensemble of weak learners, in an attempt to generate a strong overall model.

Lets look at the highlighted parts one-by-one:

1. Weak Learners: A ‘weak learner’ is any ML algorithm (for regression/classification) that provides an accuracy slightly better than random guessing. For example, consider a problem of binary classification with approximately 50% of samples belonging to each class. Random guessing in this case would yield an accuracy of around 50%. So a weak learner would be any algorithm, however simple, that slightly improves this score – say 51-55% or more. Usually, weak learners are pretty basic in nature. Some examples are Decision Stumps or smaller Decision Trees. In most cases, either one of these two is used as a weak learner in Boosting frameworks.

2. Ensemble: The overall model built by Boosting is a weighted sum of all of the weak learners. The weights and training given to each ensures that the overall model yields a pretty high accuracy (sometimes state-of-the-art).

3. Iteratively build: Many ensemble methods such as bagging/random forests are capable of training their components in parallel. This is because the training of any of those weak learners does not depend on the training of the others. That is not the case with Boosting. At each step, Boosting tries to evaluate the shortcomings of the overall model built till the previous step, and then generates a weak learner to battle these shortcomings. This weak learner is then added to the total model. Therefore, the training must necessarily proceed in a serial/iterative manner.

4. Meta-algorithm: Since Boosting isn’t necessarily an ML algorithm by itself, but rather uses other (basic) algorithms to build a stronger one, it is said to be a ‘meta’ algorithm.

#### How does Boosting work?

Usually, a Boosting framework for regression works as follows:

(The overall model at step ‘i’ is denoted by $F_i$.)

1. Start with a simple model $F_0$, usually predicting a common (average) value for all instances.

2. For $i$ from $1$ to $M$ do:

2.a) Evaluate the shortcomings of $F_{i - 1}$, in terms of a value $r_j$ for each element $x_j$ in the training set.
2.b) Generate a new weak learner $h_i(x)$ based on the $r_j$s.
2.c) Compute the corresponding weight $\gamma_i$.
2.d) Update the current model as follows:

$F_i = F_{i - 1} + \nu \gamma_i h_i(x)$

where $\nu$ denotes the learning rate of the framework.

3. Return $F_M$ as the overall output.

Each of the iterations in Boosting essentially tries to ‘improve’ the current model by introducing another learner into the ensemble. Having such an ensemble not only reduces the bias (which is generally pretty high for weak learners), but also the variance (since multiple learners contribute to the overall output, each with their own unique training). There are various versions of Boosting, and all of them primarily differ in the way they perform steps 2.a, 2.b. and 2.c.

For example, Gradient Boosting computes $r_j$ as the gradient of a Loss function (whose expression involves target output $y$ and model output till previous step, $F_{i - 1}$) with respect to $F_{i - 1}$ at data point $x_j$. The $i$th weak learner $h_i$ is then trained to predict $r_j$ from $x_j$. The weight $\gamma_i$ is then optimized so as to minimize the total Loss value with respect to $F_i$.

AdaBoost, short for Adaptive Boosting, works by assigning $r_j$ as a ‘weight’ to $x_j$ based on how miscalculated the model output is according to $F_{i - 1}$. $h_i$ and $\gamma_i$ are then trained to optimize the weighted sum of errors with respect to the output from $h_i$. AdaBoost is considered to be one of the best out-of-the-box classifiers today. Mathematically speaking, AdaBoost is a special case of Gradient Boosting.

The weak-learners used in Boosting are indeed ‘weak’. The most commonly used ones are decision trees of a fixed number of leaf nodes. Using complex methods such as SVMs in Boosting usually leads to overfitting. However, the total number of weak learners $M$ should also be controlled, mostly based on experimentation. Having a low $M$ will cause underfitting, but a very high value could lead to overfitting.

To improve the bias/variance characteristics of Boosting, bagging is generally employed. What this essentially does is train each weak learner on a subset of the overall dataset (rather than the whole training set). This causes a good decrease in variance and tends to reduce overfitting.

Classification applications would work slightly different from regression, of course. The most commonly used method is to build a boosted model for each class. This model would predict the probability of a data point belonging to that class. During training, the inputs belonging to the corresponding class will be given a target of 1, with the others getting a target of 0. The outputs from all the models could then be put into a softmax function for normalization.

#### What are the drawbacks of Boosting?

The most obvious drawback of boosting is the computational complexity. Performing so many iterations, and generating a new model at each, requires a lot of computations and time (and also space). The fact that the ensemble has to be built iteratively doesn’t help matters. However, using simple weak learners (instead of big decision trees) helps to mitigate this problem.

The biggest criticism for Boosting comes from its sensitivity to noisy data. Think of it this way. At each iteration, Boosting tries to improve the output for data points that haven’t been predicted well till now. If your dataset happens to have some (or a lot) of misclassified/outlier points, the meta-algorithm will try very hard to fit subsequent weak learners to these noisy samples. As a result, overfitting is likely to occur. The exponential loss function used in AdaBoost is particularly vulnerable to this issue (since the error from an outlier will get an exponentially-weighed importance for future learners).

# Simple Turing Machine using TensorFlow

A post after many days! This one describes a simple code I wrote to simulate a Deterministic Turing Machine using TensorFlow. Its pretty basic with respect to the design, using TensorFlow only for the state change computations. The movement on the tape, and handling of halts is taken care of by old-school Python. This is not intended to be a practical usage of TensorFlow, its just something I did and am writing about here :-).

To demonstrate the code, I will use the 3-state Busy Beaver example from the Wiki page.

#### Encoding the Problem

Each of the states are encoded as real numbers (integers in this case, but for whatever reasons you could use floats too). So ‘A’, ‘B’, ‘C’ and ‘HALT’ become 0, 1, 2 and 3 respectively. Accordingly, the initial and final states are defined as follows:

initial_state = 0
final_states = [3]


Since there can be multiple final states generally, final_states is defined as a list. As with the states, the potential symbols are numbers too. The blank symbol, as per the Busy Beaver problem definition, is 0.

blank_input = 0


Now we come to the transition function. It is defined using two 2-D NumPy matrices, one for the inputs and the other for the corresponding outputs.

import numpy as np

transition_input = np.array([[0, 0],
[0, 1],
[1, 0],
[1, 1],
[2, 0],
[2, 1]])

transition_output = np.array([[1, 1, 1],
[2, 1, -1],
[0, 1, -1],
[1, 1, 1],
[1, 1, -1],
[3, 1, 1]])


Consider the transition_input[2] and transition_output[2]. The input denotes [current state, current symbol]. The output denotes [next state, next symbol, movement]. The movement value can either be 1(move pointer right, or increment index value) or -1(move pointer left, or decrement index value). Accordingly, transition_input[2] requires the current state to be 1 and current symbol being read to be 0. If these conditions are met, then the next state and next symbol are 0 and 1 respectively, with the pointer on the tape moving one step to the left.

#### The Turing Machine Code

The actual Turing Machine code follows. Its well documented as always, so you shouldn’t have a problem following whats going on if you are well-acquainted with Turing Machines and TensorFlow.

import tensorflow as tf

class TuringMachine(object):
"""
Implementation of a basic Turing Machine using TensorFlow.
"""

def __init__(self, initial_state, final_states, blank_input,
transition_input, transition_output):
"""
Initializes the TensorFlow Graph required for running the Turing
Machine.

States and symbols should both be encoded as real numbers prior to
initialization. Accordingly, 'initial_state' should be a float
denoting the starting state. 'final_states' should be an iterable
of numbers, denoting all the states the Machine can halt at.
'blank_input' should be a float denoting the blank input for the
Machine.
'transition_input' and 'transition_output' will together denote
the transition function followed by the Machine. Both should be
2-D NumPy matrices. Each inner vector in 'transition_input' will
contain [current state, current symbol], while each inner vector
in 'transition_output' will have [next state, next symbol, move].
"move" can either be 1: to increase pointer index by 1 (move
right), or -1: decrease pointer index by 1 (move left).

"""

#Store the necessary information as attributes
self._initial_state = float(initial_state)
self._final_states = set(final_states)
self._blank_input = blank_input

#Initialize Graph
self._graph = tf.Graph()

n_transitions = len(transition_input)
if len(transition_input) != len(transition_output):
raise ValueError("Number of transition inputs and outputs " +
"not same")

#Initialize all components in the Graph
with self._graph.as_default():

#Placeholder for current symbol thats being read from the
#tape
self._current_symbol = tf.placeholder("float")

#Variable to hold current state
self._current_state = tf.Variable(self._initial_state)

#Constant Ops for transitions
transition_in = tf.constant(transition_input)
transition_out = tf.constant(transition_output)

#Generate Ops to compute transition
concat_in = tf.pack([self._current_state,
self._current_symbol])
packed_in = tf.pack([concat_in for i in
range(len(transition_input))])
transition_diff = tf.reduce_sum(
tf.pow(tf.sub(packed_in, transition_input), 2), 1)
#This will return a 0 if theres an accurate match for the
#transition condition (current state, current symbol). If not,
#this will return a non-zero value.
self._transition_match = tf.reduce_min(transition_diff, 0)
#This will correspond to the index of the match
match_index = tf.cast(tf.argmin(transition_diff, 0), "int32")
#Pick out the transition output
self._transition = tf.reshape(
tf.slice(transition_out,  tf.pack([match_index, 0]),
tf.pack([1, 3])), [3])
#This will change state Variable
self._state_change = tf.assign(
self._current_state, tf.reshape(
tf.cast(tf.slice(self._transition, tf.constant([0]),
tf.constant([1])), "float32"), []))

#This will reset the state Variable
self._state_reset = tf.assign(self._current_state,
self._initial_state)

#Initialize Session
self._sess = tf.Session()

#Initialize Variables
init_op = tf.initialize_all_variables()
self._sess.run(init_op)

def run(self, input_tape, start_index, max_iter=1000):
"""
Runs the Turing Machine with the given input.
'input_tape' must be an index-able object, with 'start_index'
denoting the index number to start computation at.

Obviously, beware of inputs that may cause the machine to run
indefinitely. To control this, set 'max_iter' to some number of
iterations you would want to allow.

Will return final state and final index. If Machine halts at a
non-final state, or goes beyond max_iter, i.e. the Machine does
not accept the input, will return None, None.
The input tape will be modified in-place.
"""

#First reset the state
self._sess.run(self._state_reset)

#Run the Machine
index = start_index
for i in range(max_iter):
#Figure out input to be given
if (index > -1 and len(input_tape) > index):
machine_input = input_tape[index]
else:
machine_input = self._blank_input

#Run one step of the Machine
_, match_value, transition = self._sess.run(
[self._state_change, self._transition_match,
self._transition],
feed_dict = {self._current_symbol: machine_input})

#If match_value is not zero, return None, None
if match_value != 0:
return None, None

#First modify tape contents at current index as per transition
if (index > -1 and len(input_tape) > index):
#Index is within input tape limits
input_tape[index] = transition[1]
elif index == len(input_tape):
#Index has gone beyond the range of the input tape,
#to the right
input_tape.append(transition[1])
else:
#Index has gone beyond the range of the input tape,
#to the left
input_tape.insert(0, transition[1])
index = 0

#Change index as per transition
index = index + transition[2]

#If its a final state, return state, index
if transition[0] in self._final_states:
return transition[0], index

return None, None



A few things:

1. If the tape ‘pointer’ goes beyond the limits of input_tape, the blank input is automatically given to the machine (lines 114-118). Moreover,  depending on which end of the tape the pointer has gone across, the output value from the machine is inserted into the list (lines 130-142).

2. Since there is no way to check if the Machine will stop, the run method lets the user provide a maximum number of iterations. If the machine keeps running beyond the limit, (None, None) is returned.

3. The input_tape list is modified in-place.

4. If you want to track the transitions the Turing Machine follows, add a ‘print’ statement (or some kind of logging) at line 125.

#### Running the Turing Machine

First we define the contents of the input tape, and the index to start at.

>>> input_tape = [0, 1, 1, 1, 1, 1]
>>> start_index = 0


Then, we initialize a TuringMachine instance with the algorithm as encoded above:

>>> machine = TuringMachine(initial_state, final_states, blank_input,
transition_input, transition_output)


Finally, we ‘run’ the machine, and check if the input tape has been modified correctly:

>>> machine.run(input_tape, start_index)
(3, 5)
>>> input_tape
[1, 1, 1, 1, 1, 1, 1]
>>> machine.run(input_tape, start_index)
(3, 9)
>>> input_tape
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


Thats it! This is my first implementation of a Turing Machine, so do leave a comment if there are any errors/improvements possible. Cheers!

# Predicting Trigonometric Waves few steps ahead with LSTMs in TensorFlow

I have recently been revisiting my study of Deep Learning, and I thought of doing some experiments with Wave prediction using LSTMs. This is nothing new, just more of a log of some tinkering done using TensorFlow.

### The Problem

The basic input to the model is a 2-D vector – each number corresponding to the value attained by the corresponding wave. Each wave in turn is: (a constant + a sine wave + a cosine wave). The waves themselves have different magnitudes, initial phases and frequencies. The goal is to predict the values that will be attained a certain (I chose 23) steps ahead on the curve.

So first off, heres the wave-generation code:


##Producing Training/Testing inputs+output
from numpy import array, sin, cos, pi
from random import random

#Random initial angles
angle1 = random()
angle2 = random()

#The total 2*pi cycle would be divided into 'frequency'
#number of steps
frequency1 = 300
frequency2 = 200
#This defines how many steps ahead we are trying to predict
lag = 23

def get_sample():
"""
Returns a [[sin value, cos value]] input.
"""
global angle1, angle2
angle1 += 2*pi/float(frequency1)
angle2 += 2*pi/float(frequency2)
angle1 %= 2*pi
angle2 %= 2*pi
return array([array([
5 + 5*sin(angle1) + 10*cos(angle2),
7 + 7*sin(angle2) + 14*cos(angle1)])])

sliding_window = []

for i in range(lag - 1):
sliding_window.append(get_sample())

def get_pair():
"""
Returns an (current, later) pair, where 'later' is 'lag'
steps ahead of the 'current' on the wave(s) as defined by the
frequency.
"""

global sliding_window
sliding_window.append(get_sample())
input_value = sliding_window[0]
output_value = sliding_window[-1]
sliding_window = sliding_window[1:]
return input_value, output_value



Essentially, you just need to call get_pair to get an ‘input, output’ pair – the output being 23 time intervals ahead on the curve. Each have the NumPy dimensionality of [1, 2]. The first value ‘1’ means that the batch size is 1 – we will feed one input at a time while training/testing.

Now, I don’t pass the input directly into the LSTM. I try to improve the LSTM’s understanding of the input, by providing its first and second derivatives as well. So, if the input at time t is x(t), the derivative is x'(t) = (x(t) – x(t-1)). Following the analogy, x”(t) = (x'(t) – x'(t-1)). Here’s the code for that:


#Input Params
input_dim = 2

#To maintain state
last_value = array([0 for i in range(input_dim)])
last_derivative = array([0 for i in range(input_dim)])

def get_total_input_output():
"""
Returns the overall Input and Output as required by the model.
The input is a concatenation of the wave values, their first and
second derivatives.
"""
global last_value, last_derivative
raw_i, raw_o = get_pair()
raw_i = raw_i[0]
l1 = list(raw_i)
derivative = raw_i - last_value
l2 = list(derivative)
last_value = raw_i
l3 = list(derivative - last_derivative)
last_derivative = derivative
return array([l1 + l2 + l3]), raw_o



So the overall input to the model becomes a concatenated version of x(t), x'(t), x”(t). The obvious question to ask would be- Why not do this in the TensorFlow Graph itself? I did try it, and for some reason (which I don’t understand yet), there seems to seep in some noise into the Variables that act as memory units to maintain state.

But anyways, here’s the code for that too:


#Imports
import tensorflow as tf
from tensorflow.models.rnn.rnn import *

#Input Params
input_dim = 2

##The Input Layer as a Placeholder
#Since we will provide data sequentially, the 'batch size'
#is 1.
input_layer = tf.placeholder(tf.float32, [1, input_dim])

##First Order Derivative Layer
#This will store the last recorded value
last_value1 = tf.Variable(tf.zeros([1, input_dim]))
#Subtract last value from current
sub_value1 = tf.sub(input_layer, last_value1)
#Update last recorded value
last_assign_op1 = last_value1.assign(input_layer)

##Second Order Derivative Layer
#This will store the last recorded derivative
last_value2 = tf.Variable(tf.zeros([1, input_dim]))
#Subtract last value from current
sub_value2 = tf.sub(sub_value1, last_value2)
#Update last recorded value
last_assign_op2 = last_value2.assign(sub_value1)

##Overall input to the LSTM
#x and its first and second order derivatives as outputs of
#earlier layers
zero_order = last_assign_op1
first_order = last_assign_op2
second_order = sub_value2
#Concatenated
total_input = tf.concat(1, [zero_order, first_order, second_order])



If you have an idea of what might be going wrong, do leave a comment! In any case, the core model follows.

### The Model

So heres the the TensorFlow model:

1) The Imports:


#Imports
import tensorflow as tf
from tensorflow.models.rnn.rnn import *



2) Our input layer, as always, will be a Placeholder instance with the appropriate type and dimensions:


#Input Params
input_dim = 2

##The Input Layer as a Placeholder
#Since we will provide data sequentially, the 'batch size'
#is 1.
input_layer = tf.placeholder(tf.float32, [1, input_dim*3])



3) We then define out LSTM layer. If you are new to Recurrent Neural Networks or LSTMs, here are two excellent resources:

1. This blog post by Christopher Olah
2. This deeplearning.net post. It defines the math behind the LSTM cell pretty succinctly.

If you like to see implementation-level details too, then heres the relevant portion of the TensorFlow source for you.

Now the LSTM layer:


##The LSTM Layer-1
#The LSTM Cell initialization
lstm_layer1 = rnn_cell.BasicLSTMCell(input_dim*3)
#The LSTM state as a Variable initialized to zeroes
lstm_state1 = tf.Variable(tf.zeros([1, lstm_layer1.state_size]))
#Connect the input layer and initial LSTM state to the LSTM cell
lstm_output1, lstm_state_output1 = lstm_layer1(input_layer, lstm_state1,
scope=&quot;LSTM1&quot;)
#The LSTM state will get updated
lstm_update_op1 = lstm_state1.assign(lstm_state_output1)



We only use 1 LSTM layer. Providing a scope to the LSTM layer call (on line 8) helps in avoiding variable-scope conflicts if you have multiple LSTM layers.

The LSTM layer is followed by a simple linear regression layer, whose output becomes the final output.


##The Regression-Output Layer1
#The Weights and Biases matrices first
output_W1 = tf.Variable(tf.truncated_normal([input_dim*3, input_dim]))
output_b1 = tf.Variable(tf.zeros([input_dim]))
#Compute the output
final_output = tf.matmul(lstm_output1, output_W1) + output_b1



We have finished defining the model itself. But now, we need to initialize the training components. These help fine-tune the parameters/state of the model to make it ready for deployment. We won’t be using these components post training (ideally).

4) First, a Placeholder for the correct output associated with the input:


##Input for correct output (for training)
correct_output = tf.placeholder(tf.float32, [1, input_dim])



Then, the error will be computed using the LSTM output and the correct output as the Sum-of-Squares loss.


##Calculate the Sum-of-Squares Error
error = tf.pow(tf.sub(final_output, correct_output), 2)



Finally, we initialize an Optimizer to adjust the weights for the LSTM layer. I tried Gradient Descent, RMSProp as well as Adam Optimization. Adam works best for this model. Gradient Descent works really bad on LSTMs for some reason (that I can’t grasp right now). If you want to read more about Adam-Optimization, read this paper. I decided on the learning rate of 0.0006 after a lot of trial-and-error, and it seems to work best for the number of iterations I use (100k).


##The Optimizer



5) Finally, we initialize the Session and all required Variables as always.


##Session
sess = tf.Session()
#Initialize all Variables
sess.run(tf.initialize_all_variables())



The Training

Here’s the rudimentary code I used for training the model:


##Training

actual_output1 = []
actual_output2 = []
network_output1 = []
network_output2 = []
x_axis = []

for i in range(80000):
input_v, output_v = get_total_input_output()
_, _, network_output = sess.run([lstm_update_op1,
train_step,
final_output],
feed_dict = {
input_layer: input_v,
correct_output: output_v})

actual_output1.append(output_v[0][0])
actual_output2.append(output_v[0][1])
network_output1.append(network_output[0][0])
network_output2.append(network_output[0][1])
x_axis.append(i)

import matplotlib.pyplot as plt
plt.plot(x_axis, network_output1, 'r-', x_axis, actual_output1, 'b-')
plt.show()
plt.plot(x_axis, network_output2, 'r-', x_axis, actual_output2, 'b-')
plt.show()



Training takes almost a minute on my Intel i5 machine.

Consider the first wave. Initially, the network output is far from the correct one(The red one is the LSTM output):

But by the end, it fits pretty well:

Similar trends are seen for the second wave:

### Testing

In practical scenarios, the state at which you end training would rarely be the state at which you deploy. Therefore, prior to testing, I ‘fastforward’ both the waves first. Then, I flush the contents of the LSTM cell (mind you, the learned matrix parameters for the individual functions don’t change).


##Testing

for i in range(200):
get_total_input_output()

#Flush LSTM state
sess.run(lstm_state1.assign(tf.zeros([1, lstm_layer1.state_size])))



And here’s the rest of the testing code:


actual_output1 = []
actual_output2 = []
network_output1 = []
network_output2 = []
x_axis = []

for i in range(1000):
input_v, output_v = get_total_input_output()
_, network_output = sess.run([lstm_update_op1,
final_output],
feed_dict = {
input_layer: input_v,
correct_output: output_v})

actual_output1.append(output_v[0][0])
actual_output2.append(output_v[0][1])
network_output1.append(network_output[0][0])
network_output2.append(network_output[0][1])
x_axis.append(i)

import matplotlib.pyplot as plt
plt.plot(x_axis, network_output1, 'r-', x_axis, actual_output1, 'b-')
plt.show()
plt.plot(x_axis, network_output2, 'r-', x_axis, actual_output2, 'b-')
plt.show()



Its pretty similar to the training one, except for one small difference: I don’t run the training op anymore. Therefore, those components of the Graph don’t work at all.

Here’s the correct output with the model’s output for the first wave:

And the second wave:

Thats all for now! I am not a deep learning expert, and I still experimenting with RNNs, so do leave comments/suggestions if you have any! Cheers!

The Nelder-Mead method is a heuristic optimization technique, just like Genetic Algorithms or Particle Swarms. Optimization in this context refers to the problem of finding point(s) with the optimal value of an objective function in a search space. Optimal could mean a maximum or minimum value.

Whenever possible, we try to solve an optimization problem with an analytical method. For example, if you have the exact definition of $f(x)$ as a differentiable function of $x$, you could use simple derivatives to compute the value of $x$ at which $f$ attains the optimal value. However, in many real-world problems, we don’t have such a crystal-clear notion of the function we are optimizing. A simple example would be the value of a stock market portfolio as a function of the stocks (and their weightages) in the portfolio, and time. Here, you have no way to define how the value of the overall portfolio is going to behave, and therefore analytical methods don’t make sense.

Heuristic optimization techniques try to remedy this by having a degree of randomness to them. This means that the same algorithm run with the same problem and the same initial conditions could yield different results with each run. Heuristic methods explore the search space stochastically, and keep a note of the ‘best’/most optimal points they come across on the way. Of course, the behaviour is not totally random, but guided by a set of rules.

The most important thing to remember is that heuristic methods never guarantee that they will find the most optimal solution. They give as an output the best point they came across during their search – which might have been a local optimum for all you know. But in practical scenarios, we usually concern ourselves with finding good solutions anyways- we don’t need them to be the best. In fact, finding the best solution with absolute certainty might be impossible/impractical in most cases (think of the stock market portfolio example again).

The Nelder-Mead method uses a geometrical shape called a simplex as its ‘vehicle’ of sorts to search the domain. This is why the technique is also called the Simplex search method. In layman’s terms, a simplex is the n-dimensional version of a ‘triangle’.

For 1 dimension, its a line.

For 2 dimensions, its a triangle.

For 3 dimensions, its a tetrahedron.

And so on…

The shape doesn’t have to be symmetrical/equilateral in any way. Note that an n-dimensional simplex has (n+1) vertexes.

The Algorithm

Nelder-Mead starts off with a randomly-generated simplex (We will discuss how, later). At every iteration, it proceeds to reshape/move this simplex, one vertex at a time, towards an optimal region in the search space. During each step, it basically tries out one or a few modifications to the current simplex, and chooses one that shifts it towards a ‘better’ region of the domain. In an ideal case, the last few iterations of this algorithm would involve the simplex shrinking inwards towards the best point inside it. At the end, the vertex of the simplex that yields that most optimal objective value, is returned.

The following video of a 2-D Nelder-Mead optimization should give you an intuitive understanding of what happens:

How does it work?

Lets assume we are dealing with an $n$-dimensional space. The current simplex consists of the following $n + 1$ points: $x_1$, $x_2$, …, $x_{(n + 1)}$. The function we are trying to optimize(maximize) is $f(x)$. The algorithm takes the following steps during every iteration:

I. Ordering

All the points are ordered/sorted, such that the value of $f$ for the first point is the lowest, and that for the last point is the highest. Let the indices of the first(worst), second(second worst) and last(best) points be $h, s, l$ respectively.

II. Computing the Centroid

Consider all points except the worst ($x_h$). Compute their centroid(mean) as:

$c = \frac{1}{n} \sum_{i \neq h} x_i$

III. Transformation

This is the most important part of the whole process – transforming the simplex. Transformation takes place in one of the following ways:

i. Reflection

This is the first type of transformation that is tried out. You compute the ‘reflected’ point as:

$x_r = c + \alpha (c - x_h)$

$\alpha$ is called the reflection parameter, and is usually equal to 1. $x_r$ is essentially a point on the line joining $c$ and $x_h$, but located away from $x_h$. This is done in an attempt to move the simplex in a direction thats away from the sub-optimal region around $x_h$. For a 2-D problem, heres what reflection looks like:

Now, if $f(x_s) < f(x_r) \leq f(x_l)$ – that is, that $x_r$ is better than the second-worst point, but not better than the current best, we just replace $x_h$ with $x_r$ in the simplex, and move on to the next iteration.

ii. Expansion

If our luck is great and the reflected point $x_r$ happens to be better than the current best ($f(x_r) > f(x_l)$), we try to explore the possibility further. In other words, we move a little bit more in the direction of $x_r$ from $c$, in the hope of finding an even better solution. The expanded point is defined as:

$x_e = c + \gamma (x_r - c)$

$\gamma$ is called the expansion parameter, and its value in most implementations is 2. Heres what 2-D Expansion looks like:

We then replace $x_h$ with the better of the two points: $x_e$ and $x_r$ in the simplex. This is called ‘greedy optimization‘ – we have replaced the worst point with the better of the two options we had. Theres another variant called ‘greedy expansion‘, which replaces $x_h$ with $x_e$, as long as $f(x_e) > f(x_l)$. This is done irrespective of whether $x_e$ is better than $x_r$. The intuition for this, comes from the fact that Expansion always leads to a bigger simplex, which means more exploration of the search space.

iii. Contraction

Suppose our reflection point was worst than $x_s$ (i.e. the second worst point). In that case, we need to realise that the direction defined by $x_r$ may not be the ideal one for moving. Therefore, we end up contracting our simplex.

The contraction point $x_c$ is defined as:

$x_c = c + \beta (x_h - c)$

$\beta$ is called the contraction parameter, and is usually equal to 0.5. In essence, we try to go against our exploration that we tried with $x_r$, instead contracting the simplex ‘inwards’. A 2-D contraction looks like this:

If $f(x_c) > f(x_h)$, which means that the contracted point is better than the current-worst, then we replace $x_h$ with $x_c$ in the simplex. However, if this is not the case, then we go to our last resort: the Shrink contraction.

iv. Shrink contraction

In this case, we redefine the entire simplex. We only keep the best point ($x_l$), and define all others with respect to it and the previous points. The $j$th new point, will now be defined as:

$x_j = x_l + \delta (x_j - x_l)$

$\delta$ is called the shrinkage parameter, and is equal to 0.5 in most implementations. What we are essentially doing with the above definition, is moving each point in the simplex towards the current best point, in the hope of converging onto the best neighbourhood. A 2-D Shrinking transformation looks like:

It is important to note that this transformation is the most expensive of all the transformations, since we have to replace multiple (all but one) points in the simplex. However, it has been found (after multiple experiments), that the shrink transformation rarely happens in practice. As a result, many implementations of Nelder-Mead simply skip this step, doing the Contraction instead.

Termination

Termination is usually reached when:

1. A given number of iterations is reached

2. The simplex reaches some minimum ‘size’ limit.

3. The current best solution reaches some favorable limit.

The Initial Simplex

Like in the case of most heuristic techniques, theres no ‘best’ way to do the random initialization for Simplex search. However, heres one that supposedly works well (it is used in the MATLAB implementation of Nelder-Mead):

The initial point, $x1$, is provided by the user if he has some idea about a possible good solution. If not, its taken to be some random point. This is called the ‘guess‘ vertex. Now, remember we have n dimensions, and we need (n + 1) points – out of which 1 has already been decided. The other end points are initialized with respect to $x_1$, at a small distance along the direction of each dimension.

The (i + 1)th point $x_{(i + 1)}$, as defined using the $i$th dimension unit vector $u_i$, is given by:

$x_{(i + 1)} = x_1 + h(x_1, i) * u_i$

$h(x_1, i)$ is defined as:

$h(x_1, i) = 0.05$ if the coefficient of $u_i$ in the definition of $x_1$ is non-zero.

$h(x_1, i) = 0.00025$ if the coefficient of $u_i$ in the definition of $x_1$ is zero.

Other Points

1. Nelder-Mead is generally used for parameter selection in Machine Learning. In essence, data scientists use the Simplex Method to figure out the optimum parameters for their models. This is mainly because this method tends to optimize the objective function fairly fast, and in an efficient way (especially in implementations where Shrinking is not used).

2. Even though Nelder-Mead tends to optimize the objective fairly fast (with few iterations), it tends to get stuck in local optima. In such cases, it may remain stuck in a sub-optimal neighbourhood for a long time, and the only solution to this is to restart the algorithm. Why this happens is fairly easy to understand. Once the simplex enters a region of local optimum, it will only keep contracting/shrinking, instead of randomly exploring other(far-away) parts of the search space. As a result, its becomes impossible for the algorithm to explore any further.

References

# Redis + Celery: Reactive Computing in Django for IoT applications

To quote Wikipedia, Reactive Programming is a programming paradigm oriented around data flows and the propagation of change.

Consider the simple equation $a = b + c$. This relationship defines a flow of data, in that it describes how the data for $a$ is computed given the data for $b$ and $c$. Suppose the values of $b$ and $c$ at time t=0 are 4 and 5 respectively. Therefore, $a_{t=0} = 9$. Now, lets say that the value of $b$ changes at t=1 so that $b_{t=1} = 10$. By the definition of $a$ mentioned before, it follows that $a_{t=1}=15$. Thus, the change in the value of $b$ got propagated, causing the value of its ‘parent’ $a$ to change. I use the word parent because if we were to draw the relationship between $a$, $b$ and $c$ as a tree, $a$ be the parent of $b$ and $c$ (like in a data flow graph). A common example of reactive programming is seen in spreadsheet software like MS-Excel, where the change in the value of a particular cell causes a change in the values of cells that depend on it by some pre-defined equation.

RxJS and Meteor are two javascript-based frameworks that implement reactive programming for web-based applications.

Reactivity, in general, is a paradigm that’s difficult to find in Python-based systems, and that includes Django. This post describes a method to perform mathematical computations (not reactive programming in general, just mathematical calculations) in a reactive manner in Django-based systems. I will soon describe the exact problem solved by my code, with suggestions on how to modify it to suit your needs.

Prerequisites

Heres the prerequisites, apart from Django:

i. We will use Celery to perform tasks asynchronously.

ii. We will use Redis as for temporary data storage, and synchronization.

This is a good place to look if you want to set up both.

iii. We will use python-redis-lock to use Redis for locking mechanisms. Locking ensures that one data update/read doesn’t get messed up due to another update that’s running in parallel.

The ‘Problem Statement’

Consider the two equations:

$A = C + D + E$ …(1)

$B = A * F$ …(2)

We will refer to $A$, $B$, $C$, $D$, $E$ and $F$ as computational nodes. $C$, $D$, $E$ and $F$ are input values that we acquire from certain sources of data (whatever they might be). They are essentially the independent variables in this context. $A$‘s value is derived $C$, $D$ and $E$ (making it their ‘parent’ node). You can also say that $C$, $D$ and $E$ are the arguments for $A$. $B$ in turn has two dependencies, $A$ and another independent variable $F$.

You don’t know when you will receive the values of the independent variables. You also don’t know what order they will arrive in. Some of them may come together, some very frequently, etc. etc. Lets say that your application dictates that the value of any node remains valid for about a minute its received or computed. Whenever you have had the latest values of $C$, $D$ and $E$ come within the past 60 seconds, you would want to compute the value of $A$. The same logic applies to $B$.

Once the latest value of $A$ is computed, you would like the node to ‘forget’ the values of its arguments(children). Therefore, whenever comes the next time that we have fresh values of $C$, $D$ and $E$ together, $A$ will be calculated again. The same logic once again applies to $B$. The code can be easily modified (and I will mention how) so that this restriction is lifted. In that case, $A$ would be recomputed whenever the value of any of $C$, $D$ and $E$ changes, provided that the other values are still valid.

Whenever you receive/compute the value of a computational node (whether independent or derived), you would want to store its value – along with the appropriate timestamp – in your database.

Now lets see how this data-flow logic can be implemented in a flexible, efficient manner.

The Implementation

The whole functionality will be enclosed in a Django App called ‘reactive_compute’. It will have the following files:

i. models.py

iii. db.py

iv. functions.py

v. functions.py

and obviously

vi. __init__.py

First of all, we will need to define the derived nodes $A$ and $B$, in terms of their Function and Number of Arguments. For example, $A$‘s function can be described as an Addition, with the number of arguments being 3. $B$ on the other hand is a Multiplication, with 2 arguments. To remember this, we will first need to define the database model in models.py. You don’t store this information in Redis, since its ‘permanent’.

The model:

from django.db import models

class FunctionType(models.Model):
"""
Stores information about the type of function/computation
associated with a node.
The function name should have a corresponding mapping in
reactivecompute.functions.Functions.mapping
"""

node = models.CharField('Node', max_length=25)
functionname = models.CharField('Function Name', max_length=25)
noofargs = models.IntegerField('Number of Arguments')


As you can see, $functionname$ is a String that will act as an identifier for the Function. We will have “$ADD$” for Addition, and “$MUL$” for Multiplication. But there must be someplace to implement those functions. Thats where the functions.py files comes in.


##File with all computational functions

"""
"""
return sum(args)

def sub(args):
"""
Subtraction Function.
"""
return (args[0] - args[1])

def mul(args):
"""
Multiplication function.
"""
prod = 1
for arg in args:
prod *= arg
return prod

class Functions(object):
"""
Class to store mapping between function name and function
implementation.
"""

mapping = {
'SUB': sub,
'MUL': mul
}



Adding the mapping as a dictionary to the $Functions$ class allows easy importing. Each of the functions takes in the arguments(in appropriate order) as a list.

The following screenshot shows the appropriate database entries made for $A$ and $B$ in the Django Admin.

Now that we have defined the ‘nodes’ we need to direct the appropriate args to them. This is done by creating another model in models.py, for storing the data-flow information as tree ‘edges’. Every child is an argument for the parent.


class Dependency(models.Model):
"""
Models a Parent-Child relationship between computational
nodes. Also defines the order in which the children are to be
arranged, to get the Parent's value.
"""

parent = models.CharField('Parent Node', max_length=25)
child = models.CharField('Child Node', max_length=25)
argno = models.IntegerField('Argument Number')



$argno$ doesn’t really matter for the operations of multiplication and addition, but for others(like subtraction), it might.

Heres a Django-Admin screenshot with the required entries for $A$ and $B$.

As I mentioned earlier, we would like to store the values of the nodes, whenever they are updated. So heres a model for that:


class DataLog(models.Model):
"""
Stores information about the type of function/computation
associated with nodes.
"""

node = models.CharField('Node', max_length=25)
timestamp = models.DateTimeField('Time Stamp')
value = models.FloatField('Value', null=True)



And heres the contents of the db.py file, that define a function to store the latest value of a node into the database:


from reactive_compute.models import DataLog
from django.utils import timezone

def save_node_value(node, value):
"""
Saves the latest computed value of the given node.
"""
data_entry = DataLog(node=str(node),
timestamp=timezone.now(),
value=value)
data_entry.save()



To present a complete example, the following are the contents of the admin.py file.

from django.contrib import admin
from reactive_compute.models import *

list_display = ('parent', 'child', 'argno')
list_filter = ['parent', 'child']
search_fields = ['parent', 'child']

list_display = ('node', 'functionname', 'noofargs')
list_filter = ['functionname']
search_fields = ['node', 'functionname']

list_display = ('node', 'timestamp', 'value')
list_filter = ['node', 'timestamp']
search_fields = ['node']



And finally, we come to the crux of the implementation – the tasks.py file. If you read the resource I mentioned for Celery installation, you know that the Celery task is defined in this file. I have heavily commented it, so do go through the in-line docs.


from __future__ import absolute_import
from reactive_compute.models import Dependency, FunctionType
from reactive_compute.db import save_node_value
from reactive_compute.functions import Functions
from redis import StrictRedis
import redis_lock

def compute_nodes(nodes, timeout=60):
"""
Computes values for all computational nodes as defined in the
FunctionType model. Recursive calls are made based on information
present in respective Dependency entries.

'nodes' should be a dict mapping name of computational node, to:
A floating point number as input, OR
None, if the value is to be (possibly) computed.
'timeout' defines the time interval for which the values of the
nodes mentioned in 'nodes' will be valid. Default is a minute.
"""

#First handle the boundary case
if len(nodes) == 0:
return None

#Get a connection to Redis
conn = StrictRedis()

#This will contain all the parent nodes that we will try to compute
#recursively based on the args currently provided.
parents = set([])

#Default initialization for Celery
value = None

#Iterate over all nodes
for node in nodes:
##First obtain the value of the node
if nodes[node] is not None:
#Value has been provided as input
try:
#Ensure the given value can be parsed/converted to
#a float.
value = float(nodes[node])
except:
#If representing the value as a float fails,
#skip this one.
continue
else:
#Value has to be computed.

#First acquire lock for the particular node.
#This ensures that the inputs needed for computing
#the current node don't get modified midway(unless
#one expires, in which case the computation may or may
#not go through).
lock = redis_lock.RedisLock(conn, node + '_lock')
if lock.acquire():
try:
#This will indicate if all args are present for
#computation of the result
all_args_flag = True
#This will store all the required arguments in order
args = []
#Get the pertinent FunctionType instance
func_info = FunctionType.objects.get(node=node)
#Iterate over all arguments
for i in range(func_info.noofargs):
#Get Redis value
v = conn.get(node + '_' + i)
if v is None or v == 'None':
#If value not present stop iterations
all_args_flag = False
break
else:
args.append(float(v))
#If any arg was absent, abort processing of this node
if not all_args_flag:
continue
#Compute the value, since all args are present
value = Functions.mapping[func_info.functionname](
args)
for i in range(func_info.noofargs):
conn.delete(node + '_' + i)
except:
pass
finally:
#Release lock
lock.release()

##Now that the value has been obtained, update the args info
##for all parent nodes
parent_objs = Dependency.objects.filter(child=node)
for parent_obj in parent_objs:
#Get lock for parent
lock = redis_lock.RedisLock(conn, parent_obj.parent + '_lock')
if lock.acquire():
try:
#Set value
conn.set(parent_obj.parent + '_' + parent_obj.argno,
value)
#Set expiry time
conn.expire(parent_obj.parent + '_' + parent_obj.argno,
timeout)
except:
pass
finally:
#Release lock
lock.release()
#Add this parent to the set of parents to process

#Save value in database as needed
save_node_value(node, value)

#Make the recursive call on parent nodes
compute_nodes.delay(dict((parent, None) for parent in parents),
timeout)



Given below is an example of using the above task in a simple Django view. It accepts the complete input as a stringified JSON object (like “$\{"C": 4, "D": 5\}$“), and makes a call to the Celery task.

from django.http import HttpResponse
from django.views.decorators.csrf import csrf_exempt

@csrf_exempt
def input_values(request):
#Get JSON-ed values dictionary
value_dict = eval(request.GET.get('input'))
#Make Celery call
compute_nodes.delay(value_dict)

return HttpResponse("OK")



Lets run the server and see how things work.
Heres that the data-logs look like given the input “$\{"C": 4, "D": 5\}$“.

Within a minute, we provide “$\{"E": 6\}$“.

Notice how $A$ gets computed as well. Now let us provide all inputs, with new values: “$\{"C": 1, "D": 1, "E": 1, "F": 3\}$“.

As you must have observed, $B$ is computed with the latest value of $A$, which is 3.

One unusual flexibility that this codes provide, is that we can update the values of nodes that are ‘usually’ derived, manually. Heres what happens if we provide “$\{"A": 4, "F": 5\}$“.

You don’t need to Copy-Paste this code anywhere, its all present in my github repo. Feel free to fork it and send a PR if you can modify it in some useful way!

Thoughts

1. The way its implemented, you have complete control over how each computational node works. I have used simple mathematical operations, but you could have your own metrics/functionality written in. As long as you define the arguments and order them right, things will work well.

2. I have worked with floating-point numbers, but your arguments could as well be vectors! Then you might have to run eval over your Redis values, instead of a simple float() conversion. But since every call is made as a Celery task, your runtime will still be optimized.

3. Earlier, I mentioned that you could remove the restriction of deleting previous values of arguments (as long as they are valid). You could do this pretty easily by removing lines 85-87 in the tasks.py code. If you don’t want your node values to have a sense of time-based validity at all, you could just remove lines 106 and 107 from tasks.py.

4. If you want different values to be stored in different ways into the database, you could add another parameter to the $FunctionType$ model, that specified a String that denotes how information about that particular node is added to the database. Ideally, this database storing would also be a Celery task.

5. You could also store the Data-Flow diagram better using a No-SQL database such as MongoDB.

Thats all for now :-). Cheers!

# Fuzzy Speciation in Genetic Algorithms using k-d Trees

You would need to know the basics of Genetic Algorithms to understand this post. AI Junkie’s tutorial is a good place to get to upto speed. You would also need to know what k-D Trees are, for which the Wiki article is a good reference.

What is speciation?

Speciation, in simple terms, is an evolutionary process by which new biological species arise. This concept is used in the implementation of evolutionary algorithms (such as GAs) as a form of optimization. To understand speciation, its necessary to understand what we mean by ‘species’.

A species is a group of organisms that – 1) possess some common prominent traits, and 2) can breed/mate with one-another. For example, consider cats and leopards. They share many characteristics, and hence exist in the same ‘family’ of animals. However, since they can’t really mate to provide a cat+leopard offspring, they don’t belong to the same species. The part 2) of the above definition may not be applicable in all cases (such as asexual plants). Nevertheless, it is what is important to us in the context of evolutionary algorithms.

Basically, a genetic algorithm (that implements speciation) will divide the population of candidates into a number of species. Essentially, each species will be a group of solutions that are allowed to crossover with each other. Candidates belonging to different species rarely mate.

Why do we need speciation?

A Genetic Algorithm chiefly deals with two genetic operations: Crossover and Mutation. Mutation works on a single candidate solution by perturbing it in some small way. Crossover, on the other hand, takes two candidate ‘parents’ (selected based on their fitness) and produces offspring(s) that share parts of their genome with the parents. This seems good enough, but it has some serious pitfalls.

Many times, crossover has a pretty high rate of failure (that is, it produces very inferior solutions). The reason for this, in simple terms, is: though both parents may individually have very high levels of fitness, the result of combining random parts of their genome together can be sub-optimal. This tends to happen when the parents come from distant parts of the search space, causing the offspring to lie in some entirely new region that has no optima in the vicinity. For example, imagine you are trying to evolve the fastest mammal. During a crossover, you select a deer (a fast mammal), and a cheetah(the fastest indeed). In this case, even if you could breed their offspring, it may have a random mixture of the anatomical features of its parents thats disastrous from the point of view of speed. In other words, mating fit parents does not guarantee a fit offspring, if the properties of the parents are vastly different.

In cases like these, speciation comes to the rescue. By restricting mating between greatly dissimilar candidates, evolution of sub-par candidates is minimized.

How is speciation implemented?

Usually, the two most common ways of implementing speciation in evolutionary algorithms are: Threshold Speciation and K-Means/Medoids clustering. Both are explained pretty well in Jeff Heaton’s book on Nature-Inspired algorithms in Artificial Intelligence.

The key feature of speciation is the fact that it prevents mating among candidates with highly differing characteristics. This can be thought of as promoting breeding between candidates that lie close to each other in the search-space. To achieve the same effect, I have been using k-d trees to perform nearest-neighbour searches to provide a form of speciation to my implementation of GAs. Its pretty simple(and intuitive) in terms of its approach, doesn’t degrade performance by too much and has been providing good results so far. So, here’s a basic outline of it for you all.

Implementation using k-D Trees

I will outline the algorithm first, and point out the nuances later.

First off, you must have a way of computing the ‘distance’/dissimilarity between candidate solutions. If your genome has a fixed length, you can use Euclidean distance.

Extra attributes required: 1. Starting Temperature (T), and 2. Neighbourhood Value(V). T must be in the range (0. 1). The higher this value, the greater is the tendency to neglect speciation (especially in the earlier iterations). The Neighbourhood Value V denotes the number of nearest candidates that will be considered in searches.

Algorithm:

I) The following selection process is repeated in each iteration until the new population size does not reach the required value:

Initial Values: Temperature t = T

i. Pick a random candidate parent based on fitness (using something like Roulette-Wheel selection, or Tournament selection)

ii.a. Pick a random number in (0, 1). If it is lesser than the current value of t, pick another parent randomly using the same selection process used in step i. If not, goto step ii.b.

ii.b. Using the k-D Tree trained over candidates from the current iteration, find the V nearest candidates to the parent selected in step i. Choose one randomly (using the process employed in step i), based on fitness.

iii. Use crossover with parents produced in steps i and ii. Mutate as/when appropriate. Add offspring to the pool of candidates of the next generation.

II) Once the new population grows to the required size, a new k-D Tree is generated using the set of all new candidates for the next iteration.

III) Decrease t for the next iteration.

Pointers

1) First of all, you need to have a way to compute dissimilarity between candidate solutions (apart from their fitness). As mentioned, if your genome size is constant, you could always use good-old Euclidean distance. If not, you might have to be creative. If you are using Scikit-Learn’s implementation of k-D Trees like me, you could check out the metrics it supports by doing  kd_tree.valid_metrics  over an instance.

2) Because the selection of the first parent in step i. depends only by the global measure for fitness, parts of the search space that generate good solutions will have higher numbers of candidates in subsequent iterations.

3) Unlike using K-Medoids for speciation, you don’t need to have a strict definition of the species that a given candidate belongs to. Nearest-neighbour searching ensures that a candidate usually mates only with other candidates that are similar to it.

4) As you might have understood, the Temperature parameter only adds some amount of exploratory tendencies to the algorithm. The higher the temperature, the greater is the tendency to ignore speciation. The decrease in t could most likely be done linearly from the user-provided starting point T, to a pre-defined minimum value(like 0.05), across iterations.

Effects on Performance of the GA

As mentioned in the discussion here, it doesn’t make much sense to analytically define the time-complexity of GAs. However, let me give you a rough idea of how things will change. Assume the total size of the population is $n$. First off, Roulette-Wheel selection (done in step i) takes $O(n)$ time if done in a naive way (going through all candidates in some order one by one). If you do it using the stochastic acceptance method, it takes $O(1)$ time. Usually, this selection will be done twice during every crossover. In case of the algorithm outlined above, for approximately $(1-t)$ fraction of the second crossovers, the Roulette-wheel selection will be replaced by an $O(log(n))$ search using the k-D Tree and a much smaller (almost $O(1)$) Roulette Wheel selection amongst the nearest V candidates. Hence, unless your population size is very, very large, this won’t affect time that drastically.

The only significant increase in time will be seen as a result of the k-D Tree learning at the end of each iteration, which would add an $O(nlog(n))$ time to the time required for generating the next-gen pool of candidates (which is $O(n^2)$ for naive implementations anyways). Since $O(n^2 + nlog n)$ is equivalent to $O(n^2)$, the performance decrease isn’t that significant, provided the size of your population isn’t really high. Even if you were using a highly efficient selection procedure during crossovers, for reasonable population sizes, your total run-time would only be multiplied by a factor less than 2 (mainly because crossover and mutation by themselves take some time too).

Thats all for now! Keep reading and cheers 🙂

# Using Circular Self-Organizing Maps to solve the Symmetric TSP

Prerequisites:

1. Knowing what the Traveling Salesman Problem (TSP) is. The wiki article is a good place to start. The symmetrical form of the problem is where the distance from one city to another is the same in both directions.

2. A decent understanding of what Kohonen/Self-Organizing Maps are. This blog post will point you in the right direction to learn more.

After I wrote my former blog post on implementing SOMs with TensorFlow, someone on Reddit mentioned that there are ways to solve TSPs using them. When I did some research, it appeared that there have been a few attempts to do so. Heres one, and heres another. So in my spare time, I decided to try and have a crack at it myself.

I will describe my approach (along with the code) in a short while, but let me give a disclaimer first. This post in no way claims to have found the ‘best’ way to solve TSPs (ACOs are pretty much the state of the art). It just describes my attempt, and the results I obtained (which were interesting, with respect to the methodology implemented).

I obviously make a few assumptions, and here they are:

1. Every city has a corresponding vector available, denoting its coordinates/location in a 2D/3D space. This source presents many TSP benchmark data-sets in this format. I also tried assigning every city a vector where every dimension is its distance from all the cities(including itself). This gives pretty sub-par results, so I decided not to pursue that. Maybe you could use a dimensionality reduction technique like PCA on these ‘distance’ vectors and get good results.

2. Because of the above point, my method is pretty much restricted to solving the symmetric form of a TSP (atleast the way I have done it).

The Method

Suppose the total number of neurons is $N$. The total number of cities in question is $C$. After some trial and error, I set $N = 2C$. It can be higher/lower as well. The lower the number, the less optimal is the understanding of the solution space. And as you go higher, after a certain point, there seems to be no improvement in the output (but your execution time increases many-fold).

The individual neurons are provided locations around a circle, as follows:

$l_i = ( cos(\frac{2i\pi}{N}), sin(\frac{2i\pi}{N}) )$

where $l_i$ is the location vector for the $i$ th neuron in a 2-D space. $i$ goes from $0$ to $N - 1$. It is important to not confuse this space with the space in which the cities lie (even though both are 2-dimensional). Lets call one the Neuron-Space, and the other, the City-Space.

Because of the circular nature of this SOM, we inherently have a closed loop-structure to our framework (as is needed by the TSP solution), where neurons in proximity of each other in the Neuron-Space lie near-by in the City-Space. Also, do note how every neuron essentially corresponds to an angle ( $\frac{2i\pi}{N}$) around the circle. This point would be useful soon.

The initialization of the neuron vectors in City-Space is done by using the mean and standard deviation of the dimensions of the city vectors in City-Space. This is because I observed that the scale of all the dimensions in the city-vectors is pretty much the same. If it isn’t, you might have to resort to building a mean vector and covariance matrix. For training, I use $\alpha = 0.3$ and $\sigma = 0.3$ as parameters for the SOM. Putting a constant value of $\sigma$ ensures that the same proportion of neurons around the Best-Matching Unit get affected everytime, irrespective of the total number of neurons being used.

Once the SOM trained, we can say we have the rough ‘skeleton’ for a good route in the City-Space. But how do you convert this skeleton to the appropriate route? By using those angles I just talked about. Consider a city $c$. Lets say the $k$ nearest neurons to it, are $n_1, n_2, ..., n_k$.

So, we can construct a vector for every city in Neuron-Space, using this simple formula:

$l_c = \sum_k \frac{1}{d_{c, n_k}} l_c$

where $d_{c, n_k}$ is the distance between city $c$ and neuron $n_k$ in City-Space. Using this vector and some pretty simple (inverse) trigonometry, you can assign the city an angle around the circle on which the SOM-neurons themselves lie in Neuron-Space. Now all you need to do to come up with an approximately good solution, is just sort the cities with respect to the angles they are assigned. Voila!

Ofcourse, the solution that this algorithm gives you would most likely be in the neighbourhood of a slightly better one, so I use a standard TSP local search heuristic like 3-opt over the solution obtained from my algorithm. What 3-opt essentially does, is change around the solution obtained from some other algorithm over so slightly in some pre-defined ways, in the hope of hitting on a better solution. 3-opt is better than 2-opt for this algorithm. I dint try higher-order k-opt searches.

One interesting possibility about this algorithm, is the fact that it could provide great solutions to solving TSPs for different target cities, without retraining the SOM– as long as they lie in the same City-Space. As long as the set of cities you try to solve for follows (atleast approximately) the trends shown by the dataset used for training, all you would need to do is compute their respective angles (which does not involve re-training), and just apply 3-opt. In short, you could use the same SOM, once trained, to solve for multiple subsets of cities.

The Implementation

This code builds on my Tensorflow-based SOM code. Its pretty rough around the edges, especially the 3-opt search. It was meant just as a proof-of-concept, so feel free to hack around if you like.


import tensorflow as tf
import numpy as np

def _distance(tsp_solution, distance_matrix):
"""
Total distance for a solution.
"""
total_distance = 0
for i in range(len(tsp_solution)-1):
total_distance += (distance_matrix[tsp_solution[i]][
tsp_solution[i + 1]])
total_distance += (distance_matrix[tsp_solution[0]][
tsp_solution[-1]])

def _3_opt(route, distance_matrix):
"""
Uses 3-opt to find a better solution in the neighbourhood.
"""

current_dist = _distance(route, distance_matrix)
#Find three connections to break
for i in range(len(route) - 1):
for j in range(i + 2, len(route) - 1):
for k in range(j + 2, len(route) - 1):
#Now reconnect the graph in every other way possible,
#and see if any gives a better solution
#Way 1
temp = route[:]
temp[j+1:k+1] = reversed(route[j+1:k+1])
dist = _distance(temp, distance_matrix)
if dist < current_dist:
route = temp
current_dist = dist
#Way 2
temp = route[:]
temp[i+1:j+1] = reversed(route[i+1:j+1])
dist = _distance(temp, distance_matrix)
if dist < current_dist:
route = temp
current_dist = dist
#Way 3
temp = route[:]
temp[i+1:j+1], temp[j+1:k+1] = (
reversed(route[i+1:j+1]), reversed(route[j+1:k+1]))
dist = _distance(temp, distance_matrix)
if dist < current_dist:
route = temp
current_dist = dist
#Way 4
temp = (route[:i+1] + route[j+1:k+1] +
route[i+1:j+1] + route[k+1:])
dist = _distance(temp, distance_matrix)
if dist < current_dist:
route = temp
current_dist = dist
#Way 5
temp = route[:i+1] + route[j+1:k+1]
temp += reversed(route[i+1:j+1])
temp += route[k+1:]
dist = _distance(temp, distance_matrix)
if dist < current_dist:
route = temp
current_dist = dist
#Way 6
temp = route[:i+1]
temp += reversed(route[j+1:k+1])
temp += route[i+1:j+1]
temp += route[k+1:]
dist = _distance(temp, distance_matrix)
if dist < current_dist:
route = temp
current_dist = dist
#Way 7
temp = route[:i+1]
temp += reversed(route[j+1:k+1])
temp += reversed(route[i+1:j+1])
temp += route[k+1:]
dist = _distance(temp, distance_matrix)
if dist < current_dist:
route = temp
current_dist = dist
return route

class TSPSolver(object):
"""
Uses a Circular Self-Organizing Map with a Gaussian Neighbourhood
function and linearly decreasing learning rate, to solve TSP.
"""

#To check if the SOM has been trained
_trained = False

def __init__(self, input_vects, n_iterations=200):
"""
Initializes all necessary components of the TensorFlow
Graph.

'input_vects' should be a list of 1-D NumPy vectors denoting
location vectors for the cities.
'n_iterations' is an integer denoting the number of iterations
that would be run while training the SOM.
"""

#Assign required variables first
self._n = len(input_vects)
dim = len(input_vects[0])
self._n_centroids = self._n*3
alpha = 0.3
sigma = 0.3
self._n_iterations = abs(int(n_iterations))

#For initialization of the centroid(neuron) vectors
init_values = []
for vect in input_vects:
init_values.extend(list(vect))
init_mean = np.mean(init_values)
init_dev = np.sqrt(np.var(init_values))

##INITIALIZE GRAPH
self._graph = tf.Graph()

##POPULATE GRAPH WITH NECESSARY COMPONENTS
with self._graph.as_default():

##VARIABLES AND CONSTANT OPS FOR DATA STORAGE

#Randomly initialized weightage vectors for all neurons,
#stored together as a matrix Variable of size [self._n, dim]
self._weightage_vects = tf.Variable(tf.random_normal(
[self._n_centroids, dim], init_mean, init_dev))

#Matrix of size [self._n, 2] for SOM grid locations
#of neurons
self._location_vects = tf.constant(np.array(
list(self._neuron_locations(self._n_centroids))))

##PLACEHOLDERS FOR TRAINING INPUTS
#We need to assign them as attributes to self, since they
#will be fed in during training

#The training vector
self._vect_input = tf.placeholder("float", [dim])
#Iteration number
self._iter_input = tf.placeholder("float")

##CONSTRUCT TRAINING OP PIECE BY PIECE
#Only the final, 'root' training op needs to be assigned as
#an attribute to self, since all the rest will be executed
#automatically during training

#To compute the Best Matching Unit given a vector
#Basically calculates the Euclidean distance between every
#neuron's weightage vector and the input, and returns the
#index of the neuron which gives the least value
bmu_index = tf.argmin(tf.sqrt(tf.reduce_sum(
tf.pow(tf.sub(self._weightage_vects, tf.pack(
[self._vect_input for i in range(
self._n_centroids)])), 2), 1)),
0)

#This will extract the location of the BMU based on the BMU's
#index
np.array([[0, 1]]))
bmu_loc = tf.reshape(tf.slice(self._location_vects, slice_input,
tf.constant(np.array([1, 2]))),
[2])

#To compute the alpha and sigma values based on iteration
#number
learning_rate_op = tf.sub(1.0, tf.div(self._iter_input,
self._n_iterations))
_alpha_op = tf.mul(alpha, learning_rate_op)
_sigma_op = tf.mul(sigma, learning_rate_op)

#Construct the op that will generate a vector with learning
#rates for all neurons, based on iteration number and location
#wrt BMU.
bmu_distance_squares = tf.reduce_sum(tf.pow(tf.sub(
self._location_vects, tf.pack(
[bmu_loc for i in range(self._n_centroids)])), 2), 1)
neighbourhood_func = tf.exp(tf.neg(tf.div(tf.cast(
bmu_distance_squares, "float32"), tf.pow(_sigma_op, 2))))
learning_rate_op = tf.mul(_alpha_op, neighbourhood_func)

#Finally, the op that will use learning_rate_op to update
#the weightage vectors of all neurons based on a particular
#input
learning_rate_multiplier = tf.pack([tf.tile(tf.slice(
learning_rate_op, np.array([i]), np.array([1])), [dim])
for i in range(
self._n_centroids)])
weightage_delta = tf.mul(
learning_rate_multiplier,
tf.sub(tf.pack([self._vect_input for i in range(
self._n_centroids)]),
self._weightage_vects))
weightage_delta)
self._training_op = tf.assign(self._weightage_vects,
new_weightages_op)

##INITIALIZE SESSION
self._sess = tf.Session()

##INITIALIZE VARIABLES
init_op = tf.initialize_all_variables()
self._sess.run(init_op)

#Train the Solver
self.train(input_vects)

@property
def solution(self):
"""
Returns the TSP solution order as a list, and the total
Euclidean distance with respect to the input vectors.
"""
if not self._trained:
raise ValueError("SOM not trained yet")
return self._sol, self._sol_dist

@classmethod
def _neuron_locations(cls, n_centroids):
"""
Yields one by one the 2-D locations of the individual neurons
in the SOM.
"""

for i in range(n_centroids):
yield np.array([np.cos(i*2*np.pi/float(n_centroids)),
np.sin(i*2*np.pi/float(n_centroids))])

def train(self, input_vects):
"""
Trains the SOM.
'input_vects' should be a list of 1-D NumPy arrays with
dimensionality as provided during initialization of this SOM.
The points in the solution are enumerated as they are ordered
in 'input_vects'.
Current weightage vectors for all neurons(initially random) are
taken as starting conditions for training.
"""

#Training iterations
for iter_no in range(self._n_iterations):
#Train with each vector one by one
for input_vect in input_vects:
self._sess.run(self._training_op,
feed_dict={self._vect_input: input_vect,
self._iter_input: iter_no})

self._input_vects = input_vects
self._compute_solution()
self._trained = True

def _compute_solution(self):
"""
Computes the solution to the TSP with respect to the input
vectors.
"""
#TODO: All nearest-neighbour searches could be speeded up
#significantly with a KD-Tree/Ball Tree.

#Get the centroid data
centroid_vects = list(self._sess.run(self._weightage_vects))
centroid_locations = list(self._sess.run(self._location_vects))

#Distance matrix mapping input point number to list of distances
#from centroids
distances = {}

for point in range(self._n):
distances[point] = []
for centroid in range(self._n_centroids):
distances[point].append(
np.linalg.norm(centroid_vects[centroid] -
self._input_vects[point]))

#Distance matrix mapping input point number to list of distances
#from other input points
point_distances = {}

for point in range(self._n):
point_distances[point] = []
for other_point in range(self._n):
point_distances[point].append(
np.linalg.norm(self._input_vects[other_point] -
self._input_vects[point]))

#Compute angle with respect to each city(point)
point_angles = {}
for point in range(self._n):
total_vect = 0
cents = [j for j in range(self._n_centroids)]
cents.sort(key=lambda x: distances[point][x])
for centroid in cents[:2]:
total_vect += (1.0/(distances[point][centroid]) *
centroid_locations[centroid])
total_vect = total_vect/np.linalg.norm(total_vect)
if total_vect[0] > 0 and total_vect[1] > 0:
angle = np.arcsin(total_vect[1])
elif total_vect[0] < 0 and total_vect[1] > 0:
angle = np.arccos(total_vect[0])
elif total_vect[0] < 0 and total_vect[1] < 0:
angle = np.pi - np.arcsin(total_vect[1])
else:
angle = 2*np.pi - np.arccos(total_vect[0])
point_angles[point] = angle

#Find the rough solution
tsp_solution = [i for i in range(self._n)]
tsp_solution.sort(key=lambda x: point_angles[x])

tsp_solution = _3_opt(tsp_solution, point_distances)

#Compute the total distance for the solution
total_distance = _distance(tsp_solution, point_distances)

self._sol = tsp_solution
self._sol_dist = total_distance



Results

Each of these problem datasets can be found at the source here. All the presented results are over 10 runs. The max runtime was experienced with krA100, around one minute.

Berlin-52: Best Error: 0.0%, Average Error: 4.1%.

krA100: Best Error: 1.6%, Average Error: 2.7%.

eli76: Best Error: 2%, Average Error: 4.4%.

These results aren’t ‘close’ to those provided by ACOs, but they are practically feasible, and show a good promise (Oh now I just sound like I am writing a research paper.) Plus, the algorithm isn’t a time-hogger either (judging by runtimes on my laptop).

Anyways. Thanks for reading! Do drop a comment if you think of an improvement over this work.

# Backpropagation for dummies

Backpropagation is a method of training an Artificial Neural Network. If you are reading this post, you already have an idea of what an ANN is. However, lets take a look at the fundamental component of an ANN- the artificial neuron.

The figure shows the working of the ith neuron (lets call it $N_i$) in an ANN. Its output (also called its activation) is $a_i$. The jth neuron $N_j$ provides one of the inputs to $N_i$.

How does $N_i$ produce its own activation?

1. $N_i$ has stored in itself, a weightage assigned to each of its inputs. Lets say that the weightage assigned by $N_i$ to $N_j$ is $w_{i, j}$. As the first step, $N_i$ computes a weighted sum of all its inputs. Lets call it $in_i$. Therefore,

$in_i = \sum_{k \in Inputs(N_i)} w_{i, k}a_k$ … (1)

where $Inputs(N_i)$ corresponds to all neurons that provide an input to $N_i$.

2. This sum, called the input function, is then passed on to $N_i$‘s activation function. It is called so, because it computes the activation that will be provided by $N_i$ to the units it in turn acts as an input to. Therefore,

$a_i = g(in_i)$ … (2)

Usually, the activation functions of all neurons in an ANN are the same, so we just write $g$ instead of $g_i$.

There are many options for $g$, such as linear functions, rectified linear functions, hyperbolic tan, etc. For the sake of this tutorial, we will use the sigmoid function (I’ll explain why later). So we have

$g(x) = \frac{1}{1 + e^{-x}}$ … (3)

Thus, every neuron in an ANN – except the ones in the lowest-level input layer, and bias neurons – takes in the activation of all its inputs, and provides its own activation as an output.

Once a neural network has been designed, it is the job of the training process to ensure that the individual weightages $w_{i, j}$ given by each neuron to each of its inputs is set just right, so that the whole network gives the right output. Backpropagation is one of the ways to optimize those weights.

Whenever we talk of optimization, it is necessary to properly define what we are optimizing and how we are doing it.

What are we trying to optimize?

The error function. This function defines how much the output of our model (in this case an ANN) defers from the required output. For the sake of this tutorial, we will use the Mean-Squared Error(MSE) function. For example, let $N_o$ be one of the output neurons of our ANN. For a given training sample, let the expected output be $t_o$. However, $N_o$ provides an output equal to its activation value, that is $a_o$. Therefore, the error as defined by MSE would be

$E = \frac{1}{2} (t_o - a_o)^2$ … (4)

Since $a_o$ by its definition depends on the weightages given by $N_o$ to its inputs, the ultimate valye of $E$ depends on those weightages too.

It is important to remember that every training sample will result in a different error value. It is backpropagation’s goal to minimize (bring as close to zero as possible) the value of $E$ for all training samples on an average. The way this is done, as mentioned before, is by fine-tuning the values of the weights $w_{i, j}$.

But how does backpropagation fine tune these weightage values? By using a technique called Gradient Descent.

Consider a function as follows:

$f(x) = 3x + 4$

The gradient of the function $f$ with respect to the variable $x$ defines how much the value of $f$ will change with a unit increase/decrease in the value of $x$. For the above example, the answer to that is 3. Basically, you are differentiating $f$ with respect to $x$. Generally speaking, a function may be dependent on a variety of parameters, so you take a partial derivative when computing the gradient.

Consider you are on a hill like the one shown in the image below:

The red regions correspond to places of higher function value, while the blue regions correspond to low values. You want to reach the lowest point possible. If you are standing on the red hill, and you can only see a small part of the terrain around you, what will you do? You will naturally take steps in the direction in which the terrain slopes down. With each step you take, you will scan your immediate neighbourhood (which is visible to you), and go in the direction that shows the steepest descent.

In mathematical terms, this ‘direction’ is nothing but the gradient of your height (the target function you are trying to minimize), with respect to your location (x, y) (which are the parameters you are fine tuning). The overall direction in which you move, will be dependent on the gradients with respect to both dimensions, i.e. $\frac{\partial height}{\partial x}$ and $\frac{\partial height}{\partial y}$. That’s pretty much what gradient descent does.

Its called so, because you are descending from a higher value of the target function to a lower value, by following the direction that corresponds to the lowest gradient (and consequently the steepest decrease in function value).

One thing to note is that you would always want to move in the direction shown by the negation of the gradient. Consider the above example of $f(x) = 3x + 4$ . The gradient here is 3, which is positive. So, with every unit increase in $x$, $f(x)$ would increase by 3. Therefore, to reduce $f(x)$, you would decrease the value of $x$ . The opposite is true in scenarios where you want to maximize the target function. In those cases, you move in the direction of the gradient itself.

Coming back to the context of ANN training, backpropagation reduces the error function $E$ by taking help of the gradients with respect to individual weights, $\frac{\partial E}{\partial w_{i, j}}$.

How does backpropagation employ gradient descent?

First off, it is necessary to keep in mind that backpropagation tries to optimize the inter-neuron weights, and not the neurons themselves. Their functionality is pretty much decided when you decide the activation function.

Secondly, the error function value $E$ depends on the current values of the weights and the training sample in question, apart from the definition of the activation function. So to compute the error with respect to a training sample, you first feed it as an input to the network. You will then receive the appropriate output(s) from the output neuron(s). This part is known as  ‘forward propagation‘, since you are going from the input layer to the output layer (through the hidden layers if any). If you have multiple output neurons, the error values corresponding to each will affect the weights appropriately.

Then, you use the network output(s) along with the expected value(s) to calculate the errors using Equation 4 given above. Once the error values are known, the weights are updated using the gradients. This updating occurs first for the out-most weights, and then for lower layers. The intuition behind this, is the fact that the errors with respect to the outmost neurons are known first, and you derive the errors of the inner neurons from them. Each layer’s errors are calculated from the errors of the neurons they provide input to. That is why the second phase is called ‘backward propagation‘, or back-propagation.

So as a first step, lets try to understand how we will calculate the gradient $\frac{\partial E}{\partial w_{i, j}}$ for a given training sample. Remember that  $w_{i, j}$ is the weightage given by $N_i$ to $N_j$.

Using the chain rule in Leibniz’s form,

$\frac{\partial E}{\partial w_{i, j}} = \frac{\partial E}{\partial in_i} \frac{\partial in_i}{\partial w_{i, j}}$ … (5)

From Equations 1 and 5, and using basic differentiation,

$\frac{\partial E}{\partial w_{i, j}} = a_j \frac{\partial E}{\partial in_i}$ … (6)

From the basic definition of the activation function (and the appropriate current weight values), you already know the value of $a_j$. Now consider the second term in the above product. Using the chain rule again on it, we get

$\frac{\partial E}{\partial in_i} = \frac{\partial E}{\partial a_i} \frac{\partial a_i}{\partial in_i}$ … (7)

This is where the whole advantage of using the sigmoid activation function comes into play. The derivative of a sigmoid function with respect to its input, corresponds to a nice expression. Given equation 3, you can easily verify that

$\frac{dg}{dx} = g(x)(1 - g(x))$ … (8)

From Equations 7 and 8,

$\frac{\partial E}{\partial in_i} = a_i (1 - a_i) \frac{\partial E}{\partial a_i}$ … (9)

Putting the information in Equations 6 and 9 together,

$\frac{\partial E}{\partial w_{i, j}} = a_j a_i (1 - a_i) \frac{\partial E}{\partial a_i}$ … (10)

Look at the last term. There are two possible cases for $N_i$ ($N_i$ cannot be an input neuron, since it is receiving input from another neuron) :

I. $N_i$ is an output neuron. In this case, the second term on the RHS is pretty simple, from Equation 4-

$\frac{\partial E}{\partial a_i} = - (t_i - a_i)$ … (11)

where $t_i$ is the expected output from $N_i$ (for the current training sample).

2. $N_i$ is an inner neuron. Consider all neurons $N_k$, for whom $N_i$ acts an an input. Since we are propagating backward, you can therefore assume that we know the values $\frac{\partial E}{\partial a_k}$ for all of them. In fact, since the output of $N_i$ affects the outputs of all these $N_k$ s, the expression for  $\frac{\partial E}{\partial a_i}$ will involve all of them.

Using simple chain rule,

$\frac{\partial E}{\partial a_i} = \sum_{k} \frac{\partial in_k}{\partial a_i} \frac{\partial E}{\partial in_k}$ … (12)

From Equations 12, 1 and 9,

$\frac{\partial E}{\partial a_i} = \sum_{k} w_{k, i} a_k (1 - a_k) \frac{\partial E}{\partial a_k}$ … (13)

Equations 10, 11 and 13 pretty much define the gradient values $\frac{\partial E}{\partial w_{i, j}}$ for all weights in a given ANN. So how are the weights actually updated?

Using the following rule:

The change caused by a given training sample, at time instance ‘t’, for the weight $w_{i, j}$ is given by

$\Delta w_{i, j}^{t} = - \epsilon \frac{\partial E}{\partial w_{i, j}} + \alpha \Delta w_{i, j}^{t - 1}$

Remember me mentioning that we always need to move in the direction of the gradient’s negation? Thats why the minus sign for the first term. The value $\epsilon$ is called the learning rate. We should never be too hasty in changing the weight’s value too drastically based on the gradient with respect to one training sample. Therefore, the gradient value is multiplied by the learning rate (usually well below 0.5) before causing the required change in the weightage value. If the learning rate is too low, the algorithm takes too long to train. On the other hand, if its too high, the algorithm keeps ‘bouncing around’ in the search space. Imagine you are running towards point B from point A. But you run so fast that you cant stop in time at point B, but you actually end up far ahead of it. Then you again run in the opposite direction towards B, once again crossing it and stopping ahead because of your high speed. Thats pretty much what happens if your learning rate is too large.

A variant of backpropagation, called resilient propagation(RPROP) improves on this classic method by having custom deltas with respect to each weightage in a network. It only uses the gradients as references to determine which direction to move in. Therefore, you don’t have to provide the learning rate as a parameter to resilient propagation.

Now look at the second term in the RHS above. The value $\alpha$ is called the momentum. Sometimes, gradient descent tends to get stuck in local optima in the search space. A local optimum is a point where the value is low compared to the region around it, but not all of the domain overall. The momentum term causes the current weight update to be dependent on the last weight update (and therefore not be drastically different), thus preventing getting stuck in local optima in most cases. If it is indeed a global optimum, the search will eventually come back to the required point as the momentum will go on reducing. Nesterov Momentum is another way of optimizing the use of momentum in gradient descent.

How frequently are the weights updated?

There are 3 answers to this:

1. Online Mode

In this mode, the weights are updated after every training sample. In other words, the forward propagation as well as the backward propagation phases are run for each training sample, every single time. This mode is appropriate if you are receiving your training data sequentially. However, if your training set is large, this mode of training can be pretty time-consuming. Moreover, it is not appropriate to update weights after every sample (in most cases), due to the possible presence of outliers.

2. Batch Mode

In this mode, the weights are updated after accumulating gradients with respect to all samples in the training set. In other words, the forward propagation is run with respect to all samples, and the backward phase is run just once with respect to the accumulated results of the entire batch. This is done for multiple iterations (obviously). The batch mode of training is thus more resistant to variance in the training dataset.

3. Stochastic Mode

The stochastic mode adds randomization to the mix. In the mini-batch method of stochastic learning, small sets of samples are chosen randomly from the dataset, and training is done in batch mode over these mini-batches. Stochastic mode with mini-batches provides a nice trade-off between the pros and cons of online and batch modes of training. Moreover, since you are randomly choosing the mini batches (instead of organizing your dataset into pre-defined subsets), you avoid getting stuck in local optima. This is mainly because you are providing weight updates with respect to a potentially different subset of training samples during every iteration.

EDIT 14/01/2016

Xavier’s initialization

Usually, the weights in a Neural Network are initialized to random values prior to training. This is okay if your network is shallow. But for deep learning applications, where the number of hidden layers may be 3 or more, a wrong selection of initial weights can cause convergence to sub-optimal models.

For example, assume that you are using a sigmoid activation function and your initial weights are pretty small (in the 0-1 range). If you manually calculate the activation values in a neural network across 1-2 layers, you will realize that almost all neurons in the topmost layer give activations around 0.5 (the mean value for the sigmoid function). This phenomenon is called ‘saturation’. Its described well in this paper by Glorot and Bengio.

The solution to this problem, is Xavier’s initialization.  This technique tries to optimize the generation of initial weights, by applying simple statistics over the number of input and output connections corresponding to a neuron. A good intuitive explanation of the method is given at Andy’s blog. As a note to my future self, here’s the TensorFlow implementation of Xavier’s algorithm.

# Self-Organizing Maps with Google’s TensorFlow

[This post assumes that you know the basics of Google’s TensorFlow library. If you don’t, have a look at my earlier post to get started.]

A Self-Organizing Map, or SOM, falls under the rare domain of unsupervised learning in Neural Networks. Its essentially a grid of neurons, each denoting one cluster learned during training. Traditionally speaking, there is no concept of neuron ‘locations’ in ANNs. However, in an SOM, each neuron has a location, and neurons that lie close to each other represent clusters with similar properties. Each neuron has a weightage vector, which is equal to the centroid of its particular cluster.

AI-Junkie’s post does a great job of explaining how an SOM is trained, so I won’t re-invent the wheel.

The Code

Here’s my code for a 2-D version of an SOM. Its written with TensorFlow as its core training architecture: (Its heavily commented, so look at the inline docs if you want to hack/dig around)


import tensorflow as tf
import numpy as np

class SOM(object):
"""
2-D Self-Organizing Map with Gaussian Neighbourhood function
and linearly decreasing learning rate.
"""

#To check if the SOM has been trained
_trained = False

def __init__(self, m, n, dim, n_iterations=100, alpha=None, sigma=None):
"""
Initializes all necessary components of the TensorFlow
Graph.

m X n are the dimensions of the SOM. 'n_iterations' should
should be an integer denoting the number of iterations undergone
while training.
'dim' is the dimensionality of the training inputs.
'alpha' is a number denoting the initial time(iteration no)-based
learning rate. Default value is 0.3
'sigma' is the the initial neighbourhood value, denoting
the radius of influence of the BMU while training. By default, its
taken to be half of max(m, n).
"""

#Assign required variables first
self._m = m
self._n = n
if alpha is None:
alpha = 0.3
else:
alpha = float(alpha)
if sigma is None:
sigma = max(m, n) / 2.0
else:
sigma = float(sigma)
self._n_iterations = abs(int(n_iterations))

##INITIALIZE GRAPH
self._graph = tf.Graph()

##POPULATE GRAPH WITH NECESSARY COMPONENTS
with self._graph.as_default():

##VARIABLES AND CONSTANT OPS FOR DATA STORAGE

#Randomly initialized weightage vectors for all neurons,
#stored together as a matrix Variable of size [m*n, dim]
self._weightage_vects = tf.Variable(tf.random_normal(
[m*n, dim]))

#Matrix of size [m*n, 2] for SOM grid locations
#of neurons
self._location_vects = tf.constant(np.array(
list(self._neuron_locations(m, n))))

##PLACEHOLDERS FOR TRAINING INPUTS
#We need to assign them as attributes to self, since they
#will be fed in during training

#The training vector
self._vect_input = tf.placeholder("float", [dim])
#Iteration number
self._iter_input = tf.placeholder("float")

##CONSTRUCT TRAINING OP PIECE BY PIECE
#Only the final, 'root' training op needs to be assigned as
#an attribute to self, since all the rest will be executed
#automatically during training

#To compute the Best Matching Unit given a vector
#Basically calculates the Euclidean distance between every
#neuron's weightage vector and the input, and returns the
#index of the neuron which gives the least value
bmu_index = tf.argmin(tf.sqrt(tf.reduce_sum(
tf.pow(tf.sub(self._weightage_vects, tf.pack(
[self._vect_input for i in range(m*n)])), 2), 1)),
0)

#This will extract the location of the BMU based on the BMU's
#index
np.array([[0, 1]]))
bmu_loc = tf.reshape(tf.slice(self._location_vects, slice_input,
tf.constant(np.array([1, 2]))),
[2])

#To compute the alpha and sigma values based on iteration
#number
learning_rate_op = tf.sub(1.0, tf.div(self._iter_input,
self._n_iterations))
_alpha_op = tf.mul(alpha, learning_rate_op)
_sigma_op = tf.mul(sigma, learning_rate_op)

#Construct the op that will generate a vector with learning
#rates for all neurons, based on iteration number and location
#wrt BMU.
bmu_distance_squares = tf.reduce_sum(tf.pow(tf.sub(
self._location_vects, tf.pack(
[bmu_loc for i in range(m*n)])), 2), 1)
neighbourhood_func = tf.exp(tf.neg(tf.div(tf.cast(
bmu_distance_squares, "float32"), tf.pow(_sigma_op, 2))))
learning_rate_op = tf.mul(_alpha_op, neighbourhood_func)

#Finally, the op that will use learning_rate_op to update
#the weightage vectors of all neurons based on a particular
#input
learning_rate_multiplier = tf.pack([tf.tile(tf.slice(
learning_rate_op, np.array([i]), np.array([1])), [dim])
for i in range(m*n)])
weightage_delta = tf.mul(
learning_rate_multiplier,
tf.sub(tf.pack([self._vect_input for i in range(m*n)]),
self._weightage_vects))
weightage_delta)
self._training_op = tf.assign(self._weightage_vects,
new_weightages_op)

##INITIALIZE SESSION
self._sess = tf.Session()

##INITIALIZE VARIABLES
init_op = tf.initialize_all_variables()
self._sess.run(init_op)

def _neuron_locations(self, m, n):
"""
Yields one by one the 2-D locations of the individual neurons
in the SOM.
"""
#Nested iterations over both dimensions
#to generate all 2-D locations in the map
for i in range(m):
for j in range(n):
yield np.array([i, j])

def train(self, input_vects):
"""
Trains the SOM.
'input_vects' should be an iterable of 1-D NumPy arrays with
dimensionality as provided during initialization of this SOM.
Current weightage vectors for all neurons(initially random) are
taken as starting conditions for training.
"""

#Training iterations
for iter_no in range(self._n_iterations):
#Train with each vector one by one
for input_vect in input_vects:
self._sess.run(self._training_op,
feed_dict={self._vect_input: input_vect,
self._iter_input: iter_no})

#Store a centroid grid for easy retrieval later on
centroid_grid = [[] for i in range(self._m)]
self._weightages = list(self._sess.run(self._weightage_vects))
self._locations = list(self._sess.run(self._location_vects))
for i, loc in enumerate(self._locations):
centroid_grid[loc[0]].append(self._weightages[i])
self._centroid_grid = centroid_grid

self._trained = True

def get_centroids(self):
"""
Returns a list of 'm' lists, with each inner list containing
the 'n' corresponding centroid locations as 1-D NumPy arrays.
"""
if not self._trained:
raise ValueError("SOM not trained yet")
return self._centroid_grid

def map_vects(self, input_vects):
"""
Maps each input vector to the relevant neuron in the SOM
grid.
'input_vects' should be an iterable of 1-D NumPy arrays with
dimensionality as provided during initialization of this SOM.
Returns a list of 1-D NumPy arrays containing (row, column)
info for each input vector(in the same order), corresponding
to mapped neuron.
"""

if not self._trained:
raise ValueError("SOM not trained yet")

to_return = []
for vect in input_vects:
min_index = min([i for i in range(len(self._weightages))],
key=lambda x: np.linalg.norm(vect-
self._weightages[x]))
to_return.append(self._locations[min_index])



A few points about the code:

1) Since my post on K-Means Clustering, I have gotten more comfortable with matrix operations in TensorFlow. You need to be comfortable with matrices if you want to work with TensorFlow (or any data flow infrastructure for that matter, even SciPy). You can code pretty much any logic or operational flow with TensorFlow, you just need to be able to build up complex functionality from basic components(ops), and structure the flow of data(tensors/variables) well.

2) It took quite a while for me to build the whole graph in such a way that the entire training functionality could be enclosed in a single op. This op is called during each iteration, for every vector, during training. Such an implementation is more in line with TensorFlow’s way of doing things, than my previous attempt with clustering.

3) I have used a 2-D grid for the SOM, you can use any geometry you wish. You would just have to modify the  _neuron_locations  method appropriately, and also the method that returns the centroid outputs. You could return a dict that maps neuron location to the corresponding cluster centroid.

4) To keep things simple, I haven’t provided for online training. You could do that by having bounds for the learning rate(s).

Sample Usage

I have used PyMVPA’s example of RGB colours to confirm that the code does work. PyMVPA provides functionality to train SOMs too (along with many other learning techniques).

Here’s how you would do it with my code:


#For plotting the images
from matplotlib import pyplot as plt

#Training inputs for RGBcolors
colors = np.array(
[[0., 0., 0.],
[0., 0., 1.],
[0., 0., 0.5],
[0.125, 0.529, 1.0],
[0.33, 0.4, 0.67],
[0.6, 0.5, 1.0],
[0., 1., 0.],
[1., 0., 0.],
[0., 1., 1.],
[1., 0., 1.],
[1., 1., 0.],
[1., 1., 1.],
[.33, .33, .33],
[.5, .5, .5],
[.66, .66, .66]])
color_names = \
['black', 'blue', 'darkblue', 'skyblue',
'greyblue', 'lilac', 'green', 'red',
'cyan', 'violet', 'yellow', 'white',
'darkgrey', 'mediumgrey', 'lightgrey']

#Train a 20x30 SOM with 400 iterations
som = SOM(20, 30, 3, 400)
som.train(colors)

#Get output grid
image_grid = som.get_centroids()

#Map colours to their closest neurons
mapped = som.map_vects(colors)

#Plot
plt.imshow(image_grid)
plt.title('Color SOM')
for i, m in enumerate(mapped):
plt.text(m[1], m[0], color_names[i], ha='center', va='center',
bbox=dict(facecolor='white', alpha=0.5, lw=0))
plt.show()



Here’s a sample of the output you would get (varies each time you train, but the color names should go to the correct locations in the image):

# Recursively copying elements from one Graph to another in TensorFlow

In my previous post on Google’s TensorFlow, I had mentioned the idea of using the library for Genetic Programming applications. In my free time, I tried to map out how such an application would be run. The focus obviously wasn’t on building the smartest way to do GP, but rather on exploring if it was practically possible. One of the ideas that stuck out, was to have a Graph for each candidate solution – which would be populated based on cross-overed elements from the parent solutions’ Graphs. This would require a good API to copy computational elements from one Graph to another in TensorFlow. I tried digging around to see if such functionality was available, but couldn’t find any (atleast from a good exploring of the github repo).

I wrote some rudimentary code to accomplish this, and heres a basic outline of how it works:

Consider the example given on TensorFlow’s Get Started page.


import tensorflow as tf
import numpy as np

# Make 100 phony data points in NumPy.
x_data = np.float32(np.random.rand(2, 100)) # Random input
y_data = np.dot([0.100, 0.200], x_data) + 0.300

# Construct a linear model.
b = tf.Variable(tf.zeros([1]))
W = tf.Variable(tf.random_uniform([1, 2], -1.0, 1.0))
y = tf.matmul(W, x_data) + b

# Minimize the squared errors.
loss = tf.reduce_mean(tf.square(y - y_data))
train = optimizer.minimize(loss)

# For initializing the variables.
init = tf.initialize_all_variables()

# Launch the graph
sess = tf.Session()
sess.run(init)

# Fit the plane.
for step in xrange(0, 201):
sess.run(train)
if step % 20 == 0:
print step, sess.run(W), sess.run(b)

# Learns best fit is W: [[0.100  0.200]], b: [0.300]



Lets say you want to copy the main training element train to another graph as defined here:


to_graph = tf.Graph()



1) You first decide on a namespace inside which all the copied elements will exist in  to_graph. This is not really required if the Graph you are copying to is empty. But its important to remember that element names matter a lot in TensorFlow’s workings. Therefore, to avoid naming conflicts, its better to define such a namespace. So basically, what would be called “C” in to_graph‘s namespace, would now be called “N/C” where “N” is the namespace String.

Lets assume the namespace we define the copied elements in our example to, is “CopiedOps”.


namespace = "CopiedOps"



2) You then copy all the variables first, one by one, using a dedicated function copy_variable_to_graph. Each time, you supply the original instance, the target graph (to_graph in this case), the namespace, and an extra dictionary called copied_variables. This dictionary is useful while copying the computational nodes (Operation instances, we will call them ops) in the next step. Since Variable instances act as inputs for ops, we need a way to keep track of them for later.

I initially wanted to combine this initialization of variables with the function that copies the computational elements, but I found it really tricky to capture the appropriate Variable instances based on their connections to ops. Anyways, since variables are more like parameter-storing units whose values are needed frequently, its better to initialize them separately.

The Variable instances in the above example are b and W. Heres how you would do it with my code:


copied_variables = {}
b1 = copy_variable_to_graph(b, to_graph, namespace, copied_variables)
W1 = copy_variable_to_graph(W, to_graph, namespace, copied_variables)



Ofcourse, if your code has a lot of variables, you could just store them all in a list and run the above function over all of them with a common dictionary for copied variables.

3) You then recursively copy all the computational nodes (ops, Placeholders) to the other graph. Now heres the nice things about the method- You only need to do it for the topmost node computational node. All connected inputs and Tensors are automatically taken care of!

For the above example, the train object constructed on line 16 is the ‘topmost’ node. So copying the whole learner is as simple as:

train_copy = copy_to_graph(train, to_graph, copied_variables, namespace)


Thats it! The other instances like y, optimizer, loss are automatically replicated in to_graph.

Theres also a helper function in case you want to find the equivalent of an element from the original graph, in to_graph:


loss_copy = get_copied(loss, to_graph, copied_variables, namespace)



4) You can now run the new node in to_graph. Remember to initialize a new Session instance linked to the graph, and initialize all Variables. So heres how you would go about it:


with to_graph.as_default():
init1 = tf.initialize_all_variables()
sess1 = tf.Session()
sess1.run(init1)

for step in xrange(0, 201):
sess1.run(train_copy)
if step % 20 == 0:
print step, sess1.run(W1), sess1.run(b1)



This provides an output similar to what you would get from the Get Started original example.

The Code


import tensorflow as tf
from tensorflow.python.framework import ops
from copy import deepcopy

def copy_variable_to_graph(org_instance, to_graph, namespace,
copied_variables={}):
"""
Copies the Variable instance 'org_instance' into the graph
'to_graph', under the given namespace.
The dict 'copied_variables', if provided, will be updated with
mapping the new variable's name to the instance.
"""

if not isinstance(org_instance, tf.Variable):
raise TypeError(str(org_instance) + " is not a Variable")

#The name of the new variable
if namespace != '':
new_name = (namespace + '/' +
org_instance.name[:org_instance.name.index(':')])
else:
new_name = org_instance.name[:org_instance.name.index(':')]

#Get the collections that the new instance needs to be added to.
#The new collections will also be a part of the given namespace,
#except the special ones required for variable initialization and
#training.
collections = []
for name, collection in org_instance.graph._collections.items():
if org_instance in collection:
if (name == ops.GraphKeys.VARIABLES or
name == ops.GraphKeys.TRAINABLE_VARIABLES or
namespace == ''):
collections.append(name)
else:
collections.append(namespace + '/' + name)

#See if its trainable.
trainable = (org_instance in org_instance.graph.get_collection(
ops.GraphKeys.TRAINABLE_VARIABLES))
#Get the initial value
with org_instance.graph.as_default():
temp_session = tf.Session()
init_value = temp_session.run(org_instance.initialized_value())

#Initialize the new variable
with to_graph.as_default():
new_var = tf.Variable(init_value,
trainable,
name=new_name,
collections=collections,
validate_shape=False)

copied_variables[new_var.name] = new_var

return new_var

def copy_to_graph(org_instance, to_graph, copied_variables={}, namespace=""):
"""
Makes a copy of the Operation/Tensor instance 'org_instance'
for the graph 'to_graph', recursively. Therefore, all required
structures linked to org_instance will be automatically copied.
'copied_variables' should be a dict mapping pertinent copied variable
names to the copied instances.

The new instances are automatically inserted into the given 'namespace'.
If namespace='', it is inserted into the graph's global namespace.
However, to avoid naming conflicts, its better to provide a namespace.
If the instance(s) happens to be a part of collection(s), they are
are added to the appropriate collections in to_graph as well.
For example, for collection 'C' which the instance happens to be a
part of, given a namespace 'N', the new instance will be a part of
'N/C' in to_graph.

Returns the corresponding instance with respect to to_graph.

TODO: Order of insertion into collections is not preserved
"""

#The name of the new instance
if namespace != '':
new_name = namespace + '/' + org_instance.name
else:
new_name = org_instance.name

#If a variable by the new name already exists, return the
#correspondng tensor that will act as an input
if new_name in copied_variables:
copied_variables[new_name].name)

#If an instance of the same name exists, return appropriately
try:
allow_tensor=True,
allow_operation=True)
except:
pass

#Get the collections that the new instance needs to be added to.
#The new collections will also be a part of the given namespace.
collections = []
for name, collection in org_instance.graph._collections.items():
if org_instance in collection:
if namespace == '':
collections.append(name)
else:
collections.append(namespace + '/' + name)

#Take action based on the class of the instance

if isinstance(org_instance, tf.python.framework.ops.Tensor):

#If its a Tensor, it is one of the outputs of the underlying
#op. Therefore, copy the op itself and return the appropriate
#output.
op = org_instance.op
new_op = copy_to_graph(op, to_graph, copied_variables, namespace)
output_index = op.outputs.index(org_instance)
new_tensor = new_op.outputs[output_index]
for collection in collections:

return new_tensor

elif isinstance(org_instance, tf.python.framework.ops.Operation):

op = org_instance

#If it has an original_op parameter, copy it
if op._original_op is not None:
new_original_op = copy_to_graph(op._original_op, to_graph,
copied_variables, namespace)
else:
new_original_op = None

#If it has control inputs, call this function recursively on each.
new_control_inputs = [copy_to_graph(x, to_graph, copied_variables,
namespace)
for x in op.control_inputs]

#If it has inputs, call this function recursively on each.
new_inputs = [copy_to_graph(x, to_graph, copied_variables,
namespace)
for x in op.inputs]

#Make a new node_def based on that of the original.
#An instance of tensorflow.core.framework.graph_pb2.NodeDef, it
#stores String-based info such as name, device and type of the op.
#Unique to every Operation instance.
new_node_def = deepcopy(op._node_def)
#Change the name
new_node_def.name = new_name

#Copy the other inputs needed for initialization
output_types = op._output_types[:]
input_types = op._input_types[:]

#Make a copy of the op_def too.
#Its unique to every _type_ of Operation.
op_def = deepcopy(op._op_def)

#Initialize a new Operation instance
new_op = tf.python.framework.ops.Operation(new_node_def,
to_graph,
new_inputs,
output_types,
new_control_inputs,
input_types,
new_original_op,
op_def)
#Use Graph's hidden methods to add the op
to_graph._record_op_seen_by_control_dependencies(new_op)
for device_function in reversed(to_graph._device_function_stack):
new_op._set_device(device_function(new_op))

return new_op

else:
raise TypeError("Could not copy instance: " + str(org_instance))

def get_copied(original, graph, copied_variables={}, namespace=""):
"""
Get a copy of the instance 'original', present in 'graph', under
the given 'namespace'.
'copied_variables' is a dict mapping pertinent variable names to the
copy instances.
"""

#The name of the copied instance
if namespace != '':
new_name = namespace + '/' + original.name
else:
new_name = original.name

#If a variable by the name already exists, return it
if new_name in copied_variables:
return copied_variables[new_name]

return graph.as_graph_element(new_name, allow_tensor=True,
allow_operation=True)



Working with feeding is pretty simple too:


>>> x = tf.placeholder("float")
>>> a = tf.constant(3, "float")
>>> namespace = "CopiedOps"
>>> to_graph = tf.Graph()
>>> copied_variables = {}
>>> y1 = copy_to_graph(y, to_graph, namespace)
>>> x1 = get_copied(x, to_graph, namespace)
>>> with to_graph.as_default():
sess = tf.Session()
print sess.run(y1, feed_dict={x1: 5})

8.0



I guess thats all for my hacking around with TensorFlow for the week. If you intend to use this code, please note that it may not be perfect at doing what it says. I haven’t tried it out with all sorts of TensorFlow data structures as yet, so be open to getting an Exception or two that you may have to fix. Infact, do drop me a comment or mail so I can make this code as fool-proof as I can. Cheers!