The basics of Likelihood

Anyone who has done some course in statistics or data science, must have come across the term ‘likelihood’. In non-technical language, likelihood is synonymous with probability. But ask any mathematician, and their interpretation of the two concepts is significantly different. I went digging into likelihood for a bit this week, so thought of putting down the basics of what I revisited.

Whenever we talk about understanding data, we talk about models. In statistics, a model is usually some sort of parameterized function – like a probability density function (pdf) or a regression model. But the effectiveness of the model’s outputs will only be as good as its fit to the data. If the characteristics of the data are very different from the assumed model, the bias turns out to be pretty high. So from a probability perspective, how do we quantify this fit?

Defining the Likelihood Function

The two entities of interest are – the data X, and the parameters \theta. Now consider a function F(X, \theta), that returns a number proportional to the degree of ‘fit’ between the two – essentially quantifying their relationship with each other.

There are two practical ways you could deal with this function. If you kept \theta constant and varied the data X being analyzed, you would be getting a function F_1(X) whose only argument is the data. The output would basically be a measure of how well your input data satisfies the assumptions made by the model.

But in real life, you rarely know your model with certainty. What you do have, is a bunch of observed data. So shouldn’t we also think of the other way round? Suppose you kept X constant, but tried varying \theta instead. Now, what you have got is a function F_2(\theta), that computes how well different sets of parameters describe your data (which you have for real).

Mind you, in both cases, the underlying mathematical definition is the same. The input ‘variable’ is what has changed. This is how probability and likelihood are related. The function F_1 is what we call the probability function (or pdf for the continuous case), and F_2 is called the likelihood function. While F_1 assumes you know your model and tries to analyze data according to it, F_2 keeps the data in perspective while figuring out how well different sets of parameters describe it.

The above definition might make you think that the likelihood is nothing but a rewording of probability. But keeping the data constant, and varying the parameters has huge consequences on the way you interpret the resultant function.

Lets take a simple example. Consider you have a set C_n of n different coin tosses, where r out of them were Heads, while the others were Tails. Lets say that the coin used for tossing was biased, and the probability of a Heads coming up on it is p. In this case,

F(C_n, p) = {n \choose r} p^r (1-p)^{(n - r)}

Now suppose you made coin yourself, so you know p = 0.7. In that case,

F_1(C_n) = {n \choose r} 0.7^r 0.3^{(n - r)}

On the other hand, lets say you don’t know much about the coin, but you do have a bunch of toss-outcomes from it. You made 10 different tosses, out which 5 were Heads. From this data, you want to measure how likely it is that your guess of p is correct. Then,

F_2(p) = 252 p^5 (1-p)^5

There is a very, very important distinction between probability and likelihood functions – the value of the probability function sums (or integrates, for continuous data) to 1 over all possible values of the input data. However, the value of the likelihood function does not integrate to 1 over all possible combinations of the parameters.

The above statement leads to the second important thing to note: DO NOT interpret the value of a likelihood function, as the probability of the model parameters. If your probability function gave the value of 0.7 (say) for a discrete data point, you could be pretty sure that there would be no other option as likely as it. This is because, the sum of the probabilities of all other point would be equal to 0.3. However, if you got 0.99 as the output of your likelihood function, it wouldn’t necessarily mean that the parameters you gave in are the most likely ones. Some other set of parameters might give 0.999, 0.9999 or something higher.

The only thing you can be sure of, is this: If F_2(\theta_1) >F_2(\theta_2), then it is more likely that \theta_1 denote the parameters of the underlying model.

Log Likelihoods for Maximum Likelihood Estimation

The likelihood function is usually denoted as L(\theta | x) (likelihood L of the parameters \theta given the data point x), so we will stick with it from now on. The most common use of likelihood, is to figure out that set of parameters which yields the highest value for it (and thus describes your dataset the best). This method is called Maximum Likelihood Estimation. You maximize the value of the likelihood function in a bid to find the optimal parameters for your model. This trick is applied in many areas of data science, such as logistic regression.

Maximum Likelihood Estimation usually involves computing the partial derivative of the likelihood function with respect to the parameters. We generally deal with the log-likelihood (basically the logarithm of the likelihood function) rather than likelihood itself. Since log is a monotonically increasing function, the optimum value of the likelihood function can be calculated using derivatives of log-likelihood as well. The reason we use logarithms, is to make the process of dealing with derivatives easy. Consider the coin-toss example I gave above:

Your Likelihood function for the probability of Heads, given the n and r, was:

L(p | r) = {n \choose r} p^r (1-p)^{(n - r)}

Computing log, we get

log(L(p | r)) = log ({n \choose r}) + r log(p) + (n - r) log(1 - p)

To maximise, we will compute the partial derivative with respect to p, and equate to zero.

Using \frac{d(log x)}{dx} = \frac{1}{x} we get,

\frac{r}{p} = \frac{n-r}{1-p}

Solving, we get the intuitive result:

p = \frac{r}{n}

In most cases, when you compute likelihood, you would be dealing with a bunch of independent data points x_1, x_2, ..., x_n, rather than a single one. The likelihood of \theta with respect to the data-set X then gets defined as follows:

L(\theta | X) = L(\theta | x_1) L(\theta | x_2) L(\theta | x_3) ... L(\theta | x_n)

Using log, the overall likelihood becomes:

log(L(\theta | X)) = \sum_x{log(L(\theta | x))}

Nelder-Mead Optimization


The Nelder-Mead method is a heuristic optimization technique, just like Genetic Algorithms or Particle Swarms. Optimization in this context refers to the problem of finding point(s) with the optimal value of an objective function in a search space. Optimal could mean a maximum or minimum value.

Whenever possible, we try to solve an optimization problem with an analytical method. For example, if you have the exact definition of f(x) as a differentiable function of x, you could use simple derivatives to compute the value of x at which f attains the optimal value. However, in many real-world problems, we don’t have such a crystal-clear notion of the function we are optimizing. A simple example would be the value of a stock market portfolio as a function of the stocks (and their weightages) in the portfolio, and time. Here, you have no way to define how the value of the overall portfolio is going to behave, and therefore analytical methods don’t make sense.

Heuristic optimization techniques try to remedy this by having a degree of randomness to them. This means that the same algorithm run with the same problem and the same initial conditions could yield different results with each run. Heuristic methods explore the search space stochastically, and keep a note of the ‘best’/most optimal points they come across on the way. Of course, the behaviour is not totally random, but guided by a set of rules.

The most important thing to remember is that heuristic methods never guarantee that they will find the most optimal solution. They give as an output the best point they came across during their search – which might have been a local optimum for all you know. But in practical scenarios, we usually concern ourselves with finding good solutions anyways- we don’t need them to be the best. In fact, finding the best solution with absolute certainty might be impossible/impractical in most cases (think of the stock market portfolio example again).

The Nelder-Mead method uses a geometrical shape called a simplex as its ‘vehicle’ of sorts to search the domain. This is why the technique is also called the Simplex search method. In layman’s terms, a simplex is the n-dimensional version of a ‘triangle’.

For 1 dimension, its a line.


For 2 dimensions, its a triangle.


For 3 dimensions, its a tetrahedron.


And so on…

The shape doesn’t have to be symmetrical/equilateral in any way. Note that an n-dimensional simplex has (n+1) vertexes.

The Algorithm

Nelder-Mead starts off with a randomly-generated simplex (We will discuss how, later). At every iteration, it proceeds to reshape/move this simplex, one vertex at a time, towards an optimal region in the search space. During each step, it basically tries out one or a few modifications to the current simplex, and chooses one that shifts it towards a ‘better’ region of the domain. In an ideal case, the last few iterations of this algorithm would involve the simplex shrinking inwards towards the best point inside it. At the end, the vertex of the simplex that yields that most optimal objective value, is returned.

The following video of a 2-D Nelder-Mead optimization should give you an intuitive understanding of what happens:

How does it work?

Lets assume we are dealing with an n-dimensional space. The current simplex consists of the following n + 1 points: x_1, x_2, …, x_{(n + 1)}. The function we are trying to optimize(maximize) is f(x). The algorithm takes the following steps during every iteration:

I. Ordering

All the points are ordered/sorted, such that the value of f for the first point is the lowest, and that for the last point is the highest. Let the indices of the first(worst), second(second worst) and last(best) points be h, s, l respectively.

II. Computing the Centroid

Consider all points except the worst (x_h). Compute their centroid(mean) as:

c = \frac{1}{n} \sum_{i \neq h} x_i

III. Transformation

This is the most important part of the whole process – transforming the simplex. Transformation takes place in one of the following ways:

i. Reflection

This is the first type of transformation that is tried out. You compute the ‘reflected’ point as:

x_r = c + \alpha (c - x_h)

\alpha is called the reflection parameter, and is usually equal to 1. x_r is essentially a point on the line joining c and x_h, but located away from x_h. This is done in an attempt to move the simplex in a direction thats away from the sub-optimal region around x_h. For a 2-D problem, heres what reflection looks like:

NelderMead_1Now, if f(x_s) < f(x_r) \leq f(x_l) – that is, that x_r is better than the second-worst point, but not better than the current best, we just replace x_h with x_r in the simplex, and move on to the next iteration.

ii. Expansion

If our luck is great and the reflected point x_r happens to be better than the current best (f(x_r) > f(x_l)), we try to explore the possibility further. In other words, we move a little bit more in the direction of x_r from c, in the hope of finding an even better solution. The expanded point is defined as:

x_e = c + \gamma (x_r - c)

\gamma is called the expansion parameter, and its value in most implementations is 2. Heres what 2-D Expansion looks like:


We then replace x_h with the better of the two points: x_e and x_r in the simplex. This is called ‘greedy optimization‘ – we have replaced the worst point with the better of the two options we had. Theres another variant called ‘greedy expansion‘, which replaces x_h with x_e, as long as f(x_e) > f(x_l). This is done irrespective of whether x_e is better than x_r. The intuition for this, comes from the fact that Expansion always leads to a bigger simplex, which means more exploration of the search space.

iii. Contraction

Suppose our reflection point was worst than x_s (i.e. the second worst point). In that case, we need to realise that the direction defined by x_r may not be the ideal one for moving. Therefore, we end up contracting our simplex.

The contraction point x_c is defined as:

x_c = c + \beta (x_h - c)

\beta is called the contraction parameter, and is usually equal to 0.5. In essence, we try to go against our exploration that we tried with x_r, instead contracting the simplex ‘inwards’. A 2-D contraction looks like this:

NelderMead_3If f(x_c) > f(x_h), which means that the contracted point is better than the current-worst, then we replace x_h with x_c in the simplex. However, if this is not the case, then we go to our last resort: the Shrink contraction.

iv. Shrink contraction

In this case, we redefine the entire simplex. We only keep the best point (x_l), and define all others with respect to it and the previous points. The jth new point, will now be defined as:

x_j = x_l + \delta (x_j - x_l)

\delta is called the shrinkage parameter, and is equal to 0.5 in most implementations. What we are essentially doing with the above definition, is moving each point in the simplex towards the current best point, in the hope of converging onto the best neighbourhood. A 2-D Shrinking transformation looks like:


It is important to note that this transformation is the most expensive of all the transformations, since we have to replace multiple (all but one) points in the simplex. However, it has been found (after multiple experiments), that the shrink transformation rarely happens in practice. As a result, many implementations of Nelder-Mead simply skip this step, doing the Contraction instead.


Termination is usually reached when:

1. A given number of iterations is reached

2. The simplex reaches some minimum ‘size’ limit.

3. The current best solution reaches some favorable limit.

The Initial Simplex

Like in the case of most heuristic techniques, theres no ‘best’ way to do the random initialization for Simplex search. However, heres one that supposedly works well (it is used in the MATLAB implementation of Nelder-Mead):

The initial point, x1, is provided by the user if he has some idea about a possible good solution. If not, its taken to be some random point. This is called the ‘guess‘ vertex. Now, remember we have n dimensions, and we need (n + 1) points – out of which 1 has already been decided. The other end points are initialized with respect to x_1, at a small distance along the direction of each dimension.

The (i + 1)th point x_{(i + 1)}, as defined using the ith dimension unit vector u_i, is given by:

x_{(i + 1)} = x_1 + h(x_1, i) * u_i

h(x_1, i) is defined as:

h(x_1, i) = 0.05 if the coefficient of u_i in the definition of x_1 is non-zero.

h(x_1, i) = 0.00025 if the coefficient of u_i in the definition of x_1 is zero.

Other Points

1. Nelder-Mead is generally used for parameter selection in Machine Learning. In essence, data scientists use the Simplex Method to figure out the optimum parameters for their models. This is mainly because this method tends to optimize the objective function fairly fast, and in an efficient way (especially in implementations where Shrinking is not used).

2. Even though Nelder-Mead tends to optimize the objective fairly fast (with few iterations), it tends to get stuck in local optima. In such cases, it may remain stuck in a sub-optimal neighbourhood for a long time, and the only solution to this is to restart the algorithm. Why this happens is fairly easy to understand. Once the simplex enters a region of local optimum, it will only keep contracting/shrinking, instead of randomly exploring other(far-away) parts of the search space. As a result, its becomes impossible for the algorithm to explore any further.

3. Scipy implementation of Nelder-Mead.


  1. Scholarpedia article
  2. Wikipedia article
  3. Jeff Heaton’s book
  4. This StackOverflow answer

Fuzzy Speciation in Genetic Algorithms using k-d Trees

You would need to know the basics of Genetic Algorithms to understand this post. AI Junkie’s tutorial is a good place to get to upto speed. You would also need to know what k-D Trees are, for which the Wiki article is a good reference.


What is speciation?

Speciation, in simple terms, is an evolutionary process by which new biological species arise. This concept is used in the implementation of evolutionary algorithms (such as GAs) as a form of optimization. To understand speciation, its necessary to understand what we mean by ‘species’.

A species is a group of organisms that – 1) possess some common prominent traits, and 2) can breed/mate with one-another. For example, consider cats and leopards. They share many characteristics, and hence exist in the same ‘family’ of animals. However, since they can’t really mate to provide a cat+leopard offspring, they don’t belong to the same species. The part 2) of the above definition may not be applicable in all cases (such as asexual plants). Nevertheless, it is what is important to us in the context of evolutionary algorithms.

Basically, a genetic algorithm (that implements speciation) will divide the population of candidates into a number of species. Essentially, each species will be a group of solutions that are allowed to crossover with each other. Candidates belonging to different species rarely mate.

Why do we need speciation?

A Genetic Algorithm chiefly deals with two genetic operations: Crossover and Mutation. Mutation works on a single candidate solution by perturbing it in some small way. Crossover, on the other hand, takes two candidate ‘parents’ (selected based on their fitness) and produces offspring(s) that share parts of their genome with the parents. This seems good enough, but it has some serious pitfalls.

Many times, crossover has a pretty high rate of failure (that is, it produces very inferior solutions). The reason for this, in simple terms, is: though both parents may individually have very high levels of fitness, the result of combining random parts of their genome together can be sub-optimal. This tends to happen when the parents come from distant parts of the search space, causing the offspring to lie in some entirely new region that has no optima in the vicinity. For example, imagine you are trying to evolve the fastest mammal. During a crossover, you select a deer (a fast mammal), and a cheetah(the fastest indeed). In this case, even if you could breed their offspring, it may have a random mixture of the anatomical features of its parents thats disastrous from the point of view of speed. In other words, mating fit parents does not guarantee a fit offspring, if the properties of the parents are vastly different.

In cases like these, speciation comes to the rescue. By restricting mating between greatly dissimilar candidates, evolution of sub-par candidates is minimized.

How is speciation implemented?

Usually, the two most common ways of implementing speciation in evolutionary algorithms are: Threshold Speciation and K-Means/Medoids clustering. Both are explained pretty well in Jeff Heaton’s book on Nature-Inspired algorithms in Artificial Intelligence.

The key feature of speciation is the fact that it prevents mating among candidates with highly differing characteristics. This can be thought of as promoting breeding between candidates that lie close to each other in the search-space. To achieve the same effect, I have been using k-d trees to perform nearest-neighbour searches to provide a form of speciation to my implementation of GAs. Its pretty simple(and intuitive) in terms of its approach, doesn’t degrade performance by too much and has been providing good results so far. So, here’s a basic outline of it for you all.

Implementation using k-D Trees

I will outline the algorithm first, and point out the nuances later.

First off, you must have a way of computing the ‘distance’/dissimilarity between candidate solutions. If your genome has a fixed length, you can use Euclidean distance.

Extra attributes required: 1. Starting Temperature (T), and 2. Neighbourhood Value(V). T must be in the range (0. 1). The higher this value, the greater is the tendency to neglect speciation (especially in the earlier iterations). The Neighbourhood Value V denotes the number of nearest candidates that will be considered in searches.


I) The following selection process is repeated in each iteration until the new population size does not reach the required value:

Initial Values: Temperature t = T

i. Pick a random candidate parent based on fitness (using something like Roulette-Wheel selection, or Tournament selection)

ii.a. Pick a random number in (0, 1). If it is lesser than the current value of t, pick another parent randomly using the same selection process used in step i. If not, goto step ii.b.

ii.b. Using the k-D Tree trained over candidates from the current iteration, find the V nearest candidates to the parent selected in step i. Choose one randomly (using the process employed in step i), based on fitness.

iii. Use crossover with parents produced in steps i and ii. Mutate as/when appropriate. Add offspring to the pool of candidates of the next generation.

II) Once the new population grows to the required size, a new k-D Tree is generated using the set of all new candidates for the next iteration.

III) Decrease t for the next iteration.


1) First of all, you need to have a way to compute dissimilarity between candidate solutions (apart from their fitness). As mentioned, if your genome size is constant, you could always use good-old Euclidean distance. If not, you might have to be creative. If you are using Scikit-Learn’s implementation of k-D Trees like me, you could check out the metrics it supports by doing kd_tree.valid_metrics over an instance.

2) Because the selection of the first parent in step i. depends only by the global measure for fitness, parts of the search space that generate good solutions will have higher numbers of candidates in subsequent iterations.

3) Unlike using K-Medoids for speciation, you don’t need to have a strict definition of the species that a given candidate belongs to. Nearest-neighbour searching ensures that a candidate usually mates only with other candidates that are similar to it.

4) As you might have understood, the Temperature parameter only adds some amount of exploratory tendencies to the algorithm. The higher the temperature, the greater is the tendency to ignore speciation. The decrease in t could most likely be done linearly from the user-provided starting point T, to a pre-defined minimum value(like 0.05), across iterations.

Effects on Performance of the GA

As mentioned in the discussion here, it doesn’t make much sense to analytically define the time-complexity of GAs. However, let me give you a rough idea of how things will change. Assume the total size of the population is n. First off, Roulette-Wheel selection (done in step i) takes O(n) time if done in a naive way (going through all candidates in some order one by one). If you do it using the stochastic acceptance method, it takes O(1) time. Usually, this selection will be done twice during every crossover. In case of the algorithm outlined above, for approximately (1-t) fraction of the second crossovers, the Roulette-wheel selection will be replaced by an O(log(n)) search using the k-D Tree and a much smaller (almost O(1)) Roulette Wheel selection amongst the nearest V candidates. Hence, unless your population size is very, very large, this won’t affect time that drastically.

The only significant increase in time will be seen as a result of the k-D Tree learning at the end of each iteration, which would add an O(nlog(n)) time to the time required for generating the next-gen pool of candidates (which is O(n^2) for naive implementations anyways). Since O(n^2 + nlog n) is equivalent to O(n^2), the performance decrease isn’t that significant, provided the size of your population isn’t really high. Even if you were using a highly efficient selection procedure during crossovers, for reasonable population sizes, your total run-time would only be multiplied by a factor less than 2 (mainly because crossover and mutation by themselves take some time too).

Thats all for now! Keep reading and cheers 🙂

Using Circular Self-Organizing Maps to solve the Symmetric TSP


1. Knowing what the Traveling Salesman Problem (TSP) is. The wiki article is a good place to start. The symmetrical form of the problem is where the distance from one city to another is the same in both directions.


2. A decent understanding of what Kohonen/Self-Organizing Maps are. This blog post will point you in the right direction to learn more.

After I wrote my former blog post on implementing SOMs with TensorFlow, someone on Reddit mentioned that there are ways to solve TSPs using them. When I did some research, it appeared that there have been a few attempts to do so. Heres one, and heres another. So in my spare time, I decided to try and have a crack at it myself.

I will describe my approach (along with the code) in a short while, but let me give a disclaimer first. This post in no way claims to have found the ‘best’ way to solve TSPs (ACOs are pretty much the state of the art). It just describes my attempt, and the results I obtained (which were interesting, with respect to the methodology implemented).

I obviously make a few assumptions, and here they are:

1. Every city has a corresponding vector available, denoting its coordinates/location in a 2D/3D space. This source presents many TSP benchmark data-sets in this format. I also tried assigning every city a vector where every dimension is its distance from all the cities(including itself). This gives pretty sub-par results, so I decided not to pursue that. Maybe you could use a dimensionality reduction technique like PCA on these ‘distance’ vectors and get good results.

2. Because of the above point, my method is pretty much restricted to solving the symmetric form of a TSP (atleast the way I have done it).


The Method

Suppose the total number of neurons is N. The total number of cities in question is C. After some trial and error, I set N = 2C. It can be higher/lower as well. The lower the number, the less optimal is the understanding of the solution space. And as you go higher, after a certain point, there seems to be no improvement in the output (but your execution time increases many-fold).

The individual neurons are provided locations around a circle, as follows:

l_i = ( cos(\frac{2i\pi}{N}), sin(\frac{2i\pi}{N}) )

where l_i is the location vector for the i th neuron in a 2-D space. i goes from 0 to N - 1. It is important to not confuse this space with the space in which the cities lie (even though both are 2-dimensional). Lets call one the Neuron-Space, and the other, the City-Space.

Because of the circular nature of this SOM, we inherently have a closed loop-structure to our framework (as is needed by the TSP solution), where neurons in proximity of each other in the Neuron-Space lie near-by in the City-Space. Also, do note how every neuron essentially corresponds to an angle ( \frac{2i\pi}{N}) around the circle. This point would be useful soon.

The initialization of the neuron vectors in City-Space is done by using the mean and standard deviation of the dimensions of the city vectors in City-Space. This is because I observed that the scale of all the dimensions in the city-vectors is pretty much the same. If it isn’t, you might have to resort to building a mean vector and covariance matrix. For training, I use \alpha = 0.3 and \sigma = 0.3 as parameters for the SOM. Putting a constant value of \sigma ensures that the same proportion of neurons around the Best-Matching Unit get affected everytime, irrespective of the total number of neurons being used.

Once the SOM trained, we can say we have the rough ‘skeleton’ for a good route in the City-Space. But how do you convert this skeleton to the appropriate route? By using those angles I just talked about. Consider a city c. Lets say the k nearest neurons to it, are n_1, n_2, ..., n_k.

So, we can construct a vector for every city in Neuron-Space, using this simple formula:

l_c = \sum_k \frac{1}{d_{c, n_k}} l_c

where d_{c, n_k} is the distance between city c and neuron n_k in City-Space. Using this vector and some pretty simple (inverse) trigonometry, you can assign the city an angle around the circle on which the SOM-neurons themselves lie in Neuron-Space. Now all you need to do to come up with an approximately good solution, is just sort the cities with respect to the angles they are assigned. Voila!

Ofcourse, the solution that this algorithm gives you would most likely be in the neighbourhood of a slightly better one, so I use a standard TSP local search heuristic like 3-opt over the solution obtained from my algorithm. What 3-opt essentially does, is change around the solution obtained from some other algorithm over so slightly in some pre-defined ways, in the hope of hitting on a better solution. 3-opt is better than 2-opt for this algorithm. I dint try higher-order k-opt searches.

One interesting possibility about this algorithm, is the fact that it could provide great solutions to solving TSPs for different target cities, without retraining the SOM– as long as they lie in the same City-Space. As long as the set of cities you try to solve for follows (atleast approximately) the trends shown by the dataset used for training, all you would need to do is compute their respective angles (which does not involve re-training), and just apply 3-opt. In short, you could use the same SOM, once trained, to solve for multiple subsets of cities.

The Implementation

This code builds on my Tensorflow-based SOM code. Its pretty rough around the edges, especially the 3-opt search. It was meant just as a proof-of-concept, so feel free to hack around if you like.

import tensorflow as tf
import numpy as np

def _distance(tsp_solution, distance_matrix):
    Total distance for a solution.
    total_distance = 0
    for i in range(len(tsp_solution)-1):
        total_distance += (distance_matrix[tsp_solution[i]][
            tsp_solution[i + 1]])
    total_distance += (distance_matrix[tsp_solution[0]][
    return total_distance

def _3_opt(route, distance_matrix):
    Uses 3-opt to find a better solution in the neighbourhood.

    current_dist = _distance(route, distance_matrix)
    #Find three connections to break
    for i in range(len(route) - 1):
        for j in range(i + 2, len(route) - 1):
            for k in range(j + 2, len(route) - 1):
                #Now reconnect the graph in every other way possible,
                #and see if any gives a better solution
                #Way 1
                temp = route[:]
                temp[j+1:k+1] = reversed(route[j+1:k+1])
                dist = _distance(temp, distance_matrix)
                if dist < current_dist:
                    route = temp
                    current_dist = dist
                #Way 2
                temp = route[:]
                temp[i+1:j+1] = reversed(route[i+1:j+1])
                dist = _distance(temp, distance_matrix)
                if dist < current_dist:
                    route = temp
                    current_dist = dist
                #Way 3
                temp = route[:]
                temp[i+1:j+1], temp[j+1:k+1] = (
                    reversed(route[i+1:j+1]), reversed(route[j+1:k+1]))
                dist = _distance(temp, distance_matrix)
                if dist < current_dist:
                    route = temp
                    current_dist = dist
                #Way 4
                temp = (route[:i+1] + route[j+1:k+1] +
                        route[i+1:j+1] + route[k+1:])
                dist = _distance(temp, distance_matrix)
                if dist < current_dist:
                    route = temp
                    current_dist = dist
                #Way 5
                temp = route[:i+1] + route[j+1:k+1]
                temp += reversed(route[i+1:j+1])
                temp += route[k+1:]
                dist = _distance(temp, distance_matrix)
                if dist < current_dist:
                    route = temp
                    current_dist = dist
                #Way 6
                temp = route[:i+1]
                temp += reversed(route[j+1:k+1])
                temp += route[i+1:j+1]
                temp += route[k+1:]
                dist = _distance(temp, distance_matrix)
                if dist < current_dist:
                    route = temp
                    current_dist = dist
                #Way 7
                temp = route[:i+1]
                temp += reversed(route[j+1:k+1])
                temp += reversed(route[i+1:j+1])
                temp += route[k+1:]
                dist = _distance(temp, distance_matrix)
                if dist < current_dist:
                    route = temp
                    current_dist = dist
    return route

class TSPSolver(object):
    Uses a Circular Self-Organizing Map with a Gaussian Neighbourhood
    function and linearly decreasing learning rate, to solve TSP.
    #To check if the SOM has been trained
    _trained = False
    def __init__(self, input_vects, n_iterations=200):
        Initializes all necessary components of the TensorFlow

        'input_vects' should be a list of 1-D NumPy vectors denoting
        location vectors for the cities.
        'n_iterations' is an integer denoting the number of iterations
        that would be run while training the SOM.
        #Assign required variables first
        self._n = len(input_vects)
        dim = len(input_vects[0])
        self._n_centroids = self._n*3
        alpha = 0.3
        sigma = 0.3
        self._n_iterations = abs(int(n_iterations))

        #For initialization of the centroid(neuron) vectors
        init_values = []
        for vect in input_vects:
        init_mean = np.mean(init_values)
        init_dev = np.sqrt(np.var(init_values))
        self._graph = tf.Graph()
        with self._graph.as_default():
            #Randomly initialized weightage vectors for all neurons,
            #stored together as a matrix Variable of size [self._n, dim]
            self._weightage_vects = tf.Variable(tf.random_normal(
                [self._n_centroids, dim], init_mean, init_dev))
            #Matrix of size [self._n, 2] for SOM grid locations
            #of neurons
            self._location_vects = tf.constant(np.array(
            #We need to assign them as attributes to self, since they
            #will be fed in during training
            #The training vector
            self._vect_input = tf.placeholder("float", [dim])
            #Iteration number
            self._iter_input = tf.placeholder("float")
            #Only the final, 'root' training op needs to be assigned as
            #an attribute to self, since all the rest will be executed
            #automatically during training
            #To compute the Best Matching Unit given a vector
            #Basically calculates the Euclidean distance between every
            #neuron's weightage vector and the input, and returns the
            #index of the neuron which gives the least value
            bmu_index = tf.argmin(tf.sqrt(tf.reduce_sum(
                tf.pow(tf.sub(self._weightage_vects, tf.pack(
                    [self._vect_input for i in range(
                        self._n_centroids)])), 2), 1)),
            #This will extract the location of the BMU based on the BMU's
            slice_input = tf.pad(tf.reshape(bmu_index, [1]),
                                 np.array([[0, 1]]))
            bmu_loc = tf.reshape(tf.slice(self._location_vects, slice_input,
                                          tf.constant(np.array([1, 2]))),
            #To compute the alpha and sigma values based on iteration
            learning_rate_op = tf.sub(1.0, tf.div(self._iter_input,
            _alpha_op = tf.mul(alpha, learning_rate_op)
            _sigma_op = tf.mul(sigma, learning_rate_op)
            #Construct the op that will generate a vector with learning
            #rates for all neurons, based on iteration number and location
            #wrt BMU.
            bmu_distance_squares = tf.reduce_sum(tf.pow(tf.sub(
                self._location_vects, tf.pack(
                    [bmu_loc for i in range(self._n_centroids)])), 2), 1)
            neighbourhood_func = tf.exp(tf.neg(tf.div(tf.cast(
                bmu_distance_squares, "float32"), tf.pow(_sigma_op, 2))))
            learning_rate_op = tf.mul(_alpha_op, neighbourhood_func)
            #Finally, the op that will use learning_rate_op to update
            #the weightage vectors of all neurons based on a particular
            learning_rate_multiplier = tf.pack([tf.tile(tf.slice(
                learning_rate_op, np.array([i]), np.array([1])), [dim])
                                               for i in range(
            weightage_delta = tf.mul(
                tf.sub(tf.pack([self._vect_input for i in range(
            new_weightages_op = tf.add(self._weightage_vects,
            self._training_op = tf.assign(self._weightage_vects,
            self._sess = tf.Session()
            init_op = tf.initialize_all_variables()

            #Train the Solver

    def solution(self):
        Returns the TSP solution order as a list, and the total
        Euclidean distance with respect to the input vectors.
        if not self._trained:
            raise ValueError("SOM not trained yet")
        return self._sol, self._sol_dist

    def _neuron_locations(cls, n_centroids):
        Yields one by one the 2-D locations of the individual neurons
        in the SOM.

        for i in range(n_centroids):
            yield np.array([np.cos(i*2*np.pi/float(n_centroids)),
    def train(self, input_vects):
        Trains the SOM.
        'input_vects' should be a list of 1-D NumPy arrays with
        dimensionality as provided during initialization of this SOM.
        The points in the solution are enumerated as they are ordered
        in 'input_vects'.
        Current weightage vectors for all neurons(initially random) are
        taken as starting conditions for training.
        #Training iterations
        for iter_no in range(self._n_iterations):
            #Train with each vector one by one
            for input_vect in input_vects:
                               feed_dict={self._vect_input: input_vect,
                                          self._iter_input: iter_no})
        self._input_vects = input_vects
        self._trained = True
    def _compute_solution(self):
        Computes the solution to the TSP with respect to the input
        #TODO: All nearest-neighbour searches could be speeded up
        #significantly with a KD-Tree/Ball Tree.

        #Get the centroid data 
        centroid_vects = list(
        centroid_locations = list(

        #Distance matrix mapping input point number to list of distances
        #from centroids
        distances = {}

        for point in range(self._n):
            distances[point] = []
            for centroid in range(self._n_centroids):
                    np.linalg.norm(centroid_vects[centroid] -

        #Distance matrix mapping input point number to list of distances
        #from other input points
        point_distances = {}

        for point in range(self._n):
            point_distances[point] = []
            for other_point in range(self._n):
                    np.linalg.norm(self._input_vects[other_point] -

        #Compute angle with respect to each city(point)
        point_angles = {}
        for point in range(self._n):
            total_vect = 0
            cents = [j for j in range(self._n_centroids)]
            cents.sort(key=lambda x: distances[point][x])
            for centroid in cents[:2]:
                total_vect += (1.0/(distances[point][centroid]) *
            total_vect = total_vect/np.linalg.norm(total_vect)
            if total_vect[0] > 0 and total_vect[1] > 0:
                angle = np.arcsin(total_vect[1])
            elif total_vect[0] < 0 and total_vect[1] > 0:
                angle = np.arccos(total_vect[0])
            elif total_vect[0] < 0 and total_vect[1] < 0:
                angle = np.pi - np.arcsin(total_vect[1])
                angle = 2*np.pi - np.arccos(total_vect[0])
            point_angles[point] = angle

        #Find the rough solution
        tsp_solution = [i for i in range(self._n)]
        tsp_solution.sort(key=lambda x: point_angles[x])

        tsp_solution = _3_opt(tsp_solution, point_distances)

        #Compute the total distance for the solution
        total_distance = _distance(tsp_solution, point_distances)

        self._sol = tsp_solution
        self._sol_dist = total_distance


Each of these problem datasets can be found at the source here. All the presented results are over 10 runs. The max runtime was experienced with krA100, around one minute.

Berlin-52: Best Error: 0.0%, Average Error: 4.1%.

krA100: Best Error: 1.6%, Average Error: 2.7%.

eli76: Best Error: 2%, Average Error: 4.4%.

These results aren’t ‘close’ to those provided by ACOs, but they are practically feasible, and show a good promise (Oh now I just sound like I am writing a research paper.) Plus, the algorithm isn’t a time-hogger either (judging by runtimes on my laptop).

Anyways. Thanks for reading! Do drop a comment if you think of an improvement over this work.