The Nelder-Mead method is a heuristic optimization technique, just like Genetic Algorithms or Particle Swarm Optimization. Optimization in this context refers to the problem of finding point(s) with the optimal value of an objective function in a search space. Optimal could mean a maximum or minimum value.

Whenever possible, we try to solve an optimization problem with an analytical method. For example, if you have the exact definition of $f(x)$ as a differentiable function of $x$, you could use simple derivatives to compute the value of $x$ at which $f$ attains the optimal value. However, in many real-world problems, we don't have such a crystal-clear notion of the function we are optimizing. A simple example would be the value of a stock market portfolio as a function of the stocks (and their weights) in the portfolio, and time. Here, you have no way to define how the value of the overall portfolio is going to behave, and therefore analytical methods don't make sense.

Heuristic optimization techniques try to remedy this by having a degree of randomness to them. This means that the same algorithm run on the same problem with the same initial conditions could yield different results with each run. Heuristic methods explore the search space stochastically, and keep track of the best points they come across along the way. Of course, the behaviour is not totally random, but guided by a set of rules.

The most important thing to remember is that heuristic methods never guarantee that they will find the most optimal solution. They give as an output the best point they came across during their search – which might have been a local optimum for all you know. But in practical scenarios, we usually concern ourselves with finding good solutions anyway – we don't need them to be the best. In fact, finding the best solution with absolute certainty might be impossible or impractical in most cases (think of the stock market portfolio example again).

The Nelder-Mead method uses a geometrical shape called a simplex as its ‘vehicle’ of sorts to search the domain. This is why the technique is also called the Simplex search method. In layman’s terms, a simplex is the n-dimensional version of a ‘triangle’.

For 1 dimension, it's a line. For 2 dimensions, it's a triangle. For 3 dimensions, it's a tetrahedron. And so on…

The shape doesn't have to be symmetrical/equilateral in any way. Note that an n-dimensional simplex has (n + 1) vertices.

The Algorithm

Nelder-Mead starts off with a randomly-generated simplex (we will discuss how later). At every iteration, it proceeds to reshape/move this simplex, one vertex at a time, towards an optimal region in the search space. During each step, it basically tries out one or a few modifications to the current simplex, and chooses the one that shifts it towards a 'better' region of the domain. In an ideal case, the last few iterations of this algorithm would involve the simplex shrinking inwards towards the best point inside it. At the end, the vertex of the simplex that yields the best objective value is returned.

A video of a 2-D Nelder-Mead optimization run gives a good intuitive picture of what happens: the simplex flips and shrinks its way across the plane towards the optimum.

How does it work?

Let's assume we are dealing with an $n$-dimensional space. The current simplex consists of the following $n + 1$ points: $x_1$, $x_2$, …, $x_{n + 1}$. The function we are trying to optimize (maximize) is $f(x)$. The algorithm takes the following steps during every iteration:

I. Ordering

All the points are ordered/sorted, such that the value of $f$ for the first point is the lowest, and that for the last point is the highest. Since we are maximizing, the first point is the worst and the last point is the best. Let the indices of the worst, second-worst and best points be $h$, $s$ and $l$ respectively.
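The ordering step can be sketched in a few lines of Python (the helper name `order_simplex` is made up for illustration; since we maximize, the lowest value of $f$ is worst):

```python
def order_simplex(simplex, f):
    """Sort vertices by objective value: worst (lowest f) first, best last."""
    return sorted(simplex, key=f)

# Example in 2-D: maximize f(x, y) = -(x^2 + y^2), so points near the
# origin are better.
f = lambda p: -(p[0] ** 2 + p[1] ** 2)
simplex = [(1.0, 1.0), (0.0, 0.5), (2.0, 0.0)]
ordered = order_simplex(simplex, f)
# ordered[0] is x_h (worst), ordered[-1] is x_l (best)
```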

II. Computing the Centroid

Consider all points except the worst ($x_h$). Compute their centroid (mean) as: $c = \frac{1}{n} \sum_{i \neq h} x_i$
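A quick sketch of the centroid computation (plain tuples stand in for vectors here):

```python
def centroid(points):
    """Component-wise mean of a list of points (all vertices except x_h)."""
    n = len(points)
    dim = len(points[0])
    return tuple(sum(p[i] for p in points) / n for i in range(dim))

# Centroid of the two non-worst vertices of a 2-D simplex:
c = centroid([(0.0, 0.0), (2.0, 0.0)])  # → (1.0, 0.0)
```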

III. Transformation

This is the most important part of the whole process – transforming the simplex. Transformation takes place in one of the following ways:

i. Reflection

This is the first type of transformation that is tried out. You compute the 'reflected' point as: $x_r = c + \alpha (c - x_h)$

$\alpha$ is called the reflection parameter, and is usually equal to 1. $x_r$ is essentially a point on the line joining $c$ and $x_h$, but located away from $x_h$. This is done in an attempt to move the simplex in a direction that's away from the sub-optimal region around $x_h$. For a 2-D problem, here's what reflection looks like:

Now, if $f(x_s) < f(x_r) \leq f(x_l)$ – that is, $x_r$ is better than the second-worst point but not better than the current best – we just replace $x_h$ with $x_r$ in the simplex, and move on to the next iteration.
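The reflection formula above translates directly to code (a minimal sketch; the helper name `reflect` is my own):

```python
def reflect(c, x_h, alpha=1.0):
    """x_r = c + alpha * (c - x_h): reflect the worst vertex through the centroid."""
    return tuple(ci + alpha * (ci - xi) for ci, xi in zip(c, x_h))

# Reflecting the worst point (3, 2) through the centroid (1, 0):
x_r = reflect((1.0, 0.0), (3.0, 2.0))  # → (-1.0, -2.0)
```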

ii. Expansion

If our luck is great and the reflected point $x_r$ happens to be better than the current best ($f(x_r) > f(x_l)$), we try to explore the possibility further. In other words, we move a little bit more in the direction of $x_r$ from $c$, in the hope of finding an even better solution. The expanded point is defined as: $x_e = c + \gamma (x_r - c)$

$\gamma$ is called the expansion parameter, and its value in most implementations is 2. Here's what 2-D expansion looks like:

We then replace $x_h$ with the better of the two points, $x_e$ and $x_r$, in the simplex. This is called 'greedy optimization' – we have replaced the worst point with the better of the two options we had. There's another variant called 'greedy expansion', which replaces $x_h$ with $x_e$ as long as $f(x_e) > f(x_l)$. This is done irrespective of whether $x_e$ is better than $x_r$. The intuition for this comes from the fact that expansion always leads to a bigger simplex, which means more exploration of the search space.
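The expansion step, again as a minimal sketch (`expand` is an illustrative name, not a library function):

```python
def expand(c, x_r, gamma=2.0):
    """x_e = c + gamma * (x_r - c): push further along the reflection direction."""
    return tuple(ci + gamma * (ri - ci) for ci, ri in zip(c, x_r))

# Expanding past the reflected point (-1, -2) from the centroid (1, 0):
x_e = expand((1.0, 0.0), (-1.0, -2.0))  # → (-3.0, -4.0)
```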

iii. Contraction

Suppose our reflected point is worse than $x_s$ (the second-worst point). In that case, we need to realise that the direction defined by $x_r$ may not be the ideal one for moving. Therefore, we end up contracting our simplex.

The contraction point $x_c$ is defined as: $x_c = c + \beta (x_h - c)$

$\beta$ is called the contraction parameter, and is usually equal to 0.5. In essence, we backtrack from the exploration we attempted with $x_r$, instead contracting the simplex 'inwards'. A 2-D contraction looks like this:

If $f(x_c) > f(x_h)$, which means that the contracted point is better than the current worst, then we replace $x_h$ with $x_c$ in the simplex. However, if this is not the case, then we go to our last resort: the shrink contraction.
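Contraction is the same interpolation pattern with a fractional step (sketch; `contract` is an illustrative name):

```python
def contract(c, x_h, beta=0.5):
    """x_c = c + beta * (x_h - c): step partway back towards the worst vertex."""
    return tuple(ci + beta * (xi - ci) for ci, xi in zip(c, x_h))

# Contracting from the centroid (1, 0) towards the worst point (3, 2):
x_c = contract((1.0, 0.0), (3.0, 2.0))  # → (2.0, 1.0)
```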

iv. Shrink contraction

In this case, we redefine the entire simplex. We keep only the best point ($x_l$), and define all others with respect to it and their previous positions. The $j$th new point is defined as: $x_j = x_l + \delta (x_j - x_l)$

$\delta$ is called the shrinkage parameter, and is equal to 0.5 in most implementations. What we are essentially doing with the above definition is moving each point in the simplex towards the current best point, in the hope of converging onto the best neighbourhood. A 2-D shrinking transformation looks like:

It is important to note that this transformation is the most expensive of all the transformations, since we have to replace multiple (all but one) points in the simplex. However, it has been found empirically that the shrink transformation rarely happens in practice. As a result, many implementations of Nelder-Mead simply skip this step, doing the contraction instead.
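Shrinking applies the same move to every vertex at once (sketch; note that $x_l$ itself maps onto itself, so it is unchanged):

```python
def shrink(simplex, x_l, delta=0.5):
    """Move every vertex a fraction delta of the way towards the best point x_l."""
    return [tuple(li + delta * (xi - li) for li, xi in zip(x_l, x))
            for x in simplex]

# Shrinking a 2-D simplex towards its best vertex (0, 0):
new_simplex = shrink([(3.0, 2.0), (1.0, 0.0), (0.0, 0.0)], (0.0, 0.0))
# → [(1.5, 1.0), (0.5, 0.0), (0.0, 0.0)]
```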

Termination

Termination is usually reached when:

1. A given number of iterations is reached

2. The simplex reaches some minimum ‘size’ limit.

3. The best objective value found so far reaches a satisfactory threshold.

The Initial Simplex

Like in the case of most heuristic techniques, there's no 'best' way to do the random initialization for Simplex search. However, here's one that supposedly works well (it is used in the MATLAB implementation of Nelder-Mead):

The initial point, $x_1$, is provided by the user if they have some idea about a possible good solution. If not, it's taken to be some random point. This is called the 'guess' vertex. Now, remember we have $n$ dimensions, and we need $(n + 1)$ points – out of which 1 has already been decided. The other $n$ points are initialized with respect to $x_1$, at a small distance along the direction of each dimension.

The $(i + 1)$th point $x_{i + 1}$, defined using the $i$th dimension's unit vector $u_i$, is given by: $x_{i + 1} = x_1 + h(x_1, i) \, u_i$

Here, $h(x_1, i) = 0.05$ if the coefficient of $u_i$ in the definition of $x_1$ is non-zero, and $h(x_1, i) = 0.00025$ if that coefficient is zero.
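A sketch of this initialization, assuming the MATLAB convention that the 0.05 step is relative (i.e. a 5% perturbation of the non-zero coordinate) while 0.00025 is an absolute step for zero coordinates:

```python
def initial_simplex(x1):
    """Build n + 1 vertices from the guess vertex x1 by perturbing each axis.

    Assumption: 5% relative step for non-zero coordinates (MATLAB-style),
    absolute step of 0.00025 for zero coordinates.
    """
    simplex = [tuple(x1)]
    for i in range(len(x1)):
        x = list(x1)
        x[i] = x[i] * 1.05 if x[i] != 0 else 0.00025
        simplex.append(tuple(x))
    return simplex

s = initial_simplex((2.0, 0.0))
# The guess vertex plus one perturbed vertex per dimension.
```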

Other Points

1. Nelder-Mead is often used for parameter selection in Machine Learning. In essence, data scientists use the Simplex method to figure out good parameters for their models. This is mainly because the method tends to optimize the objective function fairly fast, and in an efficient way (especially in implementations where shrinking is not used).

2. Even though Nelder-Mead tends to optimize the objective fairly fast (with few iterations), it tends to get stuck in local optima. In such cases, it may remain stuck in a sub-optimal neighbourhood for a long time, and the only solution to this is to restart the algorithm. Why this happens is fairly easy to understand: once the simplex enters a region of local optimum, it will only keep contracting/shrinking, instead of randomly exploring other (far-away) parts of the search space. As a result, it becomes impossible for the algorithm to explore any further.
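Putting all the pieces together, here is a minimal, self-contained sketch of the whole algorithm as described above (maximization, with the greedy-optimization expansion rule; all names are my own, not from any particular library):

```python
def nelder_mead_max(f, simplex, alpha=1.0, gamma=2.0, beta=0.5, delta=0.5,
                    max_iter=500):
    """Maximize f starting from a list of n + 1 vertices (tuples of floats).

    Note: sorts and modifies the caller's simplex list in place.
    """
    def combine(a, b, t):
        # a + t * (b - a), component-wise
        return tuple(ai + t * (bi - ai) for ai, bi in zip(a, b))

    for _ in range(max_iter):
        simplex.sort(key=f)                  # worst (lowest f) first, best last
        x_h, x_s, x_l = simplex[0], simplex[1], simplex[-1]
        rest = simplex[1:]                   # every vertex except the worst
        n = len(rest)
        c = tuple(sum(p[i] for p in rest) / n for i in range(len(x_h)))

        x_r = combine(c, x_h, -alpha)        # reflection: c + alpha * (c - x_h)
        if f(x_s) < f(x_r) <= f(x_l):
            simplex[0] = x_r                 # accept the reflected point
        elif f(x_r) > f(x_l):
            x_e = combine(c, x_r, gamma)     # expansion: c + gamma * (x_r - c)
            simplex[0] = x_e if f(x_e) > f(x_r) else x_r
        else:
            x_c = combine(c, x_h, beta)      # contraction: c + beta * (x_h - c)
            if f(x_c) > f(x_h):
                simplex[0] = x_c
            else:                            # shrink everything towards x_l
                simplex = [combine(x_l, x, delta) for x in simplex]
    simplex.sort(key=f)
    return simplex[-1]

# Example: maximize f(x, y) = -((x - 1)^2 + (y - 2)^2); the optimum is (1, 2).
f = lambda p: -((p[0] - 1.0) ** 2 + (p[1] - 2.0) ** 2)
best = nelder_mead_max(f, [(0.0, 0.0), (2.5, 0.0), (0.0, 2.5)])
```

Note that this sketch uses a fixed iteration budget for simplicity; a practical implementation would also check a simplex-size or objective-value tolerance, as discussed under Termination.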
