# Fuzzy Speciation in Genetic Algorithms using k-d Trees

You would need to know the basics of Genetic Algorithms to understand this post. AI Junkie’s tutorial is a good place to get to upto speed. You would also need to know what k-D Trees are, for which the Wiki article is a good reference.

What is speciation?

Speciation, in simple terms, is an evolutionary process by which new biological species arise. This concept is used in the implementation of evolutionary algorithms (such as GAs) as a form of optimization. To understand speciation, its necessary to understand what we mean by ‘species’.

A species is a group of organisms that – 1) possess some common prominent traits, and 2) can breed/mate with one-another. For example, consider cats and leopards. They share many characteristics, and hence exist in the same ‘family’ of animals. However, since they can’t really mate to provide a cat+leopard offspring, they don’t belong to the same species. The part 2) of the above definition may not be applicable in all cases (such as asexual plants). Nevertheless, it is what is important to us in the context of evolutionary algorithms.

Basically, a genetic algorithm (that implements speciation) will divide the population of candidates into a number of species. Essentially, each species will be a group of solutions that are allowed to crossover with each other. Candidates belonging to different species rarely mate.

Why do we need speciation?

A Genetic Algorithm chiefly deals with two genetic operations: Crossover and Mutation. Mutation works on a single candidate solution by perturbing it in some small way. Crossover, on the other hand, takes two candidate ‘parents’ (selected based on their fitness) and produces offspring(s) that share parts of their genome with the parents. This seems good enough, but it has some serious pitfalls.

Many times, crossover has a pretty high rate of failure (that is, it produces very inferior solutions). The reason for this, in simple terms, is: though both parents may individually have very high levels of fitness, the result of combining random parts of their genome together can be sub-optimal. This tends to happen when the parents come from distant parts of the search space, causing the offspring to lie in some entirely new region that has no optima in the vicinity. For example, imagine you are trying to evolve the fastest mammal. During a crossover, you select a deer (a fast mammal), and a cheetah(the fastest indeed). In this case, even if you could breed their offspring, it may have a random mixture of the anatomical features of its parents thats disastrous from the point of view of speed. In other words, mating fit parents does not guarantee a fit offspring, if the properties of the parents are vastly different.

In cases like these, speciation comes to the rescue. By restricting mating between greatly dissimilar candidates, evolution of sub-par candidates is minimized.

How is speciation implemented?

Usually, the two most common ways of implementing speciation in evolutionary algorithms are: Threshold Speciation and K-Means/Medoids clustering. Both are explained pretty well in Jeff Heaton’s book on Nature-Inspired algorithms in Artificial Intelligence.

The key feature of speciation is the fact that it prevents mating among candidates with highly differing characteristics. This can be thought of as promoting breeding between candidates that lie close to each other in the search-space. To achieve the same effect, I have been using k-d trees to perform nearest-neighbour searches to provide a form of speciation to my implementation of GAs. Its pretty simple(and intuitive) in terms of its approach, doesn’t degrade performance by too much and has been providing good results so far. So, here’s a basic outline of it for you all.

Implementation using k-D Trees

I will outline the algorithm first, and point out the nuances later.

First off, you must have a way of computing the ‘distance’/dissimilarity between candidate solutions. If your genome has a fixed length, you can use Euclidean distance.

Extra attributes required: 1. Starting Temperature (T), and 2. Neighbourhood Value(V). T must be in the range (0. 1). The higher this value, the greater is the tendency to neglect speciation (especially in the earlier iterations). The Neighbourhood Value V denotes the number of nearest candidates that will be considered in searches.

Algorithm:

I) The following selection process is repeated in each iteration until the new population size does not reach the required value:

Initial Values: Temperature t = T

i. Pick a random candidate parent based on fitness (using something like Roulette-Wheel selection, or Tournament selection)

ii.a. Pick a random number in (0, 1). If it is lesser than the current value of t, pick another parent randomly using the same selection process used in step i. If not, goto step ii.b.

ii.b. Using the k-D Tree trained over candidates from the current iteration, find the V nearest candidates to the parent selected in step i. Choose one randomly (using the process employed in step i), based on fitness.

iii. Use crossover with parents produced in steps i and ii. Mutate as/when appropriate. Add offspring to the pool of candidates of the next generation.

II) Once the new population grows to the required size, a new k-D Tree is generated using the set of all new candidates for the next iteration.

III) Decrease t for the next iteration.

Pointers

1) First of all, you need to have a way to compute dissimilarity between candidate solutions (apart from their fitness). As mentioned, if your genome size is constant, you could always use good-old Euclidean distance. If not, you might have to be creative. If you are using Scikit-Learn’s implementation of k-D Trees like me, you could check out the metrics it supports by doing  kd_tree.valid_metrics  over an instance.

2) Because the selection of the first parent in step i. depends only by the global measure for fitness, parts of the search space that generate good solutions will have higher numbers of candidates in subsequent iterations.

3) Unlike using K-Medoids for speciation, you don’t need to have a strict definition of the species that a given candidate belongs to. Nearest-neighbour searching ensures that a candidate usually mates only with other candidates that are similar to it.

4) As you might have understood, the Temperature parameter only adds some amount of exploratory tendencies to the algorithm. The higher the temperature, the greater is the tendency to ignore speciation. The decrease in t could most likely be done linearly from the user-provided starting point T, to a pre-defined minimum value(like 0.05), across iterations.

Effects on Performance of the GA

As mentioned in the discussion here, it doesn’t make much sense to analytically define the time-complexity of GAs. However, let me give you a rough idea of how things will change. Assume the total size of the population is $n$. First off, Roulette-Wheel selection (done in step i) takes $O(n)$ time if done in a naive way (going through all candidates in some order one by one). If you do it using the stochastic acceptance method, it takes $O(1)$ time. Usually, this selection will be done twice during every crossover. In case of the algorithm outlined above, for approximately $(1-t)$ fraction of the second crossovers, the Roulette-wheel selection will be replaced by an $O(log(n))$ search using the k-D Tree and a much smaller (almost $O(1)$) Roulette Wheel selection amongst the nearest V candidates. Hence, unless your population size is very, very large, this won’t affect time that drastically.

The only significant increase in time will be seen as a result of the k-D Tree learning at the end of each iteration, which would add an $O(nlog(n))$ time to the time required for generating the next-gen pool of candidates (which is $O(n^2)$ for naive implementations anyways). Since $O(n^2 + nlog n)$ is equivalent to $O(n^2)$, the performance decrease isn’t that significant, provided the size of your population isn’t really high. Even if you were using a highly efficient selection procedure during crossovers, for reasonable population sizes, your total run-time would only be multiplied by a factor less than 2 (mainly because crossover and mutation by themselves take some time too).

Thats all for now! Keep reading and cheers 🙂

# Linear and Quadratic Discriminant Analysis for ML / statistics newbies

(Note: This post assumes that the reader has knowledge of basic statistics and terms used in machine learning. Inspite of that, I have provided links whereever necessary 🙂 ).

Linear Discriminant Analysis(LDA) and Quadratic Discriminant Analysis(QDA) are types of Bayesian classifiers.

A Bayesian classifier, in mathematical terms, does the following-

$C^{bayesian}(x) = argmax_{r \in \{1, 2, ..., k\}} Pr(Y = r | X = x)$

What does this mean? To put it in the form of steps, heres what happens-

1. The classifier is given an input($X$) that is the feature vector $x$. $x$ consists of $p$ different predictors, one for each input dimension.

2. Then, with respect to each possible output class($Y$) $r$ from $1, 2, ..., k$, the classifier computes $Pr(Y = r | X = x)$ – this is the probability that the actual output is $r$, given $x$ as the input.

3. The class (i.e. that value of $r$) that returns the maximum value for the above mentioned probability, is given as the output of the classifier – $C^{bayesian}(x)$.

In essence, a Bayesian classifier tries to minimize the probability of misclassification (which would be equal to $1 - Pr(Y = r | X = x)$).

But then the most natural question is- How is this probability computed? It is precisely at this point where all Bayesian classifiers differ. Each one of them uses a different technique to understand the data, and compute these crucial quantities. In fact, even a simple nearest-neighbors classifier can be used as a Bayesian classifier, with $Pr(Y = r | X = x)$ being the fraction of the nearest neighbors to $x$  that belong to class $r$. LDA and QDA, on the other hand, use a more mathematical approach.

Anyone acquainted with basic statistics knows Bayes theorem. Given two possible events $A$ and $B$, it states-

$Pr(A|B) = \frac{Pr(A)Pr(B|A)}{Pr(B)}$

Applying this to $Pr(Y = r | X = x)$, we would get

$Pr(Y = r | X = x) = \frac{Pr(Y = r)Pr(X = x | Y = r)}{Pr(X = x)}$.

Now, look at the first part in the above numerator- $Pr(Y = r)$. This is essentially the probability of the class of a random input being $r$ – prior to even looking at what the input is! Thats why its called the priori probability. During the training stage itself, it is computed as the fraction of the training samples that belonged to class $r$. As you might have understood by now, this number is independent of the what $x$ is, and reflects the trends seen in the training set.

Now lets consider the second part – $Pr(X = x | Y = r)$. This number is conceptually equivalent to answering the question- Given that a random data point(not necessarily in the training data) was being selected from class $r$, what would be the likelihood that it looked like $x$? To quantify this ‘likelihood’, LDA and QDA use a Multivariate Gaussian Distribution model for each class. Mathematically put, the probability density function for a multivariate Gaussian distribution is-

$f_{\boldsymbol\Sigma, \boldsymbol\mu}(x) = \frac{1}{\sqrt{(2\pi)^k|\boldsymbol\Sigma|}}\exp(-\frac{1}{2}({ x}-{\boldsymbol\mu})^\mathrm{T}{\boldsymbol\Sigma}^{-1}({x}-{\boldsymbol\mu}))$.

You definitely don’t need to remember this huge-ass formula. To put it simply, heres the intuition behind it- A probability distribution model is a way for an algorithm to understand how data points are distributed in a $p$-dimensional space. In the case of the multivariate Gaussian(also called Normal) distribution, the two parameters $\boldsymbol\mu$ (the $p$-dimensional mean vector) and $\boldsymbol\Sigma$ (the $p$x$p$-dimensional covariance matrix) are used to ‘summarize’ how the data ‘looks’ in the conceptual space. These are obviously learnt during training.

While $\boldsymbol\mu$ denotes the approximate average position, $\boldsymbol\Sigma$ summarizes the ‘shape’ of the distribution of data points. (Imagine drawing a random cluster of points on a graph paper. Now draw an ellipse-like shape over that region where most of the points are concentrated. The ‘central’ point of the ellipse will be a 2-dimensional vector $\boldsymbol\mu$, and the ‘shape’ would be approximated by a 2×2 matrix given by $\boldsymbol\Sigma$.) The corresponding probability density function(pdf) for a random $x$ in space, is proportional to how likely it is that a point at that location would be selected randomly from the distribution. The pdf is to probability, what pressure is to force. Pressure isn’t exactly force itself, but it shows how intense the latter is at a particular location- and the higher the pressure, the greater is the likelihood that the force will show its effect at that point.

Thus, LDA and QDA use the aforementioned $f(x)$ to quantify $Pr(X = x | Y = r)$. What about the denominator then – $Pr(X = x)$? It is nothing but the sum of the values of $Pr(Y = r)Pr(X = x | Y = r)$ over all possible $r$s. This is given by the Total Probability Theorem.

But then where do LDA and QDA differ? They differ in only one subtle point- While LDA assumes the same $\boldsymbol\Sigma$ for all classes, QDA computes a different $\boldsymbol\Sigma$ for each class. Essentially, LDA computes a separate $\boldsymbol\mu$ for each class(using training points that belonged to it), but $\boldsymbol\Sigma$ is computed using the entire training data. And this common covariance matrix is used in the computations corresponding to every class. QDA, on the other hand, computes a separate $\boldsymbol\Sigma$ and $\boldsymbol\mu$ for each possible output class.

We will now tackle the two most important question that might be plaguing your mind-

1. Why is LDA ‘linear’, and QDA ‘quadratic’?

The answer lies in the probability function. Consider the simplest case of one input variable only. The answer can be generalized to multiple dimensions using matrix algebra.

$Pr(Y = r | X = x) = \frac{Pr(Y = r)Pr(X = x | Y = r)}{Pr(X = x)}$

We are maximizing over all possible $r$s right? Intuitively, that should tell you that the denominator in the above equation, doesn’t really matter in comparisons- since its the same for all values of $r$. So we are now down to finding that $r$ which gives the maximum value for

$Pr(Y = r)Pr(X = x | Y = r)$

Let $Pr(Y = r) = \beta_{r}$ (to reduce my LaTeX efforts). Using the formula for 1-D Normal distribution density function, we have

$Pr(X = x | Y = r) = \frac{1}{\sigma_r\sqrt{2\pi}}e^\frac{(x - \mu_r)^2}{\sigma_r^2}$

Multiplying the two and taking a natural logarithm of the resultant expression, we are now down to finding that $r$ that gives the maximum value for

$log(\beta_{r}/\sigma_r) - \frac{x^2}{2\sigma_r^2} - \frac{\mu_r^2}{2\sigma_r^2} + \frac{x\mu_r}{\sigma_r^2}$

Observe the above quadratic(in x) expression carefully. Once again, I have omitted those terms that are constants or which remain the same irrespective of $r$(NOT $x$), which is the focus of the whole process. Or have I? Consider the second term. IF I am talking of the LDA classifier, would $\sigma_r$ be dependent on $r$ at all? It wouldn’t, since LDA assumes the same variance/covariance for all classes. As a result, the second term could no longer be something worth taking into the comparison process. Incidentally, that very term is what makes the above expression quadratic. Otherwise, its linear!

And thats (more or less) why LDA is ‘linear’ and QDA is ‘quadratic’ 🙂 .

2. Why would we ever choose LDA? Isn’t QDA conceptually more flexible?

QDA isn’t always better. In a certain way(too mathematically convoluted for this post), LDA does learn a linear boundary between the points belonging to different classes. QDA, on the other hand, learns a quadratic one. If the actual boundaries are indeed linear, QDA may have a higher model bias. – especially if the available data isn’t dense enough. Another(and easier to imagine) scenario is when you have a very limited set of training points. In such a case, if you divide your already sparse dataset into the constituent classes, the covariance matrix computed for each might be extremely inaccurate. In such cases, its better to simplify the entire process and use a common covariance matrix thats computed from the entire dataset we use while training.

# Best Practices When Starting And Working On A Data Science Project

Pretty straightforward tips for someone learning to build intelligent systems.

Several interesting questions were asked recently on Data Science Central. This post addresses the question

“What best practices do you recommend, when starting and working on enterprise analytics projects?”

I have worked as a Data Scientist for 8 years now. This was after completing a PhD on “Design of Experiments for Tuning Optimisation Algorithms”. So I have a formal background in rigorous experiment design for Data Science and have also managed some pretty complex and fast paced projects in sectors including Financial Services, IT, Insurance, Government and Audit.

This post summarises my thoughts on best practice that are heavily based on practical experience as described in my book “Guerrilla Analytics: A Practical Approach to Working with Data”. The book contains almost 100 best practice tips for doing Data Science in dynamic projects where reproducibillty, explainability and team efficiency are critical.

Here is a summary of the best practices for working…

View original post 2,399 more words