A (small) introduction to Boosting

What is Boosting?

Boosting is a machine learning meta-algorithm that aims to iteratively build an ensemble of weak learners, in an attempt to generate a strong overall model.

Lets look at the highlighted parts one-by-one:

1. Weak Learners: A ‘weak learner’ is any ML algorithm (for regression/classification) that provides an accuracy slightly better than random guessing. For example, consider a problem of binary classification with approximately 50% of samples belonging to each class. Random guessing in this case would yield an accuracy of around 50%. So a weak learner would be any algorithm, however simple, that slightly improves this score – say 51-55% or more. Usually, weak learners are pretty basic in nature. Some examples are Decision Stumps or smaller Decision Trees. In most cases, either one of these two is used as a weak learner in Boosting frameworks.

2. Ensemble: The overall model built by Boosting is a weighted sum of all of the weak learners. The weights and training given to each ensures that the overall model yields a pretty high accuracy (sometimes state-of-the-art).

3. Iteratively build: Many ensemble methods such as bagging/random forests are capable of training their components in parallel. This is because the training of any of those weak learners does not depend on the training of the others. That is not the case with Boosting. At each step, Boosting tries to evaluate the shortcomings of the overall model built till the previous step, and then generates a weak learner to battle these shortcomings. This weak learner is then added to the total model. Therefore, the training must necessarily proceed in a serial/iterative manner.

4. Meta-algorithm: Since Boosting isn’t necessarily an ML algorithm by itself, but rather uses other (basic) algorithms to build a stronger one, it is said to be a ‘meta’ algorithm.

How does Boosting work?

boosting_new_fit5

Usually, a Boosting framework for regression works as follows:

(The overall model at step ‘i’ is denoted by F_i.)

1. Start with a simple model F_0, usually predicting a common (average) value for all instances.

2. For i from 1 to M do:

2.a) Evaluate the shortcomings of F_{i - 1}, in terms of a value r_j for each element x_j in the training set.
2.b) Generate a new weak learner h_i(x) based on the r_js.
2.c) Compute the corresponding weight \gamma_i.
2.d) Update the current model as follows:

F_i = F_{i - 1} + \nu \gamma_i h_i(x)

where \nu denotes the learning rate of the framework.

3. Return F_M as the overall output.

Each of the iterations in Boosting essentially tries to ‘improve’ the current model by introducing another learner into the ensemble. Having such an ensemble not only reduces the bias (which is generally pretty high for weak learners), but also the variance (since multiple learners contribute to the overall output, each with their own unique training). There are various versions of Boosting, and all of them primarily differ in the way they perform steps 2.a, 2.b. and 2.c.

For example, Gradient Boosting computes r_j as the gradient of a Loss function (whose expression involves target output y and model output till previous step, F_{i - 1}) with respect to F_{i - 1} at data point x_j. The ith weak learner h_i is then trained to predict r_j from x_j. The weight \gamma_i is then optimized so as to minimize the total Loss value with respect to F_i.

AdaBoost, short for Adaptive Boosting, works by assigning r_j as a ‘weight’ to x_j based on how miscalculated the model output is according to F_{i - 1}. h_i and \gamma_i are then trained to optimize the weighted sum of errors with respect to the output from h_i. AdaBoost is considered to be one of the best out-of-the-box classifiers today. Mathematically speaking, AdaBoost is a special case of Gradient Boosting.

The weak-learners used in Boosting are indeed ‘weak’. The most commonly used ones are decision trees of a fixed number of leaf nodes. Using complex methods such as SVMs in Boosting usually leads to overfitting. However, the total number of weak learners M should also be controlled, mostly based on experimentation. Having a low M will cause underfitting, but a very high value could lead to overfitting.

To improve the bias/variance characteristics of Boosting, bagging is generally employed. What this essentially does is train each weak learner on a subset of the overall dataset (rather than the whole training set). This causes a good decrease in variance and tends to reduce overfitting.

Classification applications would work slightly different from regression, of course. The most commonly used method is to build a boosted model for each class. This model would predict the probability of a data point belonging to that class. During training, the inputs belonging to the corresponding class will be given a target of 1, with the others getting a target of 0. The outputs from all the models could then be put into a softmax function for normalization.

What are the drawbacks of Boosting?

The most obvious drawback of boosting is the computational complexity. Performing so many iterations, and generating a new model at each, requires a lot of computations and time (and also space). The fact that the ensemble has to be built iteratively doesn’t help matters. However, using simple weak learners (instead of big decision trees) helps to mitigate this problem.

The biggest criticism for Boosting comes from its sensitivity to noisy data. Think of it this way. At each iteration, Boosting tries to improve the output for data points that haven’t been predicted well till now. If your dataset happens to have some (or a lot) of misclassified/outlier points, the meta-algorithm will try very hard to fit subsequent weak learners to these noisy samples. As a result, overfitting is likely to occur. The exponential loss function used in AdaBoost is particularly vulnerable to this issue (since the error from an outlier will get an exponentially-weighed importance for future learners).

Implementation in Scikit-Learn

  1. AdaBoost implementation
  2. Gradient Tree Boosting implementation
Advertisements

Linear and Quadratic Discriminant Analysis for ML / statistics newbies

(Note: This post assumes that the reader has knowledge of basic statistics and terms used in machine learning. Inspite of that, I have provided links whereever necessary 🙂 ).

Linear Discriminant Analysis(LDA) and Quadratic Discriminant Analysis(QDA) are types of Bayesian classifiers.

A Bayesian classifier, in mathematical terms, does the following-

C^{bayesian}(x) = argmax_{r \in \{1, 2, ..., k\}} Pr(Y = r | X = x)

What does this mean? To put it in the form of steps, heres what happens-

1. The classifier is given an input(X) that is the feature vector x. x consists of p different predictors, one for each input dimension.

2. Then, with respect to each possible output class(Y) r from 1, 2, ..., k, the classifier computes Pr(Y = r | X = x) – this is the probability that the actual output is r, given x as the input.

3. The class (i.e. that value of r) that returns the maximum value for the above mentioned probability, is given as the output of the classifier – C^{bayesian}(x).

In essence, a Bayesian classifier tries to minimize the probability of misclassification (which would be equal to 1 - Pr(Y = r | X = x)).

But then the most natural question is- How is this probability computed? It is precisely at this point where all Bayesian classifiers differ. Each one of them uses a different technique to understand the data, and compute these crucial quantities. In fact, even a simple nearest-neighbors classifier can be used as a Bayesian classifier, with Pr(Y = r | X = x) being the fraction of the nearest neighbors to x  that belong to class r. LDA and QDA, on the other hand, use a more mathematical approach.

Anyone acquainted with basic statistics knows Bayes theorem. Given two possible events A and B, it states-

Pr(A|B) = \frac{Pr(A)Pr(B|A)}{Pr(B)}

Applying this to Pr(Y = r | X = x), we would get

Pr(Y = r | X = x) = \frac{Pr(Y = r)Pr(X = x | Y = r)}{Pr(X = x)}.

Now, look at the first part in the above numerator- Pr(Y = r). This is essentially the probability of the class of a random input being r – prior to even looking at what the input is! Thats why its called the priori probability. During the training stage itself, it is computed as the fraction of the training samples that belonged to class r. As you might have understood by now, this number is independent of the what x is, and reflects the trends seen in the training set.

Now lets consider the second part – Pr(X = x | Y = r). This number is conceptually equivalent to answering the question- Given that a random data point(not necessarily in the training data) was being selected from class r, what would be the likelihood that it looked like x? To quantify this ‘likelihood’, LDA and QDA use a Multivariate Gaussian Distribution model for each class. Mathematically put, the probability density function for a multivariate Gaussian distribution is-

f_{\boldsymbol\Sigma, \boldsymbol\mu}(x) = \frac{1}{\sqrt{(2\pi)^k|\boldsymbol\Sigma|}}\exp(-\frac{1}{2}({ x}-{\boldsymbol\mu})^\mathrm{T}{\boldsymbol\Sigma}^{-1}({x}-{\boldsymbol\mu})).

You definitely don’t need to remember this huge-ass formula. To put it simply, heres the intuition behind it- A probability distribution model is a way for an algorithm to understand how data points are distributed in a p-dimensional space. In the case of the multivariate Gaussian(also called Normal) distribution, the two parameters \boldsymbol\mu (the p-dimensional mean vector) and \boldsymbol\Sigma (the pxp-dimensional covariance matrix) are used to ‘summarize’ how the data ‘looks’ in the conceptual space. These are obviously learnt during training.

While \boldsymbol\mu denotes the approximate average position, \boldsymbol\Sigma summarizes the ‘shape’ of the distribution of data points. (Imagine drawing a random cluster of points on a graph paper. Now draw an ellipse-like shape over that region where most of the points are concentrated. The ‘central’ point of the ellipse will be a 2-dimensional vector \boldsymbol\mu, and the ‘shape’ would be approximated by a 2×2 matrix given by \boldsymbol\Sigma.) The corresponding probability density function(pdf) for a random x in space, is proportional to how likely it is that a point at that location would be selected randomly from the distribution. The pdf is to probability, what pressure is to force. Pressure isn’t exactly force itself, but it shows how intense the latter is at a particular location- and the higher the pressure, the greater is the likelihood that the force will show its effect at that point.

Thus, LDA and QDA use the aforementioned f(x) to quantify Pr(X = x | Y = r). What about the denominator then – Pr(X = x)? It is nothing but the sum of the values of Pr(Y = r)Pr(X = x | Y = r) over all possible rs. This is given by the Total Probability Theorem.

But then where do LDA and QDA differ? They differ in only one subtle point- While LDA assumes the same \boldsymbol\Sigma for all classes, QDA computes a different \boldsymbol\Sigma for each class. Essentially, LDA computes a separate \boldsymbol\mu for each class(using training points that belonged to it), but \boldsymbol\Sigma is computed using the entire training data. And this common covariance matrix is used in the computations corresponding to every class. QDA, on the other hand, computes a separate \boldsymbol\Sigma and \boldsymbol\mu for each possible output class.

We will now tackle the two most important question that might be plaguing your mind-

1. Why is LDA ‘linear’, and QDA ‘quadratic’?

The answer lies in the probability function. Consider the simplest case of one input variable only. The answer can be generalized to multiple dimensions using matrix algebra.

Pr(Y = r | X = x) = \frac{Pr(Y = r)Pr(X = x | Y = r)}{Pr(X = x)}

We are maximizing over all possible rs right? Intuitively, that should tell you that the denominator in the above equation, doesn’t really matter in comparisons- since its the same for all values of r. So we are now down to finding that r which gives the maximum value for

Pr(Y = r)Pr(X = x | Y = r)

Let Pr(Y = r) = \beta_{r} (to reduce my LaTeX efforts). Using the formula for 1-D Normal distribution density function, we have

Pr(X = x | Y = r) = \frac{1}{\sigma_r\sqrt{2\pi}}e^\frac{(x - \mu_r)^2}{\sigma_r^2}

Multiplying the two and taking a natural logarithm of the resultant expression, we are now down to finding that r that gives the maximum value for

log(\beta_{r}/\sigma_r) - \frac{x^2}{2\sigma_r^2} - \frac{\mu_r^2}{2\sigma_r^2} + \frac{x\mu_r}{\sigma_r^2}

Observe the above quadratic(in x) expression carefully. Once again, I have omitted those terms that are constants or which remain the same irrespective of r(NOT x), which is the focus of the whole process. Or have I? Consider the second term. IF I am talking of the LDA classifier, would \sigma_r be dependent on r at all? It wouldn’t, since LDA assumes the same variance/covariance for all classes. As a result, the second term could no longer be something worth taking into the comparison process. Incidentally, that very term is what makes the above expression quadratic. Otherwise, its linear!

And thats (more or less) why LDA is ‘linear’ and QDA is ‘quadratic’ 🙂 .

2. Why would we ever choose LDA? Isn’t QDA conceptually more flexible?

QDA isn’t always better. In a certain way(too mathematically convoluted for this post), LDA does learn a linear boundary between the points belonging to different classes. QDA, on the other hand, learns a quadratic one. If the actual boundaries are indeed linear, QDA may have a higher model bias. – especially if the available data isn’t dense enough. Another(and easier to imagine) scenario is when you have a very limited set of training points. In such a case, if you divide your already sparse dataset into the constituent classes, the covariance matrix computed for each might be extremely inaccurate. In such cases, its better to simplify the entire process and use a common covariance matrix thats computed from the entire dataset we use while training.

Logistic Regression (for dummies)

(Note: This is a post attempting to explain the intuition behind Logistic Regression to readers NOT well acquainted with statistics. Therefore, you may not find any rigorous mathematical work in here.)

Logistic Regression is a type of classification algorithm involving a linear discriminant. What do I mean by that?

1. Unlike actual regression, logistic regression does not try to predict the value of a numeric variable given a set of inputs. Instead, the output is a probability that the given input point belongs to a certain class. For simplicity, lets assume that we have only two classes(for multiclass problems, you can look at Multinomial Logistic Regression), and the probability in question is P_+ -> the probability that a certain data point belongs to the ‘+‘ class. Ofcourse, P_- = 1 - P_+. Thus, the output of Logistic Regression always lies in [0, 1].

2. The central premise of Logistic Regression is the assumption that your input space can be separated into two nice ‘regions’, one for each class, by a linear(read: straight) boundary. So what does a ‘linear’ boundary mean? For two dimensions, its a straight line- no curving. For three dimensions, its a plane. And so on. This boundary will ofcourse be decided by your input data and the learning algorithm. But for this to make sense, it is clear that the data points MUST be separable into the two aforementioned regions by a linear boundary. If your data points do satisfy this constraint, they are said to be linear-separable. Look at the image below.

linearly_separable_4(source)

This dividing plane is called a linear discriminant, because 1. its linear in terms of its function, and 2. it helps the model ‘discriminate’ between points belonging to different classes.

(Note: If your points aren’t linearly separable in the original concept space, you could consider converting the feature vectors into a higher dimensional space by adding dimensions of interaction terms, higher degree terms, etc. Such usage of a linear algorithm in a higher dimensional space gives you some benefits of non-linear function learning, since the boundary would be non-linear if plotted back in the original input space.)

==========X===========

But how does Logistic Regression use this linear boundary to quantify the probability of a data point belonging to a certain class?

First, lets try to understand the geometric implication of this ‘division’ of the input space into two distinct regions. Assuming two input variables for simplicity(unlike the 3-dimensional figure shown above)- x_1 and x_2, the function corresponding to the boundary will be something like

\beta_0 + \beta_1 x_1 + \beta_2 x_2.

(It is crucial to note that x_1 and x_2 are BOTH input variables, and the output variable isn’t a part of the conceptual space- unlike a technique like linear regression.)

Consider a point (a, b). Plugging the values of  x_1 and x_2 into the boundary function, we will get its output \beta_0 + \beta_1 a + \beta_2 b. Now depending on the location of (a, b), there are three possibilities to consider-

I. (a, b) lies in the region defined by points of the + class. As a result, \beta_0 + \beta_1 a + \beta_2 b will be positive, lying somewhere in (0, \infty). Mathematically, the higher the magnitude of this value, the greater is the distance between the point and the boundary. Intuitively speaking, the greater is the probability that (a, b) belongs to the + class. Therefore, P_+ will lie in (0.5, 1].

2. (a, b) lies in the region defined by points of the - class. Now, \beta_0 + \beta_1 a + \beta_2 b will be negative, lying in (-\infty, 0). But like in the positive case, higher the absolute value of the function output, greater the probability that (a, b) belongs to the - class. P_+ will now lie in [0, 0.5).

3. (a, b) lies ON the linear boundary. In this case, \beta_0 + \beta_1 a + \beta_2 b = 0. This means that the model cannot really say whether (a, b) belongs to the + or - class. As a result, P_+ will be exactly 0.5.

Great. So now we have a function that outputs a value in (-\infty, \infty) given an input data point. But how do we map this to the probability P_+, that goes from [0, 1]? The answer, is in the odds function.

Let P(X) denote the probability of an event X occurring. In that case, the odds ratio (OR(X)) is defined by  \frac{P(X)}{1-P(X)}, which is essentially the ratio of the probability of the event happening, vs. it not happening. It is clear that probability and odds convey the exact same information. But as $P(X)$ goes from 0 to 1, OR(X) goes from 0 to \infty.

However, we are still not quite there yet, since our boundary function gives a value from –\infty to \infty. So what we do, is take the logarithm of OR(X), called the log-odds function. Mathematically, as OR(X) goes from 0 to \infty, log(OR(X)) goes from –\infty to \infty!

So we finally have a way to interpret the result of plugging in the attributes of an input into the boundary function. The boundary function actually defines the log-odds of the + class, in our model. So essentially, inour two-dimensional example, given a point (a, b), this is what Logistic regression would do-

Step 1. Compute the boundary function(alternatively, the log-odds function) value, \beta_0 + \beta_1 a + \beta_2 b. Lets call this value t for short.

Step 2. Compute the Odds Ratio, by doing OR_+ = e^t. (Since t is the logarithm of OR_+).

Step 3. Knowing OR_+, it would compute P_+ using the simple mathematical relation

P_+ = \frac{OR_+}{1 + OR_+}.

There you go! In fact, once you know t from step 1, you can combine steps 2 and 3 to give you

P_+ = \frac{e^t}{1 + e^t}

The RHS of the above equation is called the logistic function. Hence the name given to this model of learning :-).

==========X===========

We have now understood the intuition behind Logistic Regression, but the question remains- How does it learn the boundary function \beta_0 + \beta_1 x_1 + \beta_2 x_2? The mathematical working behind this is beyond the scope of this post, but heres a rough idea:

Consider a function g(x), where x is a data point in the training dataset. g(x) can be defined in simple terms as:

If x is a part of the + class, g(x) = P_+ (Here, P_+ is the output given by your Logistic Regression model). If x is a part of the - class, g(x) = 1 - P_+.

Intuitively, g(x) quantifies the probability that a training point was classified correctly by your model. Therefore, if you average g(x) over your entire training data, you would get the likelihood that a random data point would be classified correctly by your system, irrespective of the class it belongs to. Simplifying things a little, it is this ‘average’ g(x) that a Logistic Regression learner tries to maximize. The method adopted for the same is called maximum likelihood estimation (for obvious reasons). Unless you are a mathematician, you can do without learning how the optimization happens, as long as you have a good idea of what is being optimized – mostly because most statistics or ML libraries have inbuilt methods to get it done.

==========X===========

Thats all for now! And like all my blog posts, I hope this one helps some guy trying to Google up and learn some stuff on his own, understand the misunderstood technique of Logistic Regression. Cheers!