Cross Validation and the Bias-Variance tradeoff (for Dummies)

(Note: This post assumes that the reader is familiar with the Bias-Variance concept in statistical learning. If not, this post will help.)

Most ML newbies out there have a pretty good grasp of the learning algorithms themselves, but lack one crucial piece of the puzzle: a way to test and validate their model against real-world data. This is, in fact, one of the most important parts of the whole data mining process, and one that is commonly ignored. This post tries to acquaint the reader with an important (and very powerful) method of ensuring that a given model works before it’s let out into the world.

Literature on testing ML models always talks about a ‘training’ dataset and a ‘test’ dataset. However, in a practical scenario, we usually don’t get an actual ‘test’ set until we go for deployment. And since the training set is the one we use to help the model learn, it’s not usually an accurate indicator of the model’s effectiveness on unseen data. One simple (and straightforward) way of solving this problem is called the hold-out method. It’s exactly what it sounds like:

i. You hold out a part of the available dataset from the whole learning process.

ii. You then train your model using the remaining portion of your dataset.

iii. The part of the dataset you held out during training is now used as the test set, since it’s actually unseen from the model’s POV. The testing framework will usually give you an indicator of the model’s effectiveness, e.g. the mean squared error in the case of linear regression.

iv. If the indicator in the previous step suggests that the model works well enough, it is then trained using the entire dataset we previously had.

v. Deployment!
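To make steps i-v concrete, here is a minimal sketch of the hold-out method using scikit-learn; the synthetic dataset, the 80/20 split and the linear regression model are all just assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# stand-in data: in practice X, y would be your own features and targets
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# step i: hold out 20% of the data from the learning process
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# step ii: train on the remaining portion
model = LinearRegression().fit(X_train, y_train)

# step iii: the held-out part acts as the test set
print(mean_squared_error(y_test, model.predict(X_test)))

# step iv: if the error looks acceptable, retrain on the entire dataset before deployment
model.fit(X, y)
```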

This is roughly how it goes. However, how you hold out the data and how many times you run steps i-iii determine how accurately the method will assess your model. In fact, there’s a bias-variance tradeoff inherent in the entire process! Let’s take each case one by one (remember, each time I mention bias or variance, it is with respect to the testing process, and not your model, unless otherwise mentioned):

1. The Validation Set Approach

This is a pretty straightforward way of doing it. You reserve around half of your original dataset for testing (or validation), and the other half for training. Once you get an estimate of the model’s error, you may also swap the two halves, using the portion previously used for training as the test set and vice versa. Effectively, this gives you two estimates of how well your model works.

However, there are a couple of problems with this approach. First off, you get only one (or two) estimates of the model’s accuracy. Though this is better than having no test dataset, it still leaves a lot of room for improvement. Secondly, since you reserved half of the original dataset for testing, you induced a significant amount of bias into the whole process. In plain English, do you think a model trained with half of the dataset will be comparable to one trained with the complete dataset? Usually, no. The next method tries to tackle this.

2. Leave-One-Out Cross-Validation (LOOCV)

In this case, we run steps i-iii of the hold-out technique multiple times. Each time, only one of the data points in the available dataset is held out and the model is trained on the rest. This allows the entire procedure of training + testing to be run as many times as there are data points in your dataset. Since each data point appears in the training set as well as the test set (in different iterations), the process is called cross-validation. The good thing about this approach is that there is negligible bias inherent in the testing process: since you are using almost your entire dataset for training each time, the model you come up with would be pretty close to the real thing.

However, we have now induced a significant amount of variance into our testing technique. Since only one data point is used for testing each time, the variance in the estimates of your model’s error will be pretty high (especially if you have multiple outliers in your dataset)! This wasn’t a problem with the previous method, since half of the total dataset was being used for testing. Moreover, this technique is computationally very, very intensive: you have to train and test your model as many times as there are data points. This can spell trouble if your dataset contains millions of them.
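For reference, here is what LOOCV might look like with scikit-learn’s LeaveOneOut splitter; the synthetic data and the choice of linear regression are again just stand-ins.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import LeaveOneOut

X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

errors = []
for train_idx, test_idx in LeaveOneOut().split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    # only one point is held out per iteration, so each estimate rests on a single prediction
    errors.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))

print(np.mean(errors), np.std(errors))  # the individual error estimates are very noisy
```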

 

The two approaches described above give us a fair idea of what we want from our final testing procedure:

I) (Minimizing the testing bias) We want a sufficiently large portion of the dataset for training. In other words, we don’t want to hold out too much of the data for testing. This mainly ensures that the model we train as part of the testing procedure is as close as possible to the one we would get by training on the entire dataset.

II) (Minimizing the testing variance) We want a good number of data points for testing. If too few are used for this purpose, you will not be able to trust the estimate of error given by your testing framework; this is especially true if your luck is really bad and you happen to randomly choose the few bad data points in your dataset (the outliers) for testing.

III) We want to run the whole procedure of training + testing multiple times, to ensure that our model consistently gives low estimates of error (giving a low estimate of error means your model itself has low bias, and doing so consistently means low variance too).

The best way to do this is:

3. k-fold Cross-Validation

This is a brilliant way of balancing the bias-variance tradeoff in your testing process AND ensuring that your model itself has low bias and low variance. The testing procedure can be summarized as follows (where k is an integer):

i. Divide your dataset randomly into k different parts.

ii. Repeat k times:

a. Choose a part out of the k available ones (one that hasn’t been chosen before, obviously).

b. Use the other k-1 parts together for training your model.

c. Use the part that was held out in step ii.a for testing. This will give you one of the k estimates of your model’s error.

For k=10, the process can be visualized as follows:

[Figure: 10-fold cross-validation, with each of the 10 parts taking one turn as the test set]

Choosing a good value of ‘k’ ensures that your testing procedure gives you the best possible estimate of how well your model works. Choose a value too small, and you will drift towards the extremity of the Validation Set Approach (biased testing). Choose a value too large, and it will be more like LOOCV (too much variance in the testing procedure, and computationally intensive). Usually, a value between 5 and 10 is used in practical Machine Learning scenarios.
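Here is a rough sketch of the above procedure using scikit-learn’s KFold splitter; the data and the model are, once more, placeholder assumptions (scikit-learn’s cross_val_score wraps this loop into a single call if you prefer).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

k = 10
errors = []
# step i: divide the dataset randomly into k parts
for train_idx, test_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
    # step ii.b: train on the other k-1 parts
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    # step ii.c: test on the held-out part, giving one of the k error estimates
    errors.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))

print(errors)  # e_1, e_2, ..., e_k
```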

That brings us to the final question:

How do we assess the model’s bias-variance characteristics using k-fold Cross Validation?

When you perform k-fold CV, you get k different estimates of your model’s error, say e_1, e_2, e_3, ..., e_k. Since each e_i is an error estimate, it should ideally be zero.

To check your model’s bias, compute the mean of all the e_i values. If this value is low, it basically means that your model gives low error on average, indirectly ensuring that your model’s notions about the data are accurate enough.

To check your model’s variance, compute the standard deviation of all the e_i values. If this value is high, it means that your model’s performance (in terms of its error) varies a lot with the dataset used for training, something you don’t want.
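Continuing the hypothetical k-fold sketch above, both checks reduce to two summary statistics over the k error estimates.

```python
import numpy as np

errors = np.array(errors)  # the k error estimates from the k-fold loop above
print("mean error (bias check):", errors.mean())        # low mean -> low model bias
print("std of errors (variance check):", errors.std())  # low std  -> low model variance
```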

Obviously, what is ‘high’ enough to rule out your model as ineffective depends on your particular application and your domain knowledge (and subsequently your skill as a data miner). Nevertheless, k-fold Cross-Validation is a powerful practical tool to quantify a model’s performance, and has been shown to estimate the real-world numbers pretty accurately in a wide range of scenarios.

Logistic Regression (for dummies)

(Note: This is a post attempting to explain the intuition behind Logistic Regression to readers NOT well acquainted with statistics. Therefore, you may not find any rigorous mathematical work in here.)

Logistic Regression is a type of classification algorithm involving a linear discriminant. What do I mean by that?

1. Unlike actual regression, logistic regression does not try to predict the value of a numeric variable given a set of inputs. Instead, the output is a probability that the given input point belongs to a certain class. For simplicity, let’s assume that we have only two classes (for multiclass problems, you can look at Multinomial Logistic Regression), and that the probability in question is P_+, the probability that a certain data point belongs to the ‘+’ class. Of course, P_- = 1 - P_+. Thus, the output of Logistic Regression always lies in [0, 1].

2. The central premise of Logistic Regression is the assumption that your input space can be separated into two nice ‘regions’, one for each class, by a linear (read: straight) boundary. So what does a ‘linear’ boundary mean? For two dimensions, it’s a straight line, no curving. For three dimensions, it’s a plane. And so on. This boundary will of course be decided by your input data and the learning algorithm. But for this to make sense, the data points MUST be separable into the two aforementioned regions by a linear boundary. If your data points satisfy this constraint, they are said to be linearly separable. Look at the image below.

[Figure: two linearly separable classes of points in three dimensions, divided by a plane] (source)

This dividing plane is called a linear discriminant, because (1) it is linear in terms of its function, and (2) it helps the model ‘discriminate’ between points belonging to different classes.

(Note: If your points aren’t linearly separable in the original concept space, you could consider converting the feature vectors into a higher dimensional space by adding dimensions of interaction terms, higher degree terms, etc. Such usage of a linear algorithm in a higher dimensional space gives you some benefits of non-linear function learning, since the boundary would be non-linear if plotted back in the original input space.)
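As a quick illustration of that note, one way (assuming scikit-learn is available) to add interaction and higher-degree terms is PolynomialFeatures; the two sample points here are made up.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])   # two made-up points with features x_1 and x_2

# a degree-2 expansion adds squared and interaction terms
expanded = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
print(expanded)   # columns: x_1, x_2, x_1^2, x_1*x_2, x_2^2
```

A linear boundary learned in this expanded space corresponds to a curved (here, quadratic) boundary back in the original two-dimensional space.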

==========X===========

But how does Logistic Regression use this linear boundary to quantify the probability of a data point belonging to a certain class?

First, let’s try to understand the geometric implication of this ‘division’ of the input space into two distinct regions. Assuming two input variables for simplicity (unlike the 3-dimensional figure shown above), x_1 and x_2, the function corresponding to the boundary will be something like

\beta_0 + \beta_1 x_1 + \beta_2 x_2.

(It is crucial to note that x_1 and x_2 are BOTH input variables, and the output variable isn’t a part of the conceptual space, unlike in a technique like linear regression.)

Consider a point (a, b). Plugging these values in for x_1 and x_2 in the boundary function, we get its output \beta_0 + \beta_1 a + \beta_2 b. Now, depending on the location of (a, b), there are three possibilities to consider:

1. (a, b) lies in the region defined by points of the + class. As a result, \beta_0 + \beta_1 a + \beta_2 b will be positive, lying somewhere in (0, \infty). Mathematically, the higher the magnitude of this value, the greater the distance between the point and the boundary, and intuitively speaking, the greater the probability that (a, b) belongs to the + class. Therefore, P_+ will lie in (0.5, 1].

2. (a, b) lies in the region defined by points of the - class. Now, \beta_0 + \beta_1 a + \beta_2 b will be negative, lying in (-\infty, 0). But as in the positive case, the higher the absolute value of the function output, the greater the probability that (a, b) belongs to the - class. P_+ will now lie in [0, 0.5).

3. (a, b) lies ON the linear boundary. In this case, \beta_0 + \beta_1 a + \beta_2 b = 0. This means that the model cannot really say whether (a, b) belongs to the + or - class. As a result, P_+ will be exactly 0.5.

Great. So now we have a function that outputs a value in (-\infty, \infty) given an input data point. But how do we map this to the probability P_+, which lies in [0, 1]? The answer lies in the odds function.

Let P(X) denote the probability of an event X occurring. The odds of X, OR(X), are then defined as \frac{P(X)}{1-P(X)}, which is essentially the ratio of the probability of the event happening vs. it not happening. It is clear that probability and odds convey the exact same information. But as P(X) goes from 0 to 1, OR(X) goes from 0 to \infty.

However, we are still not quite there, since our boundary function gives a value from -\infty to \infty. So what we do is take the logarithm of OR(X), called the log-odds function. Mathematically, as OR(X) goes from 0 to \infty, \log(OR(X)) goes from -\infty to \infty!

So we finally have a way to interpret the result of plugging the attributes of an input into the boundary function: the boundary function actually defines the log-odds of the + class in our model. So essentially, in our two-dimensional example, given a point (a, b), this is what Logistic Regression would do:

Step 1. Compute the boundary function (alternatively, the log-odds function) value, \beta_0 + \beta_1 a + \beta_2 b. Let’s call this value t for short.

Step 2. Compute the odds, OR_+ = e^t (since t is the logarithm of OR_+).

Step 3. Knowing OR_+, compute P_+ using the simple mathematical relation

P_+ = \frac{OR_+}{1 + OR_+}.

There you go! In fact, once you know t from step 1, you can combine steps 2 and 3 to give you

P_+ = \frac{e^t}{1 + e^t}

The RHS of the above equation is called the logistic function. Hence the name given to this model of learning :-).
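To make steps 1-3 concrete, here is a tiny sketch with made-up coefficients and a made-up point; the \beta values and (a, b) are purely illustrative.

```python
import math

# hypothetical learned coefficients (beta_0, beta_1, beta_2) and a point (a, b)
beta_0, beta_1, beta_2 = -1.0, 2.0, 0.5
a, b = 0.8, 1.2

t = beta_0 + beta_1 * a + beta_2 * b   # step 1: the log-odds of the + class
odds = math.exp(t)                     # step 2: the odds of the + class
p_plus = odds / (1 + odds)             # step 3: the logistic function
print(t, p_plus)                       # t > 0 here, so p_plus > 0.5
```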

==========X===========

We have now understood the intuition behind Logistic Regression, but the question remains: how does it learn the boundary function \beta_0 + \beta_1 x_1 + \beta_2 x_2? The mathematical working behind this is beyond the scope of this post, but here’s a rough idea:

Consider a function g(x), where x is a data point in the training dataset. g(x) can be defined in simple terms as:

If x is a part of the + class, g(x) = P_+ (Here, P_+ is the output given by your Logistic Regression model). If x is a part of the - class, g(x) = 1 - P_+.

Intuitively, g(x) quantifies the probability that a training point was classified correctly by your model. Therefore, if you average g(x) over your entire training data, you get the likelihood that a random data point is classified correctly by your system, irrespective of the class it belongs to. Simplifying things a little, it is this ‘average’ g(x) that a Logistic Regression learner tries to maximize. The method adopted for this is called maximum likelihood estimation (for obvious reasons). Unless you are a mathematician, you can do without learning how the optimization happens, as long as you have a good idea of what is being optimized, mostly because most statistics and ML libraries have built-in methods to get it done.
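For example, here is a minimal sketch of one such built-in method, scikit-learn’s LogisticRegression; the toy dataset is generated on the spot and purely illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# toy two-class dataset with two input variables, x_1 and x_2
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

clf = LogisticRegression().fit(X, y)   # maximum likelihood estimation happens here
print(clf.intercept_, clf.coef_)       # beta_0 and (beta_1, beta_2) of the boundary
print(clf.predict_proba(X[:3]))        # P_- and P_+ for the first three points
```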

==========X===========

That’s all for now! And like all my blog posts, I hope this one helps someone out there, trying to Google things up and learn on their own, understand the misunderstood technique of Logistic Regression. Cheers!