# Interesting take-aways from ‘Data Science For Business’

I have recently been reading Data Science for Business by Foster Provost and Tom Fawcett. I picked it up on Amazon some time back, while trying to find some good books on the applications of machine learning in industries. And I hit the bull’s eye with this one!

Data Science for Business does an awesome job at getting the reader acquainted with all the basic(as well as some niche) applications of ML frameworks in a business context. What this book does not do, is give you the rigor behind the algorithms discussed in it. The focus is on the when and why, not the how. For example, you will find detailed descriptions of the kind of problems Association Mining solves- from Market Basket Analysis to mining Facebook ‘like’ relationships- but you wont find details of the Apriori or any other actual AM techniques. Why I love this book, is because it does what most other ML texts don’t- giving the reader an intuitive understanding of when to apply a certain technique and why. For a newbie, reading this book side-by-side with an established text like Elements of Statistical Learning would perhaps be the best way to get acquainted with Machine Learning and its applications.

There were a lot of ‘Aha!’ moments for me while reading this book, especially on the lines of ‘I never thought it could be used this way!’ or ‘Thats a good rule of thumb to follow while applying this!’. To give you guys a brief idea, here are three interesting things I learned/revisited while reading this book:

1. CRISP Data Mining

Okay, anyone who has done a formal course in Data Mining knows about the CRoss Industry Standard Process for Data Mining, and for good reason. Its one of the most fundamental, extremely basic ways to go about mining any kind of data. A diagrammatic representation of the same is shown below:

However, while most courses would teach their students about this framework and leave it at that, this book tries to explain many other applications of statistical learning techniques in the context of CRISP data mining. For example, in an exploratory data mining technique such as clustering, its the Evaluation part of the process thats highly crucial- since you need to extract the maximum ‘information’ that you can from the results that a standard clustering technique would provide. More on this in a bit.

The book uses the Expected Value framework to augment the way in which CRISP is applied. It basically stresses the fact that you need to formalize your Business Understanding of the problem by expressing it mathematically- as a value to estimate, or function to optimize. At this stage, you would rather think of your solution as a black box- not focusing on the algorithms and implementations underneath. You then need to understand your data in terms of the available attributes, sparseness of the data, etc. (Data Understanding). Only once you have the mathematical formulation and familiarity with the data, are you in a position to combine the two in a synergistic mix of Data Preparation and Modelling – in essence, formulating a solution that performs best with the data you have.

2. Using Decision Trees to Understand Clusters

Why understand the output of clustering?

Since clustering, as I mentioned before, is an exploratory process, you don’t exactly know what you expect to find going in. You may have a rough idea of the number of clusters(or not), but not what each of them represent- atleast in the general case.

Lets consider a simple example. You have decided to use DBSCAN to cluster all of your customers into an unknown number of groups, where customers in the same cluster would have similar properties based on some criteria. You hope to use the output of this clustering to classify a new customer into one of these groups, and deal with him better given the characteristics that set him apart from the rest. You can easily use a supervised learning process to classify a new customer given the clustering output, but what you don’t have is an intuitive understanding of what the cluster represents. Usually, the following two methods are most likely used to ‘understand’ a given cluster-

i. Pick out a member of the cluster that has the highest value for a certain property. For example, if you are clustering companies, you could just pick the one that has the highest turnover- which would mean the company thats most ‘well-known’.

ii. Pick out that member which is closest to the centroid.

However, neither of these would be applicable in all contexts and problems. Especially if the cluster members don’t mean anything to you individually, or you need to focus on the trends in a cluster- which cannot be summarized by a single member anyways.

How do you use a Decision Tree here?

A Decision Tree is usually pretty under-estimated an algorithm when it comes to supervised learning. The biggest reason for this is its innate simplicity which results in people going for more ‘sophisticated’ techniques like SVMs(usually). However, this is exactly the property that makes us select a decision tree in this scenario.

To understand a cluster K, what you essentially do is this:

i. Take all members of cluster K on one side, and members of all other clusters on another side.

ii. ‘Learn’ a decision tree to classify any given data point as a member of K or not-member of K – essentially a one-vs-all classification.

iii. Decompose the decision tree into the corresponding set of rules.

Voila! What you now essentially have, is a set of rules that help you distinguish the properties of class K from the rest. By virtue of decision-tree learning algorithms such as C4.5, your rules will optimally be based on those attributes that play the biggest role in characterising members of cluster K. Elegant isn’t it? The book explains this using the famous Whiskies dataset.

3. Virtual Items in Association Mining

This isn’t a big technique of any kind, but a nifty thing to remember all the same. Mostly always, when talking of Association Mining, we think of conventional ‘items’ as they would appear in any Market-Basket analysis. However, one way to expand this mining process, is to include Virtual Items in the ‘Baskets’. For example, suppose you are a big chain of supermarkets across the country. Obviously, every time a customer buys something, you include all the purchased items in a ‘transaction’. What you could also include, are some other attributes of the transaction such as- Location, Gender of person buying, Time of the Data, etc. So a given transaction might look like (Mumbai, Male, Afternoon, Bread, Butter, Milk). Here, the first three ‘items’ aren’t really items at all! Hence, they are called ‘virtual items’. Nonetheless, if you enter these transactions into an Association Mining framework, you could end up finding relationships between conventional goods and properties such as gender and location!

These are just three of such things to learn from this book. I would definitely recommend this book to you, especially if you want to supplement your knowledge of various ML algorithms with the intuition of how and where to apply them. Cheers!

# Cross Validation and the Bias-Variance tradeoff (for Dummies)

(Note: This post assumes that the reader is familiar with the Bias-Variance concept in statistical learning. If not, this post will help.)

Most ML newbies out there have a pretty good grasp on every learning algorithm, but lack one crucial piece of the puzzle- Understanding a way to test and validate their model with respect to real-world data. This is, infact, the most important part of the whole data mining process. And one that is commonly ignored. This post tries to acquaint the reader with an important(and very powerful) method of ensuring that a given model works, before it’s let out into the world.

Literature on testing ML models always talks about a ‘training’ dataset and a ‘test’ dataset. However, in a practical scenario, we usually don’t get an actual ‘test’ set until we go for deployment. And since the training set is the one we use for helping the model learn, its not usually an accurate indicator of the model’s effectiveness with respect to unseen data. One simple(and straight-forward) way of solving this problem, is called the hold-out method. Its exactly what it sounds like:

i. You hold-out a part of the available dataset from the whole learning process.

ii. You then train your model using the remaining portion of your dataset.

iii. The part of the dataset you held-out during training is now used as the test set, since its actually unseen from the model’s POV. The testing framework will usually give you an indicator of the model’s effectiveness, for e.g. the mean squared error in case of linear regression.

iv. If the indicator in the previous step suggests that the model works well enough, it is then trained using the entire dataset we previously had.

v. Deployment!

This is roughly how it goes. However, how you hold-out the data and how many times you run steps i-iii determine how accurately the method will assess your model. In fact, theres a bias-variance tradeoff inherent in the entire process! Lets take each case one by one (Remember, each time I mention bias or variance, it is with respect to the testing process, and not your model- unless otherwise mentioned):

1. The Validation Set Approach

This is a pretty straight-forward way of doing it. You reserve around half of your original dataset for testing(or validation), and the other half for training. Once you get an estimate of the model’s error, you may also use the portion previously used for training for testing now, and vice versa. Effectively, this gives you two estimates of how well your model works.

However, there are a couple of problems with this approach. First off, you get only one(or two) estimates of the model’s accuracy. Though this is better than having no test dataset, it still leaves a lot of room for improvement. Secondly, since you reserved half of the original dataset for testing, you induced a significant amount of bias into the whole process- in plain English, do you think the framework trained with half of the dataset will be comparable to the one trained with the complete dataset? Usually, No. The next method tries to tackle this.

2. Leave-One-Out- Cross Validation (LOOCV)

In this case, we run steps i-iii of the hold-out technique, multiple times. Each time, only one of the data-points in the available dataset is held-out and the model is trained with respect to the rest. Therefore, this process allows the entire procedure of training+testing to be run as many times as the number of data-points in your training set. Since each data point appears in the training as well as test set(in different iterations), the process is called cross-validation. The good thing about this approach is that there is negligible bias inherent in the testing process- Since you are using almost your entire dataset for training each time, the model you come up with would be pretty close to the real thing.

However, we have now induced a significant amount of variance into our testing technique. Since there is only one data point being used for testing every time, the variance in the estimates of your model’s error would be pretty high! (Especially if you have multiple outliers in your dataset) This wasn’t a problem with the previous method, since half of the total dataset was being used for testing. Moreover, this technique is computationally very, very intensive- you have to train and test your model as many times as there are number of data points. This can spell trouble if your dataset contains millions of them.

The two approaches described above give us a fair idea of what we want from our final testing procedure-

I) (Minimizing the testing bias) We want a sufficiently large portion of the dataset for training. In other words, we don’t want to hold-out too much of the data for testing. This mainly ensures that the model we train as a part of the testing procedure, would be as close as possible to the one that we would get if we trained it on the entire dataset.

II) (Minimizing the testing variance) We want a good amount of data points for testing. If too little a number is used for this purpose, you will not be able to trust estimate of error that is given by your testing framework- this is especially true if your luck is really bad and you happen to randomly choose the few bad data points from your dataset(the outliers) for testing.

III) We would want to run the whole procedure of testing+training multiple times, to ensure that our model consistently gives low estimates of error (giving a low estimate of error means your model itself has low bias, and doing so consistently means low variance too).

The best way to do this, is:

3. k-fold Cross-Validation

This is a brilliant way of achieving the bias-variance tradeoff in your testing process AND ensuring that your model itself has low bias and low variance. The testing procedure can be summarized as follows (where k is an integer) –

i. Divide your dataset randomly into k different parts.

ii. Repeat k times:

a. Choose a part out of the k available ones (one that hasn’t been chosen before, obviously).

b. Use the other k-1 parts together for training your model.

c. Use the part that was held-out in step ii.a., for testing. This will give you one of the k estimates of your model’s error.

For k=10, the process can be visualized as follows:

Choosing a good value of ‘k‘ ensures that your testing procedure gives you the best possible estimate of how well your model works. Choose a value too small, and you will drift towards the extremity of using the Validation Set Approach(biased testing). Choose a value too large, and it will be more like LOOCV (too much of variance in the testing procedure, and computationally intensive). Usually, a value between 5-10 is used in practical Machine Learning scenarios.

That brings us to the final question:

How do we assess the model’s bias-variance characteristics using k-fold Cross Validation?

When you perform k-fold CV, you get k different estimates of your model’s error- say $e_1, e_2, e_3, ..., e_k$. Since each $e_i$ is an error estimate, it should ideally be zero.

To check out you model’s bias, find out the mean of all the $e_i$s. If this value is low, it basically means that your model gives low error on an average– indirectly ensuring that your model’s notions about the data are accurate enough.

To check out your model’s variance, compute the standard deviation of all the $e_i$s. If this value is high, it means that your model’s performance(in terms of its error) varies a lot with the dataset used for training- something you don’t want.

Obviously, what is ‘high’ enough to rule out your model as being ineffective, depends on your particular application and your domain knowledge (and subsequently your skill as a data miner). Nevertheless, k-fold Cross-Validation is a powerful practical tool to quantify a model’s performance, and has been shown to pretty accurately estimate the real-world numbers in a wide range of scenarios.

# Linear and Quadratic Discriminant Analysis for ML / statistics newbies

(Note: This post assumes that the reader has knowledge of basic statistics and terms used in machine learning. Inspite of that, I have provided links whereever necessary 🙂 ).

Linear Discriminant Analysis(LDA) and Quadratic Discriminant Analysis(QDA) are types of Bayesian classifiers.

A Bayesian classifier, in mathematical terms, does the following-

$C^{bayesian}(x) = argmax_{r \in \{1, 2, ..., k\}} Pr(Y = r | X = x)$

What does this mean? To put it in the form of steps, heres what happens-

1. The classifier is given an input($X$) that is the feature vector $x$. $x$ consists of $p$ different predictors, one for each input dimension.

2. Then, with respect to each possible output class($Y$) $r$ from $1, 2, ..., k$, the classifier computes $Pr(Y = r | X = x)$ – this is the probability that the actual output is $r$, given $x$ as the input.

3. The class (i.e. that value of $r$) that returns the maximum value for the above mentioned probability, is given as the output of the classifier – $C^{bayesian}(x)$.

In essence, a Bayesian classifier tries to minimize the probability of misclassification (which would be equal to $1 - Pr(Y = r | X = x)$).

But then the most natural question is- How is this probability computed? It is precisely at this point where all Bayesian classifiers differ. Each one of them uses a different technique to understand the data, and compute these crucial quantities. In fact, even a simple nearest-neighbors classifier can be used as a Bayesian classifier, with $Pr(Y = r | X = x)$ being the fraction of the nearest neighbors to $x$  that belong to class $r$. LDA and QDA, on the other hand, use a more mathematical approach.

Anyone acquainted with basic statistics knows Bayes theorem. Given two possible events $A$ and $B$, it states-

$Pr(A|B) = \frac{Pr(A)Pr(B|A)}{Pr(B)}$

Applying this to $Pr(Y = r | X = x)$, we would get

$Pr(Y = r | X = x) = \frac{Pr(Y = r)Pr(X = x | Y = r)}{Pr(X = x)}$.

Now, look at the first part in the above numerator- $Pr(Y = r)$. This is essentially the probability of the class of a random input being $r$ – prior to even looking at what the input is! Thats why its called the priori probability. During the training stage itself, it is computed as the fraction of the training samples that belonged to class $r$. As you might have understood by now, this number is independent of the what $x$ is, and reflects the trends seen in the training set.

Now lets consider the second part – $Pr(X = x | Y = r)$. This number is conceptually equivalent to answering the question- Given that a random data point(not necessarily in the training data) was being selected from class $r$, what would be the likelihood that it looked like $x$? To quantify this ‘likelihood’, LDA and QDA use a Multivariate Gaussian Distribution model for each class. Mathematically put, the probability density function for a multivariate Gaussian distribution is-

$f_{\boldsymbol\Sigma, \boldsymbol\mu}(x) = \frac{1}{\sqrt{(2\pi)^k|\boldsymbol\Sigma|}}\exp(-\frac{1}{2}({ x}-{\boldsymbol\mu})^\mathrm{T}{\boldsymbol\Sigma}^{-1}({x}-{\boldsymbol\mu}))$.

You definitely don’t need to remember this huge-ass formula. To put it simply, heres the intuition behind it- A probability distribution model is a way for an algorithm to understand how data points are distributed in a $p$-dimensional space. In the case of the multivariate Gaussian(also called Normal) distribution, the two parameters $\boldsymbol\mu$ (the $p$-dimensional mean vector) and $\boldsymbol\Sigma$ (the $p$x$p$-dimensional covariance matrix) are used to ‘summarize’ how the data ‘looks’ in the conceptual space. These are obviously learnt during training.

While $\boldsymbol\mu$ denotes the approximate average position, $\boldsymbol\Sigma$ summarizes the ‘shape’ of the distribution of data points. (Imagine drawing a random cluster of points on a graph paper. Now draw an ellipse-like shape over that region where most of the points are concentrated. The ‘central’ point of the ellipse will be a 2-dimensional vector $\boldsymbol\mu$, and the ‘shape’ would be approximated by a 2×2 matrix given by $\boldsymbol\Sigma$.) The corresponding probability density function(pdf) for a random $x$ in space, is proportional to how likely it is that a point at that location would be selected randomly from the distribution. The pdf is to probability, what pressure is to force. Pressure isn’t exactly force itself, but it shows how intense the latter is at a particular location- and the higher the pressure, the greater is the likelihood that the force will show its effect at that point.

Thus, LDA and QDA use the aforementioned $f(x)$ to quantify $Pr(X = x | Y = r)$. What about the denominator then – $Pr(X = x)$? It is nothing but the sum of the values of $Pr(Y = r)Pr(X = x | Y = r)$ over all possible $r$s. This is given by the Total Probability Theorem.

But then where do LDA and QDA differ? They differ in only one subtle point- While LDA assumes the same $\boldsymbol\Sigma$ for all classes, QDA computes a different $\boldsymbol\Sigma$ for each class. Essentially, LDA computes a separate $\boldsymbol\mu$ for each class(using training points that belonged to it), but $\boldsymbol\Sigma$ is computed using the entire training data. And this common covariance matrix is used in the computations corresponding to every class. QDA, on the other hand, computes a separate $\boldsymbol\Sigma$ and $\boldsymbol\mu$ for each possible output class.

We will now tackle the two most important question that might be plaguing your mind-

1. Why is LDA ‘linear’, and QDA ‘quadratic’?

The answer lies in the probability function. Consider the simplest case of one input variable only. The answer can be generalized to multiple dimensions using matrix algebra.

$Pr(Y = r | X = x) = \frac{Pr(Y = r)Pr(X = x | Y = r)}{Pr(X = x)}$

We are maximizing over all possible $r$s right? Intuitively, that should tell you that the denominator in the above equation, doesn’t really matter in comparisons- since its the same for all values of $r$. So we are now down to finding that $r$ which gives the maximum value for

$Pr(Y = r)Pr(X = x | Y = r)$

Let $Pr(Y = r) = \beta_{r}$ (to reduce my LaTeX efforts). Using the formula for 1-D Normal distribution density function, we have

$Pr(X = x | Y = r) = \frac{1}{\sigma_r\sqrt{2\pi}}e^\frac{(x - \mu_r)^2}{\sigma_r^2}$

Multiplying the two and taking a natural logarithm of the resultant expression, we are now down to finding that $r$ that gives the maximum value for

$log(\beta_{r}/\sigma_r) - \frac{x^2}{2\sigma_r^2} - \frac{\mu_r^2}{2\sigma_r^2} + \frac{x\mu_r}{\sigma_r^2}$

Observe the above quadratic(in x) expression carefully. Once again, I have omitted those terms that are constants or which remain the same irrespective of $r$(NOT $x$), which is the focus of the whole process. Or have I? Consider the second term. IF I am talking of the LDA classifier, would $\sigma_r$ be dependent on $r$ at all? It wouldn’t, since LDA assumes the same variance/covariance for all classes. As a result, the second term could no longer be something worth taking into the comparison process. Incidentally, that very term is what makes the above expression quadratic. Otherwise, its linear!

And thats (more or less) why LDA is ‘linear’ and QDA is ‘quadratic’ 🙂 .

2. Why would we ever choose LDA? Isn’t QDA conceptually more flexible?

QDA isn’t always better. In a certain way(too mathematically convoluted for this post), LDA does learn a linear boundary between the points belonging to different classes. QDA, on the other hand, learns a quadratic one. If the actual boundaries are indeed linear, QDA may have a higher model bias. – especially if the available data isn’t dense enough. Another(and easier to imagine) scenario is when you have a very limited set of training points. In such a case, if you divide your already sparse dataset into the constituent classes, the covariance matrix computed for each might be extremely inaccurate. In such cases, its better to simplify the entire process and use a common covariance matrix thats computed from the entire dataset we use while training.

# Logistic Regression (for dummies)

(Note: This is a post attempting to explain the intuition behind Logistic Regression to readers NOT well acquainted with statistics. Therefore, you may not find any rigorous mathematical work in here.)

Logistic Regression is a type of classification algorithm involving a linear discriminant. What do I mean by that?

1. Unlike actual regression, logistic regression does not try to predict the value of a numeric variable given a set of inputs. Instead, the output is a probability that the given input point belongs to a certain class. For simplicity, lets assume that we have only two classes(for multiclass problems, you can look at Multinomial Logistic Regression), and the probability in question is $P_+$ -> the probability that a certain data point belongs to the ‘$+$‘ class. Ofcourse, $P_- = 1 - P_+$. Thus, the output of Logistic Regression always lies in [0, 1].

2. The central premise of Logistic Regression is the assumption that your input space can be separated into two nice ‘regions’, one for each class, by a linear(read: straight) boundary. So what does a ‘linear’ boundary mean? For two dimensions, its a straight line- no curving. For three dimensions, its a plane. And so on. This boundary will ofcourse be decided by your input data and the learning algorithm. But for this to make sense, it is clear that the data points MUST be separable into the two aforementioned regions by a linear boundary. If your data points do satisfy this constraint, they are said to be linear-separable. Look at the image below.

This dividing plane is called a linear discriminant, because 1. its linear in terms of its function, and 2. it helps the model ‘discriminate’ between points belonging to different classes.

(Note: If your points aren’t linearly separable in the original concept space, you could consider converting the feature vectors into a higher dimensional space by adding dimensions of interaction terms, higher degree terms, etc. Such usage of a linear algorithm in a higher dimensional space gives you some benefits of non-linear function learning, since the boundary would be non-linear if plotted back in the original input space.)

==========X===========

But how does Logistic Regression use this linear boundary to quantify the probability of a data point belonging to a certain class?

First, lets try to understand the geometric implication of this ‘division’ of the input space into two distinct regions. Assuming two input variables for simplicity(unlike the 3-dimensional figure shown above)- $x_1$ and $x_2$, the function corresponding to the boundary will be something like

$\beta_0 + \beta_1 x_1 + \beta_2 x_2$.

(It is crucial to note that $x_1$ and $x_2$ are BOTH input variables, and the output variable isn’t a part of the conceptual space- unlike a technique like linear regression.)

Consider a point $(a, b)$. Plugging the values of  $x_1$ and $x_2$ into the boundary function, we will get its output $\beta_0 + \beta_1 a + \beta_2 b$. Now depending on the location of $(a, b)$, there are three possibilities to consider-

I. $(a, b)$ lies in the region defined by points of the $+$ class. As a result, $\beta_0 + \beta_1 a + \beta_2 b$ will be positive, lying somewhere in (0, $\infty$). Mathematically, the higher the magnitude of this value, the greater is the distance between the point and the boundary. Intuitively speaking, the greater is the probability that $(a, b)$ belongs to the $+$ class. Therefore, $P_+$ will lie in (0.5, 1].

2. $(a, b)$ lies in the region defined by points of the $-$ class. Now, $\beta_0 + \beta_1 a + \beta_2 b$ will be negative, lying in (-$\infty$, 0). But like in the positive case, higher the absolute value of the function output, greater the probability that $(a, b)$ belongs to the $-$ class. $P_+$ will now lie in [0, 0.5).

3. $(a, b)$ lies ON the linear boundary. In this case, $\beta_0 + \beta_1 a + \beta_2 b = 0$. This means that the model cannot really say whether $(a, b)$ belongs to the $+$ or $-$ class. As a result, $P_+$ will be exactly 0.5.

Great. So now we have a function that outputs a value in (-$\infty$, $\infty$) given an input data point. But how do we map this to the probability $P_+$, that goes from [0, 1]? The answer, is in the odds function.

Let $P(X)$ denote the probability of an event $X$ occurring. In that case, the odds ratio ($OR(X)$) is defined by  $\frac{P(X)}{1-P(X)}$, which is essentially the ratio of the probability of the event happening, vs. it not happening. It is clear that probability and odds convey the exact same information. But as $P(X)$ goes from 0 to 1, $OR(X)$ goes from 0 to $\infty$.

However, we are still not quite there yet, since our boundary function gives a value from –$\infty$ to $\infty$. So what we do, is take the logarithm of $OR(X)$, called the log-odds function. Mathematically, as $OR(X)$ goes from 0 to $\infty$, $log(OR(X))$ goes from –$\infty$ to $\infty$!

So we finally have a way to interpret the result of plugging in the attributes of an input into the boundary function. The boundary function actually defines the log-odds of the $+$ class, in our model. So essentially, inour two-dimensional example, given a point $(a, b)$, this is what Logistic regression would do-

Step 1. Compute the boundary function(alternatively, the log-odds function) value, $\beta_0 + \beta_1 a + \beta_2 b$. Lets call this value $t$ for short.

Step 2. Compute the Odds Ratio, by doing $OR_+ = e^t$. (Since $t$ is the logarithm of $OR_+$).

Step 3. Knowing $OR_+$, it would compute $P_+$ using the simple mathematical relation

$P_+ = \frac{OR_+}{1 + OR_+}$.

There you go! In fact, once you know $t$ from step 1, you can combine steps 2 and 3 to give you

$P_+ = \frac{e^t}{1 + e^t}$

The RHS of the above equation is called the logistic function. Hence the name given to this model of learning :-).

==========X===========

We have now understood the intuition behind Logistic Regression, but the question remains- How does it learn the boundary function $\beta_0 + \beta_1 x_1 + \beta_2 x_2$? The mathematical working behind this is beyond the scope of this post, but heres a rough idea:

Consider a function $g(x)$, where $x$ is a data point in the training dataset. $g(x)$ can be defined in simple terms as:

If $x$ is a part of the $+$ class, $g(x) = P_+$ (Here, $P_+$ is the output given by your Logistic Regression model). If $x$ is a part of the $-$ class, $g(x) = 1 - P_+$.

Intuitively, $g(x)$ quantifies the probability that a training point was classified correctly by your model. Therefore, if you average $g(x)$ over your entire training data, you would get the likelihood that a random data point would be classified correctly by your system, irrespective of the class it belongs to. Simplifying things a little, it is this ‘average’ $g(x)$ that a Logistic Regression learner tries to maximize. The method adopted for the same is called maximum likelihood estimation (for obvious reasons). Unless you are a mathematician, you can do without learning how the optimization happens, as long as you have a good idea of what is being optimized – mostly because most statistics or ML libraries have inbuilt methods to get it done.

==========X===========

Thats all for now! And like all my blog posts, I hope this one helps some guy trying to Google up and learn some stuff on his own, understand the misunderstood technique of Logistic Regression. Cheers!

# On the Bias/Variance tradeoff in Machine Learning

One of the very first concepts any Machine Learning amateur comes across is the Bias-Variance tradeoff. However, inspite of how fundamental the idea is to the domain, I had always found it tricky to fully wrap my head around the it- especially from a practical understanding POV. Not to mention that literature regarding the topic on the internet isn’t that kind towards someone just starting his study on intelligent systems. Anyways, I recently started reading An Introduction to Statistical Learning, one of the best (neither too abstract, nor too rigorous) ML books I have ever come across. This book provides a good understanding of the aforementioned topic really well, right in the second chapter. And so, I thought I should write a blog post pouring out my understanding of the concept-  for others, and my own future self.

For prediction-based Statistical learning methods, the principle aim is to develop a model that can take in a set of predictor variables as input, and provide the best estimate of the dependent variables as the output. To understand how well your algorithm did its job, you usually use two types of (experimentally computed) quantities:

1. Training Error – A measure of how accurately your model predicts the outputs for the inputs used to train the model. Intuitively, as the complexity of your modeling technique increases, this error reduces.

2. Testing(or Cross-validation) Error – A measure of how accurately your model predicts the outputs for the inputs NOT used to train the model. This is the type of error measure you primarily want to reduce. Since your model was trained on the training data itself, it is how it performs on unseen inputs, that determines how ‘useful’ it really is in the real world.

Usually, the trends shown by the above two, are as follows:

As the complexity(or degree of fit) of your algorithm increases, the training error keeps on reducing. The testing error shows similar trends at first, until it starts going back up after a certain point. Why is that so? The reason, is overfitting.

In general, every dataset contains random patterns that are caused by variables you do not consider- whether you like it or not. Obviously, your model is supposed to ‘learn’ ONLY those patterns that are a result of your predictor variables- not these random trends you have no control over. Usually, if you select a model thats too simple for your application, its bad at learning pretty much any patterns in the data. This is called underfitting– not understanding even the essential trends in the data. However, as the flexibility of your model increases(i.e., as it starts getting more complex), it gets more sensitive to trends in the data, and starts learning more and more patterns. This causes the decrease in the cross-validation error.

But as the complexity goes on increasing beyond the required/ideal point, your model gets so sensitive that it starts to see patterns that are misleading. In essence, it has now started to model the random errors that aren’t actually a function of your predictor variables. Intuitively, this is plain wrong. This is what we call overfitting– learning those trends that are caused by random chance, rather than your input variables. As a result, the cross validation error starts increasing beyond a certain point.

Underfitting and overfitting now bring me to the main topics of this post:

I. Bias

Bias is the error thats caused because of your model intrinsically making assumptions that are far from reality. Essentially, this happens if the model you have chosen for learning is very different(usually way too simple) from the real relationship between your dependent and independent variables. Intuitively, its called so, because your modeling is ‘biased’ towards a wrong type of function. For example, if the relationship between Y(your output) and X(your input) is actually a degree-4 equation, and you use simple linear regression for modeling, your bias will be pretty high.

II. Variance

This is the error thats caused because of fluctuations in the training dataset. Ideally speaking, assuming the same problem and same source of data, your learned model should be the same(or approximately same) irrespective of which training dataset you use. In terms of an equation, the coefficients and the constant would approximately be the same independent of the dataset used for training. However, this may not be the case, if your algorithm is too sensitive and captures random patterns that are present only in that particular training set. In essence, variance is the error thats caused because your learning function varies too much with the data used for training.

Now back to the diagram shown earlier. When the degree of fit is too low, your bias error is pretty high – because your model is unable to fully grasp the relationship between the inputs and the outputs. As the degree goes on increasing, the model starts understanding things better – causing the bias error to fall. Ofcourse, the variance error is simultaneously increasing, since your algorithm is becoming more and more sensitive. However, the overall cross-validation error reduces because the fall in the bias error is far too high compared to the increase in the variance error. However, at a certain point, your model has become as intelligent as it possibly could. If you go on increasing the degree beyond this point, the increase in the variance error becomes too significant in front of the (possible) decrease in the bias error. This is the point where overfitting has begun, and your model is remembering trends it shouldn’t.

Therefore, the perfect bias-variance tradeoff can be summarized as- Your model should have low bias(the relationship your model assumes shouldn’t be too simple), and low variance(the relationship your model assumes shouldn’t be too convoluted to be true).

# Combining Genetic Algorithms and Recommendation Systems: Some thoughts

I had tinkered with Genetic Algorithms for a bit during college, working on a method to optimize stock portfolios using historical trends. Algorithms inspired from biology have always fascinated me with their elegance and intuitiveness. If you can relate an algorithm with something other than diagrams/numbers, there’s always a good chance you will understand and use it better. With concepts such as Swarm/Colony Optimization or Evolutionary Algorithms, the ease of understanding can be attributed to their roots in what we learn right in school.

Of course, they aren’t always the right way to go about doing things- especially when there exists an efficient and mathematically sound way of reaching the goal. For example, if Lagrangians can get you there fast enough, you probably don’t need a Particle Optimizer. The main reason being that most biologically inspired algorithms come with no guarantee- given a good definition of ‘utility’ and time, they can provide a good or even great solution- but not necessarily the best one(as promised by analytical algorithms). However, if your solution space is too large and/or you don’t fully understand the nature of your utility function(or it depends on too many factors), its a good idea to try stochastic optimization methods like genetic algorithms.

If you are new to genetic algorithms, heres a good place to learn the basics. All GAs follow the same basic format of working. What changes is how you represent your candidate solutions (the chromosomes) and how you perform key operations on them (genetic operators– selection, crossover, mutation). Selection is usually governed by the fitness function, which is perhaps the most important part of your GA- defining how good a solution is, for the problem you are trying to solve. Usually, the easier it is for you to define your fitness function, the less likely it is that you need a genetic algorithm for your problem (ironically).

Anyways, I was thinking of how one could use a genetic algorithm as a recommendation system. The link between the two is clearer if you think of recommendation-sets as evolving instead of getting generated. Let me explain a little more…

1. Chromosome

Your individual chromosome would be a set of recommendations, such as [r1, r2, …, rk]. By analogy, each of the ri‘s would denote a gene. The length of the recommendation-sets could be fixed or variable, depending on what you start out with, and what your crossover/mutation operators are.

2. Fitness function

Dealing with recommendation-sets instead of individual recommendations would allow you to judge the ‘fitness’ of a set of recommendations as a whole. The fitness of a candidate set would typically be:

a. Directly proportional to the average relevance of a recommendation to the user. (Using similarity functions such as the Jaccard coefficient on vectorial representations of recos and users- how you do it is upto you).

b. Inversely proportional to how many times the user has seen one of those recommendations in the past, and how long before.

c. Directly proportional to how diverse the recommendations are. (Maybe use something similar to an entropy for this?)

3. Selection

Selection could be done using a simple Roulette wheel, or using better explore/exploit techniques such as Boltzmann selection or Sigma scaling. These techniques allow you to explore more of your domain in the earlier stages, when you don’t know much about what the ideal solution should be. As the GA iterations progress, they give greater weightage to better solutions in the selection for ‘mating’.

4. Crossover

Crossovers could be simple single-point or two-point. Since the order of recommendations does not matter, even uniform crossovers would do the trick. Depending on the way you implement this, you could have variable-length recommendation sets (But if you prefer to not have too huge or too small sets, make the fitness function correspondingly dependent on the length of your candidate).

5. Mutation

Simply speaking, this would mean replacing/adding/subtracting a random recommendation in the set. This could be tricky if the domain of your recommendations is too large. In such a case, you could use some heuristics, such as adding items from ones most frequently bought/used, those used by users most similar to the current user, etc.

6. When to perform the iterations?

Every time there is a sufficient change in the user’s profile or the environment(other users/more recommendation items being added), you could run a given number of iterations of the GA. OR, when you have exhausted all the sets generated in the last run of the GA, by showing them one by one. The advantage of this approach(and the whole idea of using GAs for recommendation systems) is that you could use the candidates generated in the last run, as the starting population of the next run. Usually, the user’s profile and environment in consideration wouldn’t be radically different from the last run, and hence the past ‘best’ candidates would be a good place to start the new evolution process.

Ofcourse, this doesn’t solve many of the problems currently plaguing recommendation systems, such as cold-start. However, using biological evolution as the blueprint to keep generating newer sets of recommendations could be an interesting way to go about things :-).