Linear and Quadratic Discriminant Analysis for ML / statistics newbies

(Note: This post assumes that the reader has knowledge of basic statistics and terms used in machine learning. In spite of that, I have provided links wherever necessary 🙂 ).

Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) are types of Bayesian classifiers.

A Bayesian classifier, in mathematical terms, does the following-

C^{bayesian}(x) = argmax_{r \in \{1, 2, ..., k\}} Pr(Y = r | X = x)

What does this mean? To put it in the form of steps, here's what happens-

1. The classifier is given an input (X), the feature vector x. x consists of p different predictors, one for each input dimension.

2. Then, for each possible output class (Y) r in 1, 2, ..., k, the classifier computes Pr(Y = r | X = x) – the probability that the actual output is r, given x as the input.

3. The class (i.e. the value of r) that returns the maximum value for the above-mentioned probability is given as the output of the classifier – C^{bayesian}(x).

In essence, a Bayesian classifier tries to minimize the probability of misclassification (which would be equal to 1 - Pr(Y = r | X = x) for the chosen r).

But then the most natural question is- How is this probability computed? It is precisely at this point where all Bayesian classifiers differ. Each one of them uses a different technique to understand the data and compute these crucial quantities. In fact, even a simple nearest-neighbors classifier can be used as a Bayesian classifier, with Pr(Y = r | X = x) being the fraction of the nearest neighbors to x that belong to class r. LDA and QDA, on the other hand, use a more mathematical approach.
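
To make this concrete, here's a minimal sketch (my own illustration, not a standard library routine) of the nearest-neighbors idea used in the Bayesian way – Pr(Y = r | X = x) is estimated as the fraction of the K nearest training points that belong to class r, assuming integer class labels 0, 1, ..., k-1:

import numpy as np

def knn_posteriors(x, X_train, y_train, num_classes, K=5):
    #Estimate Pr(Y = r | X = x) as the fraction of the K nearest
    #training points (by Euclidean distance) that belong to class r
    distances = np.linalg.norm(X_train - x, axis=1)
    nearest_labels = y_train[np.argsort(distances)[:K]]
    return np.array([np.mean(nearest_labels == r) for r in range(num_classes)])

def bayesian_classify(x, X_train, y_train, num_classes, K=5):
    #C^{bayesian}(x): the class with the highest estimated posterior
    return np.argmax(knn_posteriors(x, X_train, y_train, num_classes, K))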

Anyone acquainted with basic statistics knows Bayes' theorem. Given two possible events A and B, it states-

Pr(A|B) = \frac{Pr(A)Pr(B|A)}{Pr(B)}

Applying this to Pr(Y = r | X = x), we would get

Pr(Y = r | X = x) = \frac{Pr(Y = r)Pr(X = x | Y = r)}{Pr(X = x)}.

Now, look at the first part of the above numerator – Pr(Y = r). This is essentially the probability of the class of a random input being r – prior to even looking at what the input is! That's why it's called the prior probability. During the training stage itself, it is computed as the fraction of the training samples that belonged to class r. As you might have understood by now, this number is independent of what x is, and reflects the trends seen in the training set.
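
In code this is essentially a one-liner; here's a rough sketch (the function name is mine, and it assumes integer class labels 0, 1, ..., k-1):

import numpy as np

def estimate_priors(y_train, num_classes):
    #Pr(Y = r): the fraction of training samples that belong to class r
    return np.bincount(y_train, minlength=num_classes) / len(y_train)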

Now let's consider the second part – Pr(X = x | Y = r). This number is conceptually equivalent to answering the question- If a random data point (not necessarily in the training data) were selected from class r, how likely is it that it would look like x? To quantify this ‘likelihood’, LDA and QDA use a Multivariate Gaussian Distribution model for each class. Mathematically put, the probability density function for a multivariate Gaussian distribution is-

f_{\boldsymbol\Sigma, \boldsymbol\mu}(x) = \frac{1}{\sqrt{(2\pi)^p|\boldsymbol\Sigma|}}\exp(-\frac{1}{2}(x-{\boldsymbol\mu})^\mathrm{T}{\boldsymbol\Sigma}^{-1}(x-{\boldsymbol\mu})).

You definitely don't need to remember this huge-ass formula. To put it simply, here's the intuition behind it- A probability distribution model is a way for an algorithm to understand how data points are distributed in a p-dimensional space. In the case of the multivariate Gaussian (also called Normal) distribution, the two parameters \boldsymbol\mu (the p-dimensional mean vector) and \boldsymbol\Sigma (the p×p covariance matrix) are used to ‘summarize’ how the data ‘looks’ in that conceptual space. These are obviously learnt during training.

While \boldsymbol\mu denotes the approximate average position, \boldsymbol\Sigma summarizes the ‘shape’ of the distribution of data points. (Imagine drawing a random cluster of points on a graph paper. Now draw an ellipse-like shape over the region where most of the points are concentrated. The ‘central’ point of the ellipse will be a 2-dimensional vector \boldsymbol\mu, and the ‘shape’ would be approximated by a 2×2 matrix \boldsymbol\Sigma.) The corresponding probability density function (pdf) for a random x in space is proportional to how likely it is that a point at that location would be selected randomly from the distribution. The pdf is to probability what pressure is to force. Pressure isn't exactly force itself, but it shows how intense the latter is at a particular location- and the higher the pressure, the greater the likelihood that the force will show its effect at that point.
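
If you'd like to see the formula in action, here's a rough NumPy sketch of the density above (the function name is mine; in practice you'd probably just call scipy.stats.multivariate_normal.pdf):

import numpy as np

def gaussian_pdf(x, mu, sigma):
    #Multivariate Gaussian density for a p-dimensional point x,
    #given the mean vector mu and the p x p covariance matrix sigma
    p = len(mu)
    diff = x - mu
    norm_const = 1.0 / np.sqrt(((2 * np.pi) ** p) * np.linalg.det(sigma))
    return norm_const * np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff)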

Thus, LDA and QDA use the aforementioned f(x) to quantify Pr(X = x | Y = r). What about the denominator then – Pr(X = x)? It is nothing but the sum of the values of Pr(Y = r)Pr(X = x | Y = r) over all possible rs. This is given by the Total Probability Theorem.
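
Putting the numerator and denominator together (again just a sketch, reusing the hypothetical gaussian_pdf helper above):

import numpy as np

def class_posteriors(x, priors, mus, sigmas):
    #Pr(Y = r | X = x) for every class r, given per-class priors,
    #mean vectors and covariance matrices (one entry per class)
    numerators = np.array([priors[r] * gaussian_pdf(x, mus[r], sigmas[r])
                           for r in range(len(priors))])
    #Pr(X = x) is the sum of the numerators (Total Probability Theorem)
    return numerators / numerators.sum()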

But then where do LDA and QDA differ? They differ in only one subtle point- While LDA assumes the same \boldsymbol\Sigma for all classes, QDA computes a different \boldsymbol\Sigma for each class. Essentially, LDA computes a separate \boldsymbol\mu for each class (using the training points that belong to it), but \boldsymbol\Sigma is computed from the entire training data. This common covariance matrix is then used in the computations for every class. QDA, on the other hand, computes a separate \boldsymbol\Sigma and \boldsymbol\mu for each possible output class.
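
In sketch form (my own illustration, again assuming integer class labels 0, 1, ..., k-1), the only difference between the two fits is where the covariance matrix comes from:

import numpy as np

def fit_lda_qda(X_train, y_train, num_classes):
    #Per-class mean vectors (used by both LDA and QDA)
    mus = [X_train[y_train == r].mean(axis=0) for r in range(num_classes)]
    #QDA: a separate covariance matrix for each class
    sigmas_qda = [np.cov(X_train[y_train == r], rowvar=False)
                  for r in range(num_classes)]
    #LDA: one covariance matrix, computed from the whole training set
    #(each class centered at its own mean before pooling)
    centered = np.vstack([X_train[y_train == r] - mus[r]
                          for r in range(num_classes)])
    sigma_lda = np.cov(centered, rowvar=False)
    return mus, sigma_lda, sigmas_qda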

We will now tackle the two most important questions that might be plaguing your mind-

1. Why is LDA ‘linear’, and QDA ‘quadratic’?

The answer lies in the probability function. Consider the simplest case of one input variable only. The answer can be generalized to multiple dimensions using matrix algebra.

Pr(Y = r | X = x) = \frac{Pr(Y = r)Pr(X = x | Y = r)}{Pr(X = x)}

We are maximizing over all possible rs, right? Intuitively, that should tell you that the denominator in the above equation doesn't really matter in comparisons – since it's the same for all values of r. So we are now down to finding that r which gives the maximum value for

Pr(Y = r)Pr(X = x | Y = r)

Let Pr(Y = r) = \beta_{r} (to reduce my LaTeX efforts). Using the formula for the 1-D Normal distribution density function, we have

Pr(X = x | Y = r) = \frac{1}{\sigma_r\sqrt{2\pi}}e^{-\frac{(x - \mu_r)^2}{2\sigma_r^2}}

Multiplying the two and taking a natural logarithm of the resultant expression, we are now down to finding that r that gives the maximum value for

log(\beta_{r}/\sigma_r) - \frac{x^2}{2\sigma_r^2} - \frac{\mu_r^2}{2\sigma_r^2} + \frac{x\mu_r}{\sigma_r^2}

Observe the above quadratic (in x) expression carefully. Once again, I have omitted those terms that are constants or that remain the same irrespective of r (NOT x), which is the focus of the whole process. Or have I? Consider the second term. If I am talking of the LDA classifier, would \sigma_r depend on r at all? It wouldn't, since LDA assumes the same variance/covariance for all classes. As a result, the second term would no longer be worth taking into the comparison process. Incidentally, that very term is what makes the above expression quadratic in x. Otherwise, it's linear!

And that's (more or less) why LDA is ‘linear’ and QDA is ‘quadratic’ 🙂 .
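
To spell that last step out: with a common \sigma for every class, the terms involving only \sigma drop out of the comparison, and what's left to maximize over r is

log(\beta_{r}) - \frac{\mu_r^2}{2\sigma^2} + \frac{x\mu_r}{\sigma^2}

which is linear in x. In the QDA case, \sigma_r changes with r, so the -\frac{x^2}{2\sigma_r^2} term survives and the expression stays quadratic in x.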

2. Why would we ever choose LDA? Isn’t QDA conceptually more flexible?

QDA isn't always better. In a certain way (too mathematically convoluted for this post), LDA learns a linear boundary between the points belonging to different classes, while QDA learns a quadratic one. If the actual boundaries are indeed linear, QDA's extra flexibility buys you nothing and may give it a higher model variance – especially if the available data isn't dense enough. Another (and easier to imagine) scenario is when you have a very limited set of training points. In such a case, if you divide your already sparse dataset into the constituent classes, the covariance matrix computed for each might be extremely inaccurate. In such cases, it's better to simplify the entire process and use a common covariance matrix that's computed from the entire training dataset.
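
If you want to experiment with both, scikit-learn has ready-made implementations. Here's a quick sketch of such a comparison (the tiny synthetic dataset is made up purely for illustration):

import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.model_selection import cross_val_score

#A tiny 2-class dataset with (roughly) the same covariance in both classes
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(30, 2), rng.randn(30, 2) + [2, 2]])
y = np.array([0] * 30 + [1] * 30)

print(cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=5).mean())
print(cross_val_score(QuadraticDiscriminantAnalysis(), X, y, cv=5).mean())

With this little data and a shared covariance, LDA will usually hold its own against QDA – which is exactly the point above.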

Efficient computation and storage of basic data statistics using Redis

This post describes a script for efficient computation and storage of the mean and variance corresponding to data from multiple sources. It uses Redis as a backend storage system. Though it's written in Python (mainly because I work with Django a lot), it can be translated into any popular language out there.

The basic assumptions are as follows:

1. You have data arriving from multiple sources, each possessing a unique ID (a String) of some sort. For example, the location in case of weather data.

2. You need to store the mean and variance corresponding to non-overlapping time periods, like days (as given in the script).

3. You don't want to/need to store the individual data points, just the averages. Moreover, you don't want to access the permanent data storage (like a file on disk) every time a new data point comes in, but only when the average stats need to be stored. The primary reason for this could be efficient resource usage.

4. You don’t want to store the process parameters as variables in the program, but rather using a better option like Redis. (Though you can tweak the script to do that too.)

Here's the script:


#Imports
from redis import StrictRedis
import datetime

#Redis Client for communicating with Redis
#decode_responses=True makes get() return strings instead of bytes on Python 3
redisdb = StrictRedis(decode_responses=True)


def timestamp():
    """
    Returns a String that denotes a unique time-period.
    The output of this function determines the period over which
    the averaging is done.
    """
    return datetime.datetime.now().strftime("%d-%m-%Y")


def _fetch_state(identifier):
    """
    Fetches all the state data from Redis corresponding to the
    identifier string.
    If any part of it is not present, it returns empty data.
    Returns last date timestamp, the sum, the sum of squares,
    and the counter value (in that order).
    """

    #Get the string data from Redis
    last_timestamp = redisdb.get(identifier + '_last_timestamp')
    sigma = redisdb.get(identifier + '_sigma')
    squares_sigma = redisdb.get(identifier + '_squares_sigma')
    counter = redisdb.get(identifier + '_counter')

    #Check if any of the above is None(not present)
    #If not, parse the numbers
    if None in [last_timestamp, sigma, squares_sigma, counter]:
        #If any one is not available, the others are useless
        #Just reset the values
        last_timestamp = ''
        sigma = 0
        squares_sigma = 0
        counter = 0
    else:
        sigma = float(sigma)
        squares_sigma = float(squares_sigma)
        counter = int(counter)

    return last_timestamp, sigma, squares_sigma, counter


def _store_state(identifier, last_timestamp, sigma, squares_sigma, counter):
    """
    Stores the state data corresponding to the identifier, to Redis.
    """

    redisdb.set(identifier + '_last_timestamp', last_timestamp)
    redisdb.set(identifier + '_sigma', sigma)
    redisdb.set(identifier + '_squares_sigma', squares_sigma)
    redisdb.set(identifier + '_counter', counter)


def _permanent_store(identifier, mean, variance):
    """
    Stores statistical data to some kind of permanent storage
    (A file in this case).
    """
    #Append one line per time period: identifier, mean, variance
    with open(identifier + '.txt', 'a') as f:
        f.write(identifier + ', ' + str(mean) + ', ' + str(variance) + '\n')


def record(identifier, value):
    """
    Records a numeric value corresponding to the identifier.
    """

    #Fetch state data
    last_timestamp, sigma, squares_sigma, counter = (
        _fetch_state(identifier))

    #Compute current timestamp
    current_timestamp = timestamp()

    if last_timestamp != current_timestamp:
        #Either a new time period has started, or
        #this is a new identifier, or
        #previous state data was lost for some reason.
        if counter != 0:
            #A new time period has started,
            #so compute parameters for previous one and store.
            counter = float(counter)
            mean = sigma/counter
            variance = squares_sigma/counter - mean**2

            _permanent_store(identifier, mean, variance)

        #Intialize state based on newly received data
        last_timestamp = current_timestamp
        sigma = value
        squares_sigma = value**2
        counter = 1

    else:
        #Same time period as before, update state
        sigma += value
        squares_sigma += value**2
        counter += 1

    #Store state
    _store_state(identifier, last_timestamp, sigma, squares_sigma, counter)

Some things to note:

1. The script is resistant to restarts (keeping in mind that any data that comes in during the down-time is lost). It’s also resistant to faults on Redis’ behalf, though any state data stored would then be lost. In both cases, the averaging information may be inaccurate, but not unavailable.

2. You can group sources together if you want (like averaging weather data from all sources in a region). You would just have to implement functionality to map each source ID to the relevant group ID.

3. The script computes averages over days, but with a little creativity with the timestamp function, you can compute statistics over any custom time periods (there's a small sketch of this after these notes). In fact, you can modify the source IDs to tell the system what time interval to average over.

4. Though the input in the above code consists of single numbers, you can work with multidimensional data (using NumPy arrays/Pandas) and even other types of information extraction (that can work with streams without requiring historical data).

5. The script stores the data into a text file; you can obviously make it store the data to any other destination.

6. The code is pretty basic. To actually use it as a part of a bigger framework, I would suggest taking measures such as bundling the code into a class, implementing synchronization locks over the state data, etc. All that is up to you and your application, of course.
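
For example, here's what point 3 could look like in practice (just a sketch, not part of the script above – the identifier is made up): averaging per hour instead of per day only needs a different timestamp(), and recording a reading is a single call.

def timestamp():
    """
    Hourly averaging: one unique string per hour instead of per day.
    """
    return datetime.datetime.now().strftime("%d-%m-%Y-%H")

#Record a (made-up) temperature reading for a (made-up) source
record('mumbai_airport_temperature', 31.4)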

Thanks for reading!