# The basics of Likelihood

Anyone who has done some course in statistics or data science, must have come across the term ‘likelihood’. In non-technical language, likelihood is synonymous with probability. But ask any mathematician, and their interpretation of the two concepts is significantly different. I went digging into likelihood for a bit this week, so thought of putting down the basics of what I revisited.

Whenever we talk about understanding data, we talk about models. In statistics, a model is usually some sort of parameterized function – like a probability density function (pdf) or a regression model. But the effectiveness of the model’s outputs will only be as good as its fit to the data. If the characteristics of the data are very different from the assumed model, the bias turns out to be pretty high. So from a probability perspective, how do we quantify this fit?

### Defining the Likelihood Function

The two entities of interest are – the data $X$, and the parameters $\theta$. Now consider a function $F(X, \theta)$, that returns a number proportional to the degree of ‘fit’ between the two – essentially quantifying their relationship with each other.

There are two practical ways you could deal with this function. If you kept $\theta$ constant and varied the data $X$ being analyzed, you would be getting a function $F_1(X)$ whose only argument is the data. The output would basically be a measure of how well your input data satisfies the assumptions made by the model.

But in real life, you rarely know your model with certainty. What you do have, is a bunch of observed data. So shouldn’t we also think of the other way round? Suppose you kept $X$ constant, but tried varying $\theta$ instead. Now, what you have got is a function $F_2(\theta)$, that computes how well different sets of parameters describe your data (which you have for real).

Mind you, in both cases, the underlying mathematical definition is the same. The input ‘variable’ is what has changed. This is how probability and likelihood are related. The function $F_1$ is what we call the probability function (or pdf for the continuous case), and $F_2$ is called the likelihood function. While $F_1$ assumes you know your model and tries to analyze data according to it, $F_2$ keeps the data in perspective while figuring out how well different sets of parameters describe it.

The above definition might make you think that the likelihood is nothing but a rewording of probability. But keeping the data constant, and varying the parameters has huge consequences on the way you interpret the resultant function.

Lets take a simple example. Consider you have a set $C_n$ of $n$ different coin tosses, where $r$ out of them were $Heads$, while the others were $Tails$. Lets say that the coin used for tossing was biased, and the probability of a $Heads$ coming up on it is $p$. In this case,

$F(C_n, p) = {n \choose r} p^r (1-p)^{(n - r)}$

Now suppose you made coin yourself, so you know $p = 0.7$. In that case,

$F_1(C_n) = {n \choose r} 0.7^r 0.3^{(n - r)}$

On the other hand, lets say you don’t know much about the coin, but you do have a bunch of toss-outcomes from it. You made 10 different tosses, out which 5 were $Heads$. From this data, you want to measure how likely it is that your guess of $p$ is correct. Then,

$F_2(p) = 252 p^5 (1-p)^5$

There is a very, very important distinction between probability and likelihood functions – the value of the probability function sums (or integrates, for continuous data) to 1 over all possible values of the input data. However, the value of the likelihood function does not integrate to 1 over all possible combinations of the parameters.

The above statement leads to the second important thing to note: DO NOT interpret the value of a likelihood function, as the probability of the model parameters. If your probability function gave the value of 0.7 (say) for a discrete data point, you could be pretty sure that there would be no other option as likely as it. This is because, the sum of the probabilities of all other point would be equal to 0.3. However, if you got 0.99 as the output of your likelihood function, it wouldn’t necessarily mean that the parameters you gave in are the most likely ones. Some other set of parameters might give 0.999, 0.9999 or something higher.

The only thing you can be sure of, is this: If $F_2(\theta_1) >F_2(\theta_2)$, then it is more likely that $\theta_1$ denote the parameters of the underlying model.

### Log Likelihoods for Maximum Likelihood Estimation

The likelihood function is usually denoted as $L(\theta | x)$ (likelihood $L$ of the parameters $\theta$ given the data point $x$), so we will stick with it from now on. The most common use of likelihood, is to figure out that set of parameters which yields the highest value for it (and thus describes your dataset the best). This method is called Maximum Likelihood Estimation. You maximize the value of the likelihood function in a bid to find the optimal parameters for your model. This trick is applied in many areas of data science, such as logistic regression.

Maximum Likelihood Estimation usually involves computing the partial derivative of the likelihood function with respect to the parameters. We generally deal with the log-likelihood (basically the logarithm of the likelihood function) rather than likelihood itself. Since log is a monotonically increasing function, the optimum value of the likelihood function can be calculated using derivatives of log-likelihood as well. The reason we use logarithms, is to make the process of dealing with derivatives easy. Consider the coin-toss example I gave above:

Your Likelihood function for the probability of $Heads$, given the $n$ and $r$, was:

$L(p | r) = {n \choose r} p^r (1-p)^{(n - r)}$

Computing log, we get

$log(L(p | r)) = log ({n \choose r}) + r log(p) + (n - r) log(1 - p)$

To maximise, we will compute the partial derivative with respect to $p$, and equate to zero.

Using $\frac{d(log x)}{dx} = \frac{1}{x}$ we get,

$\frac{r}{p} = \frac{n-r}{1-p}$

Solving, we get the intuitive result:

$p = \frac{r}{n}$

In most cases, when you compute likelihood, you would be dealing with a bunch of independent data points $x_1, x_2, ..., x_n$, rather than a single one. The likelihood of $\theta$ with respect to the data-set $X$ then gets defined as follows:

$L(\theta | X) = L(\theta | x_1) L(\theta | x_2) L(\theta | x_3) ... L(\theta | x_n)$

Using log, the overall likelihood becomes:

$log(L(\theta | X)) = \sum_x{log(L(\theta | x))}$

# Triangular numbers and infinite primes

Last week I started reading From Mathematics to Generic Programming by Alexander Stepanov. I had been searching a long time for books that talk about programming as a science in the context of mathematics, and this book is an engaging read to that end! Its not an algorithms cookbook or an Algorithms Analysis text (for that, something like CLRS would be far far better). Instead, this book takes you on a trip of the history of arithmetic, starting from Pythagorean times. You are introduced to many of the elegant algorithms (mostly number-theory related, from what I see after reading the first two chapters) and their software implementations. What I love about this book is the way it explains the thought process involved in ensuring efficient generic implementations of even the simplest of algorithms. For example, the first chapter itself takes you through 5-7 different implementations of the Sieve of Eratosthenes, each one improving some particular aspect of the preceding one.

There is also a heavy focus on thinking of programming like you think while solving math problems, and this can be seen in the way they compare algebra to generic programming itself. Whenever you write an algebraic expression such as $2x + 1$, you have essentially come up with an ‘algorithm’ (program to convert an input to an output) that does not care about the exact nature of $x$ (not caring about the exact subclass), as long as its a number that can be interpreted by your system (following the interfaces of a base class).

Here are two ‘thoughts’ from ancient math, I came across for the first time while reading this book:

1) Pythagorean ‘Triangular’ Numbers

Pythagorean mathematics was well known for interpreting and thinking of numbers from a geometric perspective. This particular derivation shows the intelligence behind it.

Pythagoreans had this concept of Triangular numbers. The nth Triangular number, $T_n$, is defined as the total obtained from stacking the first n natural numbers in the form of a triangle. For example, $T_4$ could be shown geometrically as

X
XX
XXX
XXXX

The actual ‘number’ is essentially the total number of X’s in the figure above. In other words, $T_n$ is the sum of the first n natural numbers.

We now come to Oblong numbers. The nth Oblong number ($O_n$) is what you would get after stacking $T_n$ onto itself. For n=4, it would be something like this-

XOOOO
XXOOO
XXXOO
XXXXO

In essence,

$O_n = 2T_n$ …(1)

Now look at the figure for $O_4$. You will realise that its a rectangle with dimensions 4 and 5. Generalising, we know that nth Oblong number would correspond to a rectangle of dimensions n and n+1. $O_n$ would be equal to the area of this rectangle, and hence you can say

$O_n = n(n+1)$ …(2)

From equations 1 and 2, you could derive the result you know from high school math:

$2T_n = n(n+1)$

Therefore,

$T_n = \frac{n(n+1)}{2}$

which defines the expression for calculating the sum of the first n natural numbers.

A nice way to look at elementary arithmetic isn’t it?

2) Euclid’s Proof for Infinite Primes

We all know there are an infinite number of primes out there. People who have taken a course on Cryptography in college would also know how important they are from a computer-science and security perspective. But how many of us, not mathematics majors, can actually give a proof for the existence of infinite number of primes? I couldn’t. Atleast not until I read this book anyways. The proof for this, was given back millennia ago by Euclid. Heres how it goes:

Assume you have collected as many primes as you know, in a set:

$\{p_1, p_2, ..., p_n\}$

You claim that there are no more primes in the world.

To disprove your claim, I generate the following number:

$N = \prod_{i = 1}^{n} p_i + 1$

$N$ is 1 added to the product of all the primes in the set.

Now there are two cases for $N$

i. Its prime – In which case, I need not say more.

ii. Its composite

Basic number theory tells us, that given a natural number greater than 1 – say $p$, and another natural number $q$, the number $(pq + 1)$ will NOT be divisible by p. Cause if it was, it would mean ‘1’ is divisible by $p$, which is not possible.

Therefore $N$ is a non-prime number that is NOT divisible by any of the primes you came up with. And since every composite number has to have a prime factor, its proven that there exists a prime NOT included in the set you so proudly came up with 😛 . I love the simplicity and (yet) brilliance of this proof.

If you love mathematics and want to read a book that makes you think of basic math as much as coding, I would definitely recommend this one. Apart from these brilliant results, it could well improve the way you optimise the software you write- wisdom that a book on Software Engineering won’t give you.