An introduction to Bayesian Belief Networks

A Bayesian Belief Network (BBN), or simply Bayesian Network, is a statistical model used to describe the conditional dependencies between different random variables.

BBNs are chiefly used in areas like computational biology and medicine for risk analysis and decision support (basically, to understand what caused a certain problem, or the probabilities of different effects given an action).

Structure of a Bayesian Network

A typical BBN looks something like this:

[Figure: the classic ‘Burglary-Alarm’ Bayesian Network, with a probability table attached to each of its five nodes]

The example shown, ‘Burglary-Alarm’, is one of the most frequently quoted ones in texts on Bayesian theory. Let’s look at the structural characteristics one by one. We will delve into the numbers/tables later.

Directed Acyclic Graph (DAG)

We obviously have one node per random variable.

Directed: The connections/edges denote cause->effect relationships between pairs of nodes. For example, Burglary->Alarm in the above network indicates that the occurrence of a burglary directly affects the probability of the Alarm going off (and not the other way round). Here, Burglary is the parent, while Alarm is the child node.

Acyclic: There cannot be a cycle in a BBN. In simple English, a variable A cannot depend on its own value – directly or indirectly. If this were allowed, it would lead to a sort of infinite recursion which we are not prepared to deal with. However, if you do realize that an event happening affects its own probability later on, then you could express the two occurrences as separate nodes in the BBN (or use a Dynamic BBN).
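To make ‘directed’ and ‘acyclic’ concrete, here is a minimal Python sketch of my own (not part of the classic example’s presentation) that stores the Burglary-Alarm network as a map from each node to its parents, and checks that the graph contains no cycles:

# A sketch: the Burglary-Alarm DAG as a parent map (node names follow the figure)
network = {
    "Burglary":   [],                         # no parents
    "Earthquake": [],
    "Alarm":      ["Burglary", "Earthquake"],
    "JohnCalls":  ["Alarm"],
    "MaryCalls":  ["Alarm"],
}

def is_acyclic(parents):
    # Depth-first search for a back edge; returns True if the graph is a DAG
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {node: WHITE for node in parents}

    def visit(node):
        color[node] = GRAY
        for parent in parents[node]:
            if color[parent] == GRAY:    # back edge => cycle
                return False
            if color[parent] == WHITE and not visit(parent):
                return False
        color[node] = BLACK
        return True

    return all(visit(n) for n in parents if color[n] == WHITE)

print(is_acyclic(network))  # True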

Parents of a Node

One of the biggest considerations while building a BBN is to decide which parents to assign to a particular node. Intuitively, they should be those variables which most directly affect the value of the current node.

Formally, this can be stated as follows: “The parents of a variable X (parents(X)) are the minimal set of ancestors of X, such that all other ancestors of X are conditionally independent of X given parents(X)”.

Let’s take this step by step. First off, there has to be some sort of cause-effect relationship between Y and X for Y to be one of the ancestors of X. In the shown example, the ancestors of Mary Calls are Burglary, Earthquake and Alarm.

Now consider the two ancestors Alarm and Earthquake. The only way an Earthquake would affect Mary Calls is if an Earthquake causes Alarm to go off, leading to Mary Calls. Suppose someone told you that Alarm has in fact gone off. In this case, it does not matter what led to the Alarm ringing – since Mary will react to it based on the stimulus of the Alarm itself. In other words, Earthquake and Mary Calls become conditionally independent if you know the exact value of Alarm.

Mathematically speaking, P(Mary Calls|Alarm, Earthquake) = P(Mary Calls|Alarm).

Thus, parents(X) are those ancestors which do not become conditionally independent of X given the value of some other ancestor. If they do, then the resultant connection would actually be redundant.

Disconnected Nodes are Conditionally Independent

Based on the directed connections in a BBN, if there is no directed path from a variable X to Y (or vice versa), then X and Y are conditionally independent. In the example BBN, pairs of variables that are conditionally independent are {Mary Calls, John Calls} and {Burglary, Earthquake}.

It is important to remember that ‘conditionally independent’ does not mean ‘totally independent’. Consider {Mary Calls, John Calls}. Given the value of Alarm (that is, whether the Alarm went off or not), Mary and John each have their own independent probabilities of calling. However, if you did not know about any of the other nodes, but just that John did call, then your expectation of Mary calling would correspondingly increase.

Mathematics behind Bayesian Networks

BBNs provide a mathematically correct way of assessing the effects of different events (or nodes, in this context) on each other. And these assessments can be made in either direction – not only can you compute the most likely effects given the values of certain causes, but also determine the most likely causes of observed events.

The numerical data provided with the BBN (by an expert or some statistical study) that allows us to do this is:

  1. The prior probabilities of variables with no parents (Earthquake and Burglary in our example).
  2. The conditional probabilities of any other node given every value-combination of its parent(s). For example, the table next to Alarm defines the probability that the Alarm will go off given whether an Earthquake and/or a Burglary has occurred.

In the case of continuous variables, we would need a conditional probability distribution instead.
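To make this concrete, here is one way you could store these numbers in Python. Only P(E) = 0.002, P(A|\sim B, E) = 0.29 and P(A|\sim B, \sim E) = 0.001 are actually used later in this post; the value of P(B) and the Burglary rows of the Alarm table are the ones usually quoted with this textbook example, so read them as assumptions:

# Priors for the parentless nodes
p_burglary   = 0.001   # assumed (standard value quoted for this example)
p_earthquake = 0.002   # P(E), used in equation (3) below

# P(Alarm | Burglary, Earthquake), keyed on (burglary, earthquake)
p_alarm = {
    (True,  True):  0.95,    # assumed (standard value quoted for this example)
    (True,  False): 0.94,    # assumed (standard value quoted for this example)
    (False, True):  0.29,    # P(A | ~B, E), used in equation (4)
    (False, False): 0.001,   # P(A | ~B, ~E), used in equation (6)
}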

The biggest use of Bayesian Networks is in computing revised probabilities. A revised probability defines the probability of a node given the values of one or more other nodes as a fact. Let’s take an example from the Burglary-Alarm BBN.

Suppose we want to calculate the probability that an earthquake occurred, given that the alarm went off, but there was no burglary. Essentially, we want P(Earthquake|Alarm,\sim Burglary). Simplifying the nomenclature a bit, P(E|A,\sim B).

Here, you can say that the Alarm going off (A) is evidence, the knowledge that the Burglary did not happen (\sim B) is context, and the Earthquake occurring (E) is the hypothesis. A priori, if you knew nothing else, P(E) = 0.002, from the diagram. However, with the context and evidence in mind, this probability gets changed/revised. Hence, it’s called ‘computing revised probabilities’.

A version of Bayes Theorem states that

P(X|YZ) = \frac{P(X|Z)P(Y|XZ)}{P(Y|Z)} …(1)

where X is the hypothesis, Y is the evidence, and Z is the context. The numerator on the RHS denotes the probability that X and Y both occur given Z; this is the probability of a sub-event of ‘Y occurs given Z, irrespective of X’, so the numerator can never exceed the denominator.

Using (1), we get

P(E|A, \sim B) = \frac{P(E|\sim B)P(A|\sim B, E)}{P(A|\sim B)} …(2)

Since E and B are independent phenomena without knowledge of A,

P(E|\sim B) = P(E) = 0.002 …(3)

From the table for A,

P(A|\sim B, E) = 0.29 …(4)

Finally, using the Total Probability Theorem,

P(A| \sim B) = P(E) P(A| E, \sim B) + P(\sim E) P(A| \sim E, \sim B) …(5)

This is nothing but the average of P(A| E, \sim B) and P(A| \sim E, \sim B), weighted by P(E) and P(\sim E) respectively.

Substituting values in (5),

P(A| \sim B) = 0.002 * 0.29 + 0.998 * 0.001 = 0.001578  …(6)

From (2), (3), (4), & (6), we get

P(E|A, \sim B) \approx 0.367
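If you want to sanity-check the arithmetic, here is a short Python version of equations (2) through (6):

# Numeric check of equations (2)-(6)
p_e = 0.002                    # P(E), equation (3)
p_a_given_e_not_b = 0.29       # P(A | E, ~B), equation (4)
p_a_given_not_e_not_b = 0.001  # P(A | ~E, ~B)

# Total Probability Theorem, equations (5)-(6)
p_a_given_not_b = p_e * p_a_given_e_not_b + (1 - p_e) * p_a_given_not_e_not_b
print(p_a_given_not_b)                            # 0.001578

# Bayes' Theorem, equation (2)
print(p_e * p_a_given_e_not_b / p_a_given_not_b)  # 0.3675..., the ~0.367 above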

As you can see, the probability of the Earthquake actually increases if you know that the Alarm went off but a Burglary was not the cause of it. This should make sense intuitively as well. Which brings us to the final part –

The ‘Explain Away’ Effect

The Explain Away effect, commonly associated with BBNs, is a result of computing revised probabilities. It refers to the phenomenon where knowing that one cause has occurred reduces (but does not eliminate) the probability that the other cause(s) took place.

Suppose that instead of knowing there had been no burglary, as in our example, you in fact knew that one had taken place. It also led to the Alarm going off. With this information in mind, your tendency to check out the ‘earthquake’ hypothesis reduces drastically. In other words, the burglary has explained away the alarm.

It is important to note that the probability of the other causes just gets reduced, but does NOT go down to zero. In a stroke of bad luck, it could have happened that both a burglary and an earthquake took place, and either of the two stimuli could have led to the alarm ringing. The extent to which you can ‘explain away’ a piece of evidence depends on the conditional probability distributions.
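As a rough numeric illustration, we can redo the earlier calculation with the Burglary known to have occurred. The post only uses the \sim B column of the Alarm table, so the 0.95 and 0.94 below (the values usually quoted with this example) are assumptions:

p_e = 0.002                # P(E)
p_a_given_e_b = 0.95       # assumed P(A | E, B)
p_a_given_not_e_b = 0.94   # assumed P(A | ~E, B)

# P(A | B) via the Total Probability Theorem
p_a_given_b = p_e * p_a_given_e_b + (1 - p_e) * p_a_given_not_e_b

# P(E | A, B): the burglary explains the alarm away almost entirely
print(p_e * p_a_given_e_b / p_a_given_b)  # ~0.002 -- far below 0.367, but not 0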

The basics of Likelihood

Anyone who has done a course in statistics or data science must have come across the term ‘likelihood’. In non-technical language, likelihood is synonymous with probability. But ask any mathematician, and their interpretation of the two concepts is significantly different. I went digging into likelihood for a bit this week, so I thought of putting down the basics of what I revisited.

Whenever we talk about understanding data, we talk about models. In statistics, a model is usually some sort of parameterized function – like a probability density function (pdf) or a regression model. But the effectiveness of the model’s outputs will only be as good as its fit to the data. If the characteristics of the data are very different from the assumed model, the bias turns out to be pretty high. So from a probability perspective, how do we quantify this fit?

Defining the Likelihood Function

The two entities of interest are – the data X, and the parameters \theta. Now consider a function F(X, \theta), that returns a number proportional to the degree of ‘fit’ between the two – essentially quantifying their relationship with each other.

There are two practical ways you could deal with this function. If you kept \theta constant and varied the data X being analyzed, you would be getting a function F_1(X) whose only argument is the data. The output would basically be a measure of how well your input data satisfies the assumptions made by the model.

But in real life, you rarely know your model with certainty. What you do have, is a bunch of observed data. So shouldn’t we also think of the other way round? Suppose you kept X constant, but tried varying \theta instead. Now, what you have got is a function F_2(\theta), that computes how well different sets of parameters describe your data (which you have for real).

Mind you, in both cases, the underlying mathematical definition is the same. The input ‘variable’ is what has changed. This is how probability and likelihood are related. The function F_1 is what we call the probability function (or pdf for the continuous case), and F_2 is called the likelihood function. While F_1 assumes you know your model and tries to analyze data according to it, F_2 keeps the data in perspective while figuring out how well different sets of parameters describe it.

The above definition might make you think that the likelihood is nothing but a rewording of probability. But keeping the data constant and varying the parameters has huge consequences for the way you interpret the resultant function.

Let’s take a simple example. Suppose you have a set C_n of n different coin tosses, r of which were Heads while the others were Tails. Let’s say the coin used for tossing was biased, and the probability of Heads coming up on it is p. In this case,

F(C_n, p) = {n \choose r} p^r (1-p)^{(n - r)}

Now suppose you made the coin yourself, so you know p = 0.7. In that case,

F_1(C_n) = {n \choose r} 0.7^r 0.3^{(n - r)}

On the other hand, let’s say you don’t know much about the coin, but you do have a bunch of toss-outcomes from it. You made 10 different tosses, out of which 5 were Heads. From this data, you want to measure how likely it is that your guess of p is correct. Then,

F_2(p) = 252 p^5 (1-p)^5
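Here is a quick sketch of this function in Python, evaluated at a few candidate values of p:

from math import comb

def likelihood(p, n=10, r=5):
    # F_2(p): the data (n tosses, r Heads) is held fixed while p varies
    return comb(n, r) * p**r * (1 - p)**(n - r)

for p in (0.3, 0.5, 0.7):
    print(p, likelihood(p))
# 0.3 0.1029...
# 0.5 0.2460...
# 0.7 0.1029...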

There is a very, very important distinction between probability and likelihood functions – the value of the probability function sums (or integrates, for continuous data) to 1 over all possible values of the input data. However, the value of the likelihood function does not integrate to 1 over all possible combinations of the parameters.

The above statement leads to the second important thing to note: DO NOT interpret the value of a likelihood function as the probability of the model parameters. If your probability function gave the value of 0.7 (say) for a discrete data point, you could be pretty sure that there would be no other option as likely as it. This is because the sum of the probabilities of all other points would be equal to 0.3. However, if you got 0.99 as the output of your likelihood function, it wouldn’t necessarily mean that the parameters you gave it are the most likely ones. Some other set of parameters might give 0.999, 0.9999 or something higher.

The only thing you can be sure of is this: if F_2(\theta_1) > F_2(\theta_2), then it is more likely that \theta_1 denotes the parameters of the underlying model.
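You can verify the non-normalization claim numerically. For the 10-toss example above, the area under F_2(p) over p \in [0, 1] works out to 1/11, nowhere near 1:

from math import comb

# Riemann-sum check: F_2 from the 10-toss example does not integrate to 1
steps = 100_000
area = sum(comb(10, 5) * (i / steps)**5 * (1 - i / steps)**5
           for i in range(steps)) / steps
print(area)  # ~0.0909 (exactly 1/11), so F_2 is not a distribution over p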

Log Likelihoods for Maximum Likelihood Estimation

The likelihood function is usually denoted as L(\theta | x) (likelihood L of the parameters \theta given the data point x), so we will stick with it from now on. The most common use of likelihood, is to figure out that set of parameters which yields the highest value for it (and thus describes your dataset the best). This method is called Maximum Likelihood Estimation. You maximize the value of the likelihood function in a bid to find the optimal parameters for your model. This trick is applied in many areas of data science, such as logistic regression.

Maximum Likelihood Estimation usually involves computing the partial derivative of the likelihood function with respect to the parameters. We generally deal with the log-likelihood (the logarithm of the likelihood function) rather than the likelihood itself. Since log is a monotonically increasing function, the likelihood and the log-likelihood attain their maximum at the same parameter values, so we can work with derivatives of the log-likelihood instead. The reason we use logarithms is to make the process of dealing with derivatives easy: products of terms become sums, which are much friendlier to differentiate. Consider the coin-toss example I gave above:

Your likelihood function for the probability of Heads, given n and r, was:

L(p | r) = {n \choose r} p^r (1-p)^{(n - r)}

Taking the log, we get

log(L(p | r)) = log ({n \choose r}) + r log(p) + (n - r) log(1 - p)

To maximize, we will compute the partial derivative with respect to p, and equate it to zero.

Using \frac{d(log x)}{dx} = \frac{1}{x} we get,

\frac{r}{p} = \frac{n-r}{1-p}

Solving, we get the intuitive result:

p = \frac{r}{n}
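As a quick numerical cross-check of this result, a simple grid search (a sketch, in place of the calculus above) over the log-likelihood also lands on p = r/n = 0.5 for the 10-toss, 5-Heads data:

from math import comb, log

def log_likelihood(p, n=10, r=5):
    return log(comb(n, r)) + r * log(p) + (n - r) * log(1 - p)

# Evaluate on a grid over (0, 1); the maximum sits at p = r/n
candidates = [i / 1000 for i in range(1, 1000)]
print(max(candidates, key=log_likelihood))  # 0.5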

In most cases, when you compute likelihood, you would be dealing with a bunch of independent data points x_1, x_2, ..., x_n, rather than a single one. The likelihood of \theta with respect to the dataset X then gets defined as follows:

L(\theta | X) = L(\theta | x_1) L(\theta | x_2) L(\theta | x_3) ... L(\theta | x_n)

Taking the log, the overall log-likelihood becomes:

log(L(\theta | X)) = \sum_x{log(L(\theta | x))}
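To close, here is what that sum looks like in Python for a made-up sequence of Bernoulli coin tosses (the data below is hypothetical, purely to illustrate the product-to-sum identity):

from math import log

def log_likelihood(p, tosses):
    # Independent data points: the product of per-toss likelihoods
    # becomes a sum of per-toss log-likelihoods
    return sum(log(p) if heads else log(1 - p) for heads in tosses)

tosses = [True, False, True, True, False, True, False, False, True, False]
print(log_likelihood(0.5, tosses))  # 10 * log(0.5) = -6.93...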