$\DeclareMathOperator{\p}{P}$ $\DeclareMathOperator{\P}{P}$ $\DeclareMathOperator{\c}{^C}$ $\DeclareMathOperator{\or}{ or}$ $\DeclareMathOperator{\and}{ and}$ $\DeclareMathOperator{\var}{Var}$ $\DeclareMathOperator{\Var}{Var}$ $\DeclareMathOperator{\Std}{Std}$ $\DeclareMathOperator{\E}{E}$ $\DeclareMathOperator{\std}{Std}$ $\DeclareMathOperator{\Ber}{Bern}$ $\DeclareMathOperator{\Bin}{Bin}$ $\DeclareMathOperator{\Poi}{Poi}$ $\DeclareMathOperator{\Uni}{Uni}$ $\DeclareMathOperator{\Geo}{Geo}$ $\DeclareMathOperator{\NegBin}{NegBin}$ $\DeclareMathOperator{\Beta}{Beta}$ $\DeclareMathOperator{\Exp}{Exp}$ $\DeclareMathOperator{\N}{N}$ $\DeclareMathOperator{\R}{\mathbb{R}}$ $\DeclareMathOperator*{\argmax}{arg\,max}$ $\newcommand{\d}{\, d}$

Bayes' Theorem

Bayes' Theorem is one of the most ubiquitous results in probability for computer scientists. In a nutshell, Bayes' theorem provides a way to convert a conditional probability from one direction, say $\p(E|F)$, to the other direction, $\p(F|E)$.

Bayes' theorem is a mathematical identity which we can derive ourselves. Start with the definition of conditional probability and then expand the $\and$ term using the chain rule: $$ \begin{align} \p(F|E) &= \frac{\p(F \and E)}{\p(E)} && \text{Def of } \href{ ../../part1/cond_prob/}{\text{conditional probability}} \\ &= \frac{\p(E | F) \cdot \p(F)}{\p(E)} && \text{Substitute the } \href{ ../../part1/cond_prob/#chain_rule}{\text{chain rule}} \text{ for $\p(F \and E)$} \end{align} $$

This theorem makes no assumptions about $E$ or $F$, so it applies to any two events. Bayes' theorem is exceptionally useful because it turns out to be the standard way to answer the question: how can I update a belief about something, which is not directly observable, given evidence? This is for good reason. For many "noisy" measurements it is straightforward to estimate the probability of the noisy observation given the true state of the world. However, what you would really like to know is the conditional probability the other way around: what is the probability of the true state of the world given evidence. There are countless real world situations that fit this pattern:

Example 1: Medical tests
What you want to know: Probability of a disease given a test result
What is easy to know: Probability of a test result given the true state of disease
Causality: We believe that disease influences test results

Example 2: Student ability
What you want to know: Student knowledge of a subject given their answers
What is easy to know: Likelihood of answers given a student's knowledge of a subject
Causality: We believe that ability influences answers

Example 3: Cell phone location
What you want to know: Where is a cell phone, given noisy measure of distance to tower
What is easy to know: Error in noisy measure, given the true distance to tower
Causality: We believe that cell phone location influences distance measure

There is a pattern here: in each example we care about knowing some unobservable -- or hard to observe -- state of the world. This state of the world "causes" some easy-to-observe evidence. For example: having the flu (something we would like to know) causes a fever (something we can easily observe), not the other way around. We often call the unobservable state the "belief" and the observable state the "evidence". For that reason let's rename the events! Let's call the unobservable thing we want to know $B$ for belief. Let's call the thing we have evidence of $E$ for evidence. This makes it clear that Bayes' theorem allows us to calculate an updated belief given evidence: $\p(B | E)$.

Definition: Bayes' Theorem

The most common form of Bayes' Theorem is Bayes' Theorem Classic: $$ \p(B|E) = \frac{\p(E | B) \cdot \p(B)}{\p(E)} $$

There are names for the different terms in the Bayes' Rule formula. The term $\p(B|E)$ is often called the "posterior": it is your updated belief of $B$ after you take into account evidence $E$. The term $\p(B)$ is often called the "prior": it was your belief before seeing any evidence. The term $\p(E|B)$ is called the update and $\p(E)$ is often called the normalization constant.

There are several techniques for handling the case where the denominator is not known. One technique is to use the law of total probability to expand out the term, resulting in another formula, called Bayes' Theorem with Law of Total Probability: $$ \p(B|E) = \frac{\p(E | B) \cdot \p(B)}{\p(E|B)\cdot \p(B) + \p(E|B\c) \cdot \p(B\c)} $$

Recall the law of total probability which is responsible for our new denominator: $$ \begin{align} \p(E) = \p(E|B)\cdot \p(B) + \p(E|B\c) \cdot \p(B\c) \end{align} $$
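To make the formula concrete, here is a short Python sketch (the function name and argument names are my own) that computes a posterior using Bayes' Theorem with the Law of Total Probability:

```python
def bayes_posterior(p_e_given_b, p_e_given_not_b, p_b):
    """Return P(B|E) via Bayes' theorem, expanding P(E) with
    the law of total probability."""
    # P(E) = P(E|B)P(B) + P(E|B^c)P(B^c)
    p_e = p_e_given_b * p_b + p_e_given_not_b * (1 - p_b)
    return p_e_given_b * p_b / p_e

# If the evidence is equally likely under B and B^c, the belief is unchanged:
print(bayes_posterior(0.6, 0.6, 0.5))  # → 0.5
```

Note that when $\p(E|B) = \p(E|B\c)$ the evidence carries no information about $B$, so the posterior equals the prior.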

A common scenario for applying the Bayes' Rule formula is when you want to know the probability of something “unobservable” given an “observed” event. For example, you want to know the probability that a student understands a concept, given that you observed them solving a particular problem. It turns out it is much easier to first estimate the probability that a student can solve a problem given that they understand the concept and then to apply Bayes' Theorem. Intuitively, you can think about this as updating a belief given evidence.

Bayes' Theorem Applied

Sometimes the (correct) results from Bayes' Theorem can be counterintuitive. Here we work through a classic result: Bayes' theorem applied to medical tests, and present a visualization for understanding what is happening.

Example: Probability of a disease given a noisy test

In this problem we are going to calculate the probability that a patient has an illness given a test result for the illness. A positive test result means the test thinks the patient has the illness. You know the following information, which is typical for medical tests:

The numbers in this example are from the mammogram test for breast cancer. The seriousness of cancer underscores the potential for Bayesian probability to be applied in important contexts. The natural occurrence of breast cancer is 8%. The mammogram test returns a positive result 95% of the time for patients who have breast cancer. The test returns a positive result 7% of the time for people who do not have breast cancer.


We want to calculate the probability that the patient has the illness given a positive test result.

Let $I$ be the event that the patient has the illness
Let $E$ be the event that the test result is positive
$\p(I|E)$ = probability of the illness given a positive test. This is the number we want to calculate.
$\p(E|I)$ = probability of a positive result given illness = 0.95
$\p(E|I\c)$ = probability of a positive result given no illness = 0.07
$\p(I)$ = natural probability of the illness = 0.08

Bayes' Theorem:
In this problem we know $\p(E|I)$ and $\p(E|I\c)$ but we want to know $\p(I|E)$. We can apply Bayes' Theorem to turn our knowledge of one conditional into knowledge of the reverse.

$$\begin{align}\p(I|E) &= \frac{\p(E|I)P(I)}{\p(E|I)\p(I) + \p(E|I\c)\p(I\c)} && \text{Bayes' Theorem with Total Prob.}\end{align}$$

Now all we need to do is plug values into this formula. The only value we don't explicitly have is $\p(I\c)$. But we can simply calculate it since $\p(I\c) = 1 - \p(I)$. Thus: $$\begin{align}\p(I|E) &= \frac{0.95 \times 0.08}{0.95 \times 0.08 + 0.07 \times 0.92} = \frac{0.076}{0.1404} \approx 0.54\end{align}$$
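We can check this arithmetic with a quick Python sketch (variable names are my own):

```python
p_e_given_i = 0.95      # P(E|I): positive test given illness
p_e_given_not_i = 0.07  # P(E|I^c): positive test given no illness
p_i = 0.08              # P(I): prior probability of the illness

# Bayes' theorem with the law of total probability
p_e = p_e_given_i * p_i + p_e_given_not_i * (1 - p_i)
p_i_given_e = p_e_given_i * p_i / p_e

print(round(p_i_given_e, 3))  # → 0.541
```

Even though the test is fairly accurate, only about 54% of patients with a positive result actually have the illness, because the illness itself is not very common.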

Natural Frequency Intuition

One way to build intuition for Bayes' Theorem is to think about "natural frequencies". Let's take another approach to answering the probability question in the above example on belief of illness given a test. This time, we are going to imagine we have a population of 1000 people, and think about how many of those have the illness and test positive and how many don't have the illness and test positive, using the numbers from the example above.

There are many possibilities for how many people have the illness, but the expected number is 1000, the number of people in our population, multiplied by the probability of the disease:

$1000 \times \p(\text{Illness})$ people have the illness
$1000 \times (1- \p(\text{Illness}))$ people do not have the illness.

We are going to color people who have the illness in blue and those without the illness in pink (those colors do not imply gender!).

A certain number of people with the illness will test positive (which we will draw in Dark Blue) and a certain number of people without the illness will test positive (which we will draw in Dark Pink):

$1000 \times \p(\text{Illness}) \times \p(\text{Positive}|\text{Illness})$ people have the illness and test positive
$1000 \times \p(\text{Illness}\c) \times \p(\text{Positive}|\text{Illness}\c)$ people do not have the illness and test positive.

Here is the whole population of 1000 people:

The number of people who test positive and have the illness is $1000 \times 0.08 \times 0.95 = 76$.
The number of people who test positive and don't have the illness is $1000 \times 0.92 \times 0.07 \approx 64$.
The total number of people who test positive is approximately $76 + 64 = 140$.

Out of the subset of people who test positive, the fraction that have the illness is $76/140 \approx 0.54$, which is a close approximation of the answer. If instead of using 1000 imaginary people we had used more, the approximation would have been even closer to the actual answer (which we calculated using Bayes' Theorem).
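The natural frequency argument can also be checked by simulation. The sketch below (using the illness and test probabilities from the example above) samples a large imaginary population and counts, among people who test positive, the fraction who are actually ill:

```python
import random

random.seed(42)
p_illness = 0.08
p_pos_given_ill = 0.95
p_pos_given_healthy = 0.07

n = 100_000  # more imaginary people → a closer approximation
n_positive = 0
n_positive_and_ill = 0
for _ in range(n):
    ill = random.random() < p_illness
    p_pos = p_pos_given_ill if ill else p_pos_given_healthy
    if random.random() < p_pos:
        n_positive += 1
        n_positive_and_ill += ill

print(n_positive_and_ill / n_positive)  # ≈ 0.54, matching Bayes' theorem
```

As the population grows, the simulated fraction converges to the exact answer given by Bayes' Theorem.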

Bayes with the General Law of Total Probability

A classic challenge when applying Bayes' theorem is to calculate the probability of the normalization constant $\p(E)$ in the denominator of Bayes' Theorem. One common strategy for calculating this probability is to use the law of total probability. Our expanded version of Bayes' Theorem uses the simple version of the law of total probability: $\p(E) = \p(E|F)\p(F) + \p(E|F^c) \p(F^c)$. Sometimes you will want the more expanded version of the law of total probability: $\p(E) = \sum_i\p(E|B_i)\p(B_i)$. Recall that this only works if the events $B_i$ are mutually exclusive and cover the sample space.

For example say we are trying to track a phone which could be in any one of $n$ discrete locations and we have prior beliefs $\p(B_1) \dots \p(B_n)$ as to whether the phone is in location $B_i$. Now we gain some evidence (such as a particular signal strength from a particular cell tower) that we call $E$ and we need to update all of our probabilities to be $\p(B_i |E)$. We should use Bayes' Theorem!

The probability of the observation, assuming that the phone is in location $B_i$, $\p(E|B_i)$, is something that can be given to you by an expert. In this case the probability of getting a particular signal strength given a location $B_i$ will be determined by the distance between the cell tower and location $B_i$.

Since we are assuming that the phone must be in exactly one of the locations, we can find the probability of any of the events $B_i$ given $E$ by first applying Bayes' Theorem and then applying the general version of the law of total probability:

$$ \begin{align} \p(B_i | E) &= \frac{\p(E|B_i) \cdot \p(B_i)}{\p(E)} && \text{Bayes' Theorem. What to do about $\p(E)$?} \\ &= \frac{\p(E|B_i) \cdot \p(B_i)}{\sum_{j=1}^n \p(E|B_j) \cdot \p(B_j)} && \text{Use General Law of Total Probability for $\p(E)$} \\ \end{align} $$
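As a sketch of how this update might look in code (the function name and the example numbers are hypothetical):

```python
def update_beliefs(priors, likelihoods):
    """Given priors P(B_i) and likelihoods P(E|B_i), return the
    posteriors P(B_i|E) using Bayes' theorem with the general
    law of total probability."""
    unnormalized = [p_e_given_b * p_b
                    for p_e_given_b, p_b in zip(likelihoods, priors)]
    p_e = sum(unnormalized)  # P(E) = sum over j of P(E|B_j) * P(B_j)
    return [u / p_e for u in unnormalized]

# Three possible phone locations; the evidence is most likely under B_2
posteriors = update_beliefs(priors=[0.5, 0.3, 0.2],
                            likelihoods=[0.1, 0.7, 0.2])
print(posteriors)  # the posteriors sum to 1, and B_2 is now most probable
```

Notice that the numerators are computed once and the denominator is just their sum, so the posteriors are guaranteed to sum to 1.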

Unknown Normalization Constant

There are times when we would like to use Bayes' Theorem to update a belief, but we don't know the probability of $E$, $\p(E)$. All hope is not lost. This term is called the "normalization constant" because it is the same regardless of whether or not the event $B$ happens. The most traditional solution is to use the law of total probability: $\p(E) = \p(E |B) \p(B) + \p(E|B\c)\p(B\c)$. Here are some other useful "tricks" for dealing with $\p(E)$.

We can make the normalization constant cancel out by calculating the ratio $\frac{\p(B|E)}{\p(B\c|E)}$. This fraction tells you how many times more likely it is that $B$ happens given $E$ than that it doesn't: $$ \begin{align} \frac{\p(B|E)}{\p(B\c|E)} &= \frac{ \frac{\p(E|B)\p(B)}{\p(E)} }{ \frac{\p(E|B\c)\p(B\c)}{\p(E)} } && \text{Apply Bayes' Theorem to both terms} \\ &= \frac{ \p(E|B)\p(B) }{ \p(E|B\c)\p(B\c) } && \text{The term $\p(E)$ cancels} \end{align} $$
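A brief Python sketch of this odds trick (the function name is my own), using the medical-test numbers from earlier:

```python
def posterior_odds(p_e_given_b, p_e_given_not_b, p_b):
    """Return P(B|E) / P(B^c|E); the unknown P(E) cancels out."""
    return (p_e_given_b * p_b) / (p_e_given_not_b * (1 - p_b))

# How many times more likely is illness than no illness,
# given a positive test?
odds = posterior_odds(0.95, 0.07, 0.08)
print(round(odds, 2))  # → 1.18
```

Odds greater than 1 mean $B$ is the more likely explanation of the evidence, even though we never computed $\p(E)$.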

We can always use the fact that, when consistently conditioned on $E$, either $B$ happens or it doesn't: $\p(B |E) + \p(B\c|E) =1$. This lets us compute $\p(E)$. Note that this is simply the first identity of probability, consistently conditioned on $E$: $$ \begin{align} 1 &= \p(B|E) + \p(B\c|E) && \text{Either $B$ occurs or it doesn't} \\ 1 &= \frac{\p(E|B)\p(B)}{\p(E)} + \frac{\p(E|B\c)\p(B\c)}{\p(E)} && \text{Apply Bayes' Theorem to both terms} \\ 1 &= \frac{1}{\p(E)} \cdot \big[\p(E|B)\p(B) + \p(E|B\c)\p(B\c)\big] && \text{Factor out $1/ \p(E)$} \\ \p(E) &= \p(E|B)\p(B) + \p(E|B\c)\p(B\c) && \text{Rearrange terms} \end{align} $$ If you look closely at the last line, you will notice that we have simply found a new way to derive the law of total probability for $E$. The law of total probability is truly a great way of dealing with $\p(E)$.