$\DeclareMathOperator{\p}{P}$ $\DeclareMathOperator{\P}{P}$ $\DeclareMathOperator{\c}{^C}$ $\DeclareMathOperator{\or}{ or}$ $\DeclareMathOperator{\and}{ and}$ $\DeclareMathOperator{\var}{Var}$ $\DeclareMathOperator{\Var}{Var}$ $\DeclareMathOperator{\Std}{Std}$ $\DeclareMathOperator{\E}{E}$ $\DeclareMathOperator{\std}{Std}$ $\DeclareMathOperator{\Ber}{Bern}$ $\DeclareMathOperator{\Bin}{Bin}$ $\DeclareMathOperator{\Poi}{Poi}$ $\DeclareMathOperator{\Uni}{Uni}$ $\DeclareMathOperator{\Geo}{Geo}$ $\DeclareMathOperator{\NegBin}{NegBin}$ $\DeclareMathOperator{\Beta}{Beta}$ $\DeclareMathOperator{\Exp}{Exp}$ $\DeclareMathOperator{\N}{N}$ $\DeclareMathOperator{\R}{\mathbb{R}}$ $\DeclareMathOperator*{\argmax}{arg\,max}$ $\newcommand{\d}{\, d}$

Marginalization


An important insight regarding probabilistic models with many random variables is that "the joint distribution is complete information." From the joint distribution you can answer any probability question involving the random variables in the model. This chapter is an example of that insight.

The central question of this chapter is: Given a joint distribution, how can you compute the probability of random variables on their own?

Marginalization From Two Random Variables

To start, consider two random variables $X$ and $Y$. If you are given the joint distribution, how can you compute $\p(X=x)$? Recall that having the joint means you know the probability $\p(X=x, Y=y)$ for any values $x$ and $y$. We already have a technique for computing $\p(X=x)$ from the joint: the Law of Total Probability (LOTP)! In this case the events $Y = y$ make up the "background events": \begin{align*} \p(X=x) &= \sum_y \p(X=x, Y=y) \end{align*} Note that to apply the LOTP, the events $Y=y$ must be mutually exclusive and exhaustive, meaning $\sum_y \p(Y=y) = 1$. Both conditions hold here.

If we wanted $\p(Y=y)$ we could again use the Law of Total Probability, this time with $X$ taking on each of its possible values as the background events: \begin{align*} P(Y=y) &= \sum_x \p(X=x, Y=y) \\ \end{align*}
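To make these formulas concrete, here is a minimal sketch in Python. The dictionary prob_x_y and its numbers are made up purely for illustration; any valid joint table would work the same way.

```python
# Hypothetical joint distribution P(X = x, Y = y), stored as a dictionary
# mapping (x, y) pairs to probabilities (made-up numbers that sum to 1).
prob_x_y = {
    (0, 0): 0.10, (0, 1): 0.30,
    (1, 0): 0.25, (1, 1): 0.35,
}

def prob_x(x):
    # Marginalize out Y: P(X = x) = sum over all y of P(X = x, Y = y)
    return sum(p for (xi, y), p in prob_x_y.items() if xi == x)

def prob_y(y):
    # Marginalize out X: P(Y = y) = sum over all x of P(X = x, Y = y)
    return sum(p for (x, yi), p in prob_x_y.items() if yi == y)

print(prob_x(0))  # 0.10 + 0.30 = 0.40
print(prob_y(1))  # 0.30 + 0.35 = 0.65
```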

Example: Favorite Number

Consider the following joint distribution for $X$ and $Y$ where $X$ is a person's favorite binary digit and $Y$ is their year at Stanford. Here is a real joint distribution from a past class:

Variable       | Symbol | Type
Favorite Digit | $X$    | Discrete number {0, 1}
Year in School | $Y$    | Categorical {Frosh, Soph, Junior, Senior, 5+}

What is the probability that a student's favorite digit is 0, $\p(X=0)$? We can use the LOTP to compute this probability: \begin{align*} \p(X=0) &= \sum_y \p(X=0, Y=y) \\ &= \p(X=0, Y=\text{Frosh}) \\ &\quad + \p(X=0, Y=\text{Soph}) \\ &\quad + \p(X=0, Y=\text{Junior}) \\ &\quad + \p(X=0, Y=\text{Senior}) \\ &\quad + \p(X=0, Y=\text{5+}) \\ &= 0.01 + 0.05 + 0.04 + 0.03 + 0.02 \\ &= 0.15 \end{align*}
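The same computation can be written as a short Python snippet, using only the $X=0$ row of the joint distribution quoted above (the variable names are ours):

```python
# Joint probabilities P(X = 0, Y = y) for each year, from the example above.
prob_x0_y = {
    'Frosh': 0.01,
    'Soph': 0.05,
    'Junior': 0.04,
    'Senior': 0.03,
    '5+': 0.02,
}

# P(X = 0) = sum over all years y of P(X = 0, Y = y)
prob_x_is_0 = sum(prob_x0_y.values())
print(prob_x_is_0)  # 0.15 (up to floating-point rounding)
```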

Marginalization with More Variables

The idea of marginalization extends to joint distributions with more than two random variables. If we have three random variables $X$, $Y$, and $Z$, we can compute the marginal of any one of them by summing over the other two: \begin{align*} \p(X=x) &= \sum_{y,z} \p(X=x, Y=y, Z=z) \\ \p(Y=y) &= \sum_{x,z} \p(X=x, Y=y, Z=z) \\ \p(Z=z) &= \sum_{x,y} \p(X=x, Y=y, Z=z) \end{align*}

Notation: Double Sum
The double sum notation $$\sum\limits_{y, z}$$ which can also be written equivalently as $$\sum_y \sum_z$$ means that we sum over all possible combinations of values of $y$ and $z$. For example, if $Y$ is a random variable with 3 possible values and $Z$ is a random variable with 4 possible values, then $\sum_{y, z}$ sums over all 12 possible combinations of $y$ and $z$.

Here is an example in code. Assume we have a function joint(x, y, z) which returns $\p(X=x, Y=y, Z=z)$, and that $X$, $Y$, and $Z$ can each take on values in the set $\{0, 1, 2, 3, 4\}$.
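Below is a minimal sketch of that computation. The body of joint is a uniform placeholder that we added only so the snippet runs; in practice it would return the actual joint probabilities.

```python
def joint(x, y, z):
    # Placeholder joint distribution P(X = x, Y = y, Z = z). Uniform over
    # the 5 * 5 * 5 = 125 combinations, purely so the example is runnable.
    return 1 / 125

def marginal_x(x):
    # P(X = x) = sum over all values of y and z of P(X = x, Y = y, Z = z)
    total = 0
    for y in range(5):      # Y takes values in {0, 1, 2, 3, 4}
        for z in range(5):  # Z takes values in {0, 1, 2, 3, 4}
            total += joint(x, y, z)
    return total

print(marginal_x(2))  # ≈ 0.2 with the uniform placeholder joint
```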