$\DeclareMathOperator{\p}{P}$ $\DeclareMathOperator{\P}{P}$ $\DeclareMathOperator{\c}{^C}$ $\DeclareMathOperator{\or}{ or}$ $\DeclareMathOperator{\and}{ and}$ $\DeclareMathOperator{\var}{Var}$ $\DeclareMathOperator{\Var}{Var}$ $\DeclareMathOperator{\Std}{Std}$ $\DeclareMathOperator{\E}{E}$ $\DeclareMathOperator{\std}{Std}$ $\DeclareMathOperator{\Ber}{Bern}$ $\DeclareMathOperator{\Bin}{Bin}$ $\DeclareMathOperator{\Poi}{Poi}$ $\DeclareMathOperator{\Uni}{Uni}$ $\DeclareMathOperator{\Geo}{Geo}$ $\DeclareMathOperator{\NegBin}{NegBin}$ $\DeclareMathOperator{\Beta}{Beta}$ $\DeclareMathOperator{\Exp}{Exp}$ $\DeclareMathOperator{\N}{N}$ $\DeclareMathOperator{\R}{\mathbb{R}}$ $\DeclareMathOperator*{\argmax}{arg\,max}$ $\newcommand{\d}{\, d}$

Maximum A Posteriori


MLE is great, but it is not the only way to estimate parameters! This section introduces an alternate algorithm, Maximum A Posteriori (MAP).The paradigm of MAP is that we should chose the value for our parameters that is the most likely given the data. At first blush this might seem the same as MLE, however notice that MLE chooses the value of parameters that makes the \emph{data} most likely. Formally, for IID random variables $X_1, \dots, X_n$: \begin{align*} \theta_{\text{MAP}} =& \underset{\theta}{\operatorname{argmax }} \text{ } f(\Theta = \theta | X_1=x_1, X_2=x_2, \dots X_n=x_n) \end{align*} In the equation above we trying to calculate the conditional probability of unobserved random variables given observed random variables. When that is the case, think Bayes Theorem! Expand the function $f$ using the continuous version of Bayes Theorem. \begin{align*} \theta_{\text{MAP}} =& \underset{\theta}{\operatorname{argmax }} \text{ } f(\Theta = \theta | X_1=x_1, X_2=x_2, \dots X_n=x_n) \\ =& \underset{\theta}{\operatorname{argmax }} \text{ } \frac{f(X_1=x_1, X_2=x_2, \dots, X_n=x_n |\Theta = \theta) f(\Theta = \theta)}{f(X_1=x_1, X_2=x_2, \dots X_n=x_n)} && \text{Bayes} \end{align*} Note that $f$ are all probability densities or probability mass functions. Now we are going to leverage two observations. First, the data is assumed to be IID so we can decompose the density of the data given $\theta$. Second, the denominator is a constant with respect to $\theta$. As such its value does not affect the argmax and we can drop that term. Mathematically: \begin{align*} \theta_{\text{MAP}} =& \underset{\theta}{\operatorname{argmax }} \text{ } \frac{f(\Theta = \theta) \cdot \prod_{i=1}^n f(X_i =x_i| \Theta = \theta) }{f(X_1 = x_1, X_2 = x_2, \dots X_n = x_n)} && \text{Since the samples are IID}\\ =& \underset{\theta}{\operatorname{argmax }} \text{ } f(\Theta = \theta) \cdot\prod_{i=1}^n f(X_i =x_i |\Theta = \theta) && \text{Denom. is constant with respect to $\theta$} \end{align*} As before, it will be more convenient to find the argmax of the log of the MAP function, which gives us the final form for MAP estimation of parameters. \begin{align*} \theta_{\text{MAP}} =& \underset{\theta}{\operatorname{argmax }} \text{ } \left( \log f(\Theta = \theta) + \sum_{i=1}^n \log(f(X_i =x_i|\Theta = \theta)) \right) \end{align*} Using Bayesian terminology, the MAP estimate is the mode of the "posterior" distribution for $\theta$. If you look at this equation side by side with the MLE equation you will notice that MAP is the argmax of the exact same function \emph{plus} a term for the log of the prior.

Parameter Priors

In order to get ready for the world of MAP estimation, we are going to need to brush up on our distributions. We will need reasonable distributions for each of our different parameters. For example, if you are predicting a Poisson distribution, what is the right random variable type for the prior of $\lambda$?

A desiderata for prior distributions is that the resulting posterior distribution has the same functional form. We call these "conjugate" priors. In the case where you are updating your belief many times, conjugate priors makes programming in the math equations much easier.

Here is a list of different parameters and the distributions most often used for their priors: \begin{align*} &\text{Parameter }&& \text{Distribution}\\ &\text{Bernoulli } p && \text{Beta}\\ &\text{Binomial } p && \text{Beta}\\ &\text{Poisson } \lambda && \text{Gamma}\\ &\text{Exponential } \lambda && \text{Gamma}\\ &\text{Multinomial } p_i && \text{Dirichlet}\\ &\text{Normal } \mu && \text{Normal}\\ &\text{Normal } \sigma^2 && \text{Inverse Gamma} \end{align*} You are only expected to know the new distributions on a high level. You do not need to know Inverse Gamma. I included it for completeness.

The distributions used to represent your "prior" belief about a random variable will often have their own parameters. For example, a Beta distribution is defined using two parameters $(a, b)$. Do we have to use parameter estimation to evaluate $a$ and $b$ too? No. Those parameters are called "hyperparameters". That is a term we reserve for parameters in our model that we fix before running parameter estimate. Before you run MAP you decide on the values of $(a, b)$.

Dirichlet

The Dirichlet distribution generalizes Beta in same way Multinomial generalizes Bernoulli. A random variable $X$ that is Dirichlet is parametrized as $X \sim \text{Dirichlet}(a_1, a_2, \dots, a_m)$. The PDF of the distribution is: \begin{align*} f(X_1 = x_1, X_2 = x_2, \dots, X_m = x_m) = K \prod_{i=1}^m x_i^{a_i - 1} \end{align*} Where $K$ is a normalizing constant.

You can intuitively understand the hyperparameters of a Dirichlet distribution: imagine you have seen $\sum_{i=1}^m a_i - m$ imaginary trials. In those trials you had $(a_i - 1)$ outcomes of value $i$. As an example consider estimating the probability of getting different numbers on a six-sided Skewed Dice (where each side is a different shape). We will estimate the probabilities of rolling each side of this dice by repeatedly rolling the dice $n$ times. This will produce $n$ IID samples. For the MAP paradigm, we are going to need a prior on our belief of each of the parameters $p_1 \dots p_6$. We want to express that we lightly believe that each roll is equally likely.

Before you roll, let's imagine you had rolled the dice six times and had gotten one of each possible values. Thus the "prior" distribution would be Dirichlet($2, 2, 2, 2, 2, 2$). After observing $n_1 + n_2 + \dots + n_6$ new trials with $n_i$ results of outcome $i$, the "posterior" distribution is Dirichlet($2 + n_1, \dots 2 + n_6$). Using a prior which represents one imagined observation of each outcome is called "Laplace smoothing" and it guarantees that none of your probabilities are 0 or 1.

Gamma

The Gamma($k, \theta$) distribution is the conjugate prior for the $\lambda$ parameter of the Poisson distribution (It is also the conjugate for Exponential, but we won't delve into that).

The hyperparameters can be interpreted as: you saw $k$ total imaginary events during $\theta$ imaginary time periods. After observing $n$ events during the next $t$ time periods the posterior distribution is Gamma($k + n, \theta + t$).

For example Gamma(10, 5) would represent having seen 10 imaginary events in 5 time periods. It is like imagining a rate of 2 with some degree of confidence. If we start with that Gamma as a prior and then see 11 events in the next 2 time periods our posterior is Gamma(21,7) which is equivalent to an updated rate of 3.