Gaussian Mixtures
Likelihood:
Log Likelihood:
Best Seen:
What is a Gaussian Mixture?
A Gaussian Mixture describes a random variable whose PDF could come from one of two Gaussians (or more, but we will just use two in this demo). There is a certain probability the sample will come from the first gaussian, otherwise it comes from the second. It has five parameters: 4 to describe the two gaussians and one to describe the relative weighting of the two gaussians.
Generative Code
from scipy import stats
def sample():
# choose group membership
membership = stats.bernoulli.rvs(0.2)
if membership == 1:
# sample from gaussian 1
return stats.norm.rvs(3.5,0.7)
else:
# sample from gaussian 2
return stats.norm.rvs(6.8,0.7)
Probability Density Function
\begin{align*} f(X=x) = t \cdot f(A=x) + (1-t) \cdot f(B=x) \end{align*} $$\text{st}$$ $$A \sim N(\mu_a, \sigma_a^2)$$ $$B \sim N(\mu_b, \sigma_b^2)$$Putting it all together, the PDF of a Gaussian Mixture is:
\begin{align*} f(x) = t \cdot \Big(\frac{1}{\sqrt{2\pi}\sigma_a} e^{-\frac{1}{2}(\frac{x-\mu_a}{\sigma_a})^2}\Big) + (1-t) \cdot \Big(\frac{1}{\sqrt{2\pi}\sigma_b} e^{-\frac{1}{2}(\frac{x-\mu_b}{\sigma_b})^2}\Big) \end{align*}MLE for Gaussian Mixture
Special note: even though the generative story has a bernoulli (group membership) it is never observed. MLE maximizes the likelihood of the observed data.
Let $\vec{\theta} = [t, \mu_a,\mu_b,\sigma_a, \sigma_b]$ be the parameters. Because the math will get long I will use $\theta$ as notation in place of $\vec{\theta}$. Just keep in mind that it is a vector.
The MLE idea is to chose values of $\theta$ which maximize log likelihood. All optimization methods require us to calculate the partial derivatives of the thing we want to optimize (log likelihood) with respect to the values we can change (our parameters).
Likelihood function
\begin{align*} L(\theta) &= \prod_i^n f(x_i | \theta) \\ &= \prod_i^n \Big[t \cdot \Big(\frac{1}{\sqrt{2\pi}\sigma_a} e^{-\frac{1}{2}(\frac{x_i-\mu_a}{\sigma_a})^2}\Big) + (1-t) \cdot \Big(\frac{1}{\sqrt{2\pi}\sigma_b} e^{-\frac{1}{2}(\frac{x_i-\mu_b}{\sigma_b})^2}\Big) \Big] \end{align*}Log Likelihood function
\begin{align*} LL(\theta) &= \log L(\theta) \\ &= \log \prod_i^n f(x_i | \theta) \\ &= \sum_i^n \log f(x_i | \theta) \\ \end{align*}That is sufficient for now, but if you wanted to expand out the term you would get: \begin{align*} LL(\theta) &= \sum_i^n \log \Big[t \cdot \Big(\frac{1}{\sqrt{2\pi}\sigma_a} e^{-\frac{1}{2}(\frac{x_i-\mu_a}{\sigma_a})^2}\Big) + (1-t) \cdot \Big(\frac{1}{\sqrt{2\pi}\sigma_b} e^{-\frac{1}{2}(\frac{x_i-\mu_b}{\sigma_b})^2}\Big) \Big] \end{align*}
Derivative of LL with respect to $\theta$
Here is an example of calculating a partial derivative with respect to one of the parameters, $\mu_a$. You would need a derivative like this for all parameters.
A good first step when doing a huge derivative of a log likelihood function is to think of the derivative for the log of likelihood of a single datapoint. This is the inner sum in the log likelihood expression: $$ \frac{\d}{\d \mu_a} \log f(x_i|\theta) $$
Before we start: notice that $\mu_a$ does not show up in this term from $f(x_i|\theta) $: $$(1-t) \cdot \Big(\frac{1}{\sqrt{2\pi}\sigma_b} e^{-\frac{1}{2}(\frac{x_i-\mu_b}{\sigma_b})^2}\Big) =K $$ In the proof, when we encounter this term, we are going to think of it as a constant which we call $K$. Ok, lets go for it! \begin{align*} \frac{\d}{\d \mu_a} & \log f(x_i|\theta) \\ &= \frac{1}{f(x_i|\theta)} \frac{\d}{\d \mu_a} f(x_i|\theta) &&\text{chain rule on }\log \\ &= \frac{1}{f(x_i|\theta)} \frac{\d}{\d \mu_a} \Big[t \cdot \Big(\frac{1}{\sqrt{2\pi}\sigma_a} e^{-\frac{1}{2}(\frac{x_i-\mu_a}{\sigma_a})^2}\Big) + K\Big] &&\text{substitute in }f(x_i|\theta)\\ &= \frac{1}{f(x_i|\theta)} \frac{\d}{\d \mu_a} \Big[t \cdot \Big(\frac{1}{\sqrt{2\pi}\sigma_a} e^{-\frac{1}{2}(\frac{x_i-\mu_a}{\sigma_a})^2}\Big)\Big] &&\frac{\d}{\d \mu_a} K = 0\\ &= \frac{t}{f(x_i|\theta) \sqrt{2\pi}\sigma_a} \cdot \frac{\d}{\d \mu_a} e^{-\frac{1}{2}(\frac{x_i-\mu_a}{\sigma_a})^2} &&\text{pull out const}\\ &= \frac{t}{f(x_i|\theta) \sqrt{2\pi}\sigma_a} \cdot e^{-\frac{1}{2}(\frac{x_i-\mu_a}{\sigma_a})^2} \cdot \frac{\d}{\d \mu_a} -\frac{1}{2}(\frac{x_i-\mu_a}{\sigma_a})^2&&\text{chain on }e^x \\ &= \frac{t}{f(x_i|\theta) \sqrt{2\pi}\sigma_a} \cdot e^{-\frac{1}{2}(\frac{x_i-\mu_a}{\sigma_a})^2} \cdot \Big[-(\frac{x_i-\mu_a}{\sigma_a}) \frac{\d}{\d \mu_a} (\frac{x_i-\mu_a}{\sigma_a})\Big]&&\text{chain on }x^2 \\ &= \frac{t}{f(x_i|\theta) \sqrt{2\pi}\sigma_a} \cdot e^{-\frac{1}{2}(\frac{x_i-\mu_a}{\sigma_a})^2} \cdot \Big[ -(\frac{x_i-\mu_a}{\sigma_a}) \cdot \frac{-1}{\sigma_a}\Big] &&\text{final derivative} \\ &= \frac{t}{f(x_i|\theta) \sqrt{2\pi}\sigma_a^3} \cdot e^{-\frac{1}{2}(\frac{x_i-\mu_a}{\sigma_a})^2} \cdot (x_i-\mu_a) &&\text{simplify} \\ \end{align*}
That was for a single data-point. For the full dataset: \begin{align*} \frac{\d LL(\theta)}{\d \mu_a} &= \sum_i^n \frac{\d}{\d \mu_a} \log f(x_i|\theta)\\ &= \sum_i^n \frac{t}{f(x_i|\theta) \sqrt{2\pi}\sigma_a^3} \cdot e^{-\frac{1}{2}(\frac{x_i-\mu_a}{\sigma_a})^2} \cdot (x_i-\mu_a) \end{align*}
This process should be repeated for all five parameters! Now, how should we find a value of $\mu_a$, which, in the presence of the other settings to parameters, and the data, makes this derivative zero? Setting the derivative = 0 and solving for $\mu_a$ is not going to work.
Use an Optimizer to Estimate Params
Once we have a $LL$ function and the derivative of $LL$ with respect to each parameter we are ready to compute argmax using an optimizer. In this case the best choice would probably be gradient ascent (or gradient descent with negative log likelihood). \begin{align*} \nabla_{\theta} LL(\theta) &=\begin{bmatrix} \frac{\d LL(\theta)}{\d t}\\ \frac{\d LL(\theta)}{\d \mu_a}\\ \frac{\d LL(\theta)}{\d \mu_b}\\ \frac{\d LL(\theta)}{\d \sigma_a}\\ \frac{\d LL(\theta)}{\d \sigma_b}\\ \end{bmatrix} \end{align*}