Maximum Likelihood Estimation

Machine Learning

Given $N$ independent and identically distributed (i.i.d.) random variables $X_1, X_2, \ldots, X_N$ drawn from a distribution $D$ with density $f(\cdot|\theta)$ and unknown parameter $\theta$, maximum likelihood estimation uses the observed sample to estimate $\theta$.

Description

The joint density function is $f(X_1, X_2, \ldots, X_N |\theta) = f(X_1|\theta)f(X_2|\theta)\cdots f(X_N|\theta)$. We need to find the $\theta$ that maximizes the average log-likelihood $L_N(\theta)= \frac{1}{N}\log f(X_1, X_2, \ldots, X_N |\theta) = \frac{1}{N}\sum_{i=1}^{N}\log f(X_i|\theta)$.
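As a concrete sketch of this definition, the snippet below assumes an exponential model $f(x|\theta)=\theta e^{-\theta x}$ (the particular distribution is my choice for illustration, not anything fixed by the text) and maximizes $L_N(\theta)$ numerically; for this model the MLE also has the closed form $1/\bar{X}$, which serves as a check.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
theta_0 = 2.0                                           # true parameter (illustrative)
x = rng.exponential(scale=1.0 / theta_0, size=10_000)   # i.i.d. sample X_1, ..., X_N

def L_N(theta):
    # average log-likelihood (1/N) * sum_i log f(X_i|theta),
    # with log f(x|theta) = log(theta) - theta * x for the exponential model
    return np.mean(np.log(theta) - theta * x)

# maximize L_N by minimizing its negative over a bounded interval
res = minimize_scalar(lambda t: -L_N(t), bounds=(1e-6, 50.0), method="bounded")
print(res.x, 1.0 / x.mean())   # numerical MLE vs. closed-form MLE 1/mean(X)
```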

Let $\hat{\theta}$ be the value of $\theta$ that maximizes $L_N(\theta)$ and let $\theta_0$ be the true value of the parameter $\theta$. Then:

  1. (Consistency) When $N\rightarrow \infty $, $\hat{\theta}\rightarrow\theta_0$.
  2. (Asymptotic normality) $\sqrt{N}(\hat{\theta}-\theta_0) \sim N(0, I(\theta_0)^{-1})$, where $I(\theta_0)$ is the Fisher information, $I(\theta_0) = E_{\theta_0}[(\frac{\partial}{\partial\theta}\log f(X|\theta)|_{\theta=\theta_0})^2]$.
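For concreteness, here is the Fisher information worked out for the same illustrative exponential model $f(x|\theta)=\theta e^{-\theta x}$ used in the sketch above (again, only an example):

$$\log f(x|\theta) = \log\theta - \theta x,\qquad \frac{\partial}{\partial\theta}\log f(x|\theta) = \frac{1}{\theta} - x,\\
I(\theta_0) = E_{\theta_0}\left[\left(\frac{1}{\theta_0} - X\right)^{2}\right] = \mathrm{Var}_{\theta_0}(X) = \frac{1}{\theta_0^{2}}$$

so property 2 predicts that $\sqrt{N}(\hat{\theta}-\theta_0)$ is approximately $N(0, \theta_0^2)$ for this model.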

Proof

1. When $N\rightarrow \infty$, $\hat{\theta}\rightarrow\theta_0$

Define $l(X|\theta) = \log f(X|\theta)$ and $L(\theta) = E_{\theta_0}(l(X|\theta))$, so that $L(\theta) =\int (\log f(X|\theta))f(X|\theta_0)\mathrm{d}X$. By the LLN (Law of Large Numbers), as $N\rightarrow \infty $, $L_N(\theta)\rightarrow E_{\theta_0}(l(X|\theta))=L(\theta)$, so for large $N$ maximizing $L_N(\theta)$ approximately maximizes $E_{\theta_0}(l(X|\theta))$.

It remains to prove that $\theta_0$ maximizes $E_{\theta_0}(l(X|\theta))$, in other words that $L(\theta) \leq L(\theta_0)$ for any $\theta$. Proof (using the inequality $\log t \leq t-1$):

$$L(\theta)-L(\theta_0) = E_{\theta_0}(l(X|\theta)-l(X|\theta_0))\\
= E_{\theta_0}(\log f(X|\theta)-\log f(X|\theta_0))\\
= E_{\theta_0}\log \frac{f(X|\theta)}{f(X|\theta_0)}\\
\leq E_{\theta_0}(\frac{f(X|\theta)}{f(X|\theta_0)}-1)\\
= \int [f(X|\theta)-f(X|\theta_0)]\mathrm{d}X\\
= 1-1 = 0$$
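As an informal numerical check (under the same illustrative exponential model as before), one can approximate $L(\theta)=E_{\theta_0}(l(X|\theta))$ by a Monte Carlo average over a large sample drawn from $\theta_0$ and confirm that $L(\theta)-L(\theta_0)\leq 0$:

```python
import numpy as np

rng = np.random.default_rng(1)
theta_0 = 2.0
x = rng.exponential(scale=1.0 / theta_0, size=200_000)  # large sample from theta_0

def L(theta):
    # Monte Carlo approximation of E_{theta_0}[log f(X|theta)]
    return np.mean(np.log(theta) - theta * x)

for theta in [0.5, 1.0, 2.0, 3.0, 5.0]:
    print(theta, L(theta) - L(theta_0))  # non-positive, zero only at theta = theta_0
```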

2. $\sqrt{N}(\hat{\theta}-\theta_0) \sim N(0, I(\theta_0)^{-1})$

By Taylor's theorem with the mean value (Lagrange) form of the remainder, there is a $\theta$ between $\hat{\theta}$ and $\theta_0$ satisfying $L'_N(\hat{\theta}) = L'_N(\theta_0) + L''_N(\theta_0)(\hat{\theta}-\theta_0) + \frac{1}{2}L^{(3)}_N(\theta)(\hat{\theta}-\theta_0)^2$. Since $\hat{\theta}$ maximizes $L_N$, we have $L'_N(\hat{\theta}) = 0$, and the remainder term $\frac{1}{2}L^{(3)}_N(\theta)(\hat{\theta}-\theta_0)^2$ is negligible for large $N$, so $\hat{\theta}-\theta_0 \approx \frac{-L'_N(\theta_0)}{L''_N(\theta_0)}$ and $\sqrt{N}(\hat{\theta}-\theta_0) \approx \frac{-\sqrt{N}L'_N(\theta_0)}{L''_N(\theta_0)}$.

First note that the score $l'(X|\theta_0)$ has mean zero under $\theta_0$:

$$E_{\theta_0}(l'(X|\theta_0)) = \int [\frac{\partial }{\partial\theta}\log f(X|\theta)]|_{\theta = \theta_0}f(X|\theta_0)\mathrm{d}X\\
= \int \frac{\partial }{\partial\theta}f(X|\theta)|_{\theta = \theta_0}\mathrm{d}X\\
= \frac{\partial }{\partial\theta}\int f(X|\theta)\mathrm{d}X|_{\theta = \theta_0} = 0$$
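A quick simulation of this identity under the illustrative exponential model, where $l'(x|\theta_0)=\frac{1}{\theta_0}-x$:

```python
import numpy as np

rng = np.random.default_rng(2)
theta_0 = 2.0
x = rng.exponential(scale=1.0 / theta_0, size=1_000_000)
score = 1.0 / theta_0 - x   # l'(X|theta_0) for the exponential model
print(score.mean())         # close to 0
```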

$$\sqrt{N}L'_N(\theta_0) = \sqrt{N}[\frac{1}{N}\sum_{i=1}^Nl'(X_i|\theta_0)]\\
= \sqrt{N}[\frac{1}{N}\sum_{i=1}^Nl'(X_i|\theta_0) - 0]\\
= \sqrt{N}[\frac{1}{N}\sum_{i=1}^Nl'(X_i|\theta_0) - E_{\theta_0}(l'(X|\theta_0))]$$

Based on the CLT (Central Limit Theorem), $\sqrt{N}L'_N(\theta_0) \sim N(0, D_{\theta_0}(l'(X|\theta_0)))$, where $D_{\theta_0}$ denotes the variance under $\theta_0$, and therefore $\sqrt{N}(\hat{\theta}-\theta_0) \sim N(0, \frac{D_{\theta_0}(l'(X|\theta_0))}{[L''_N(\theta_0)]^2})$.

Since $L''_N(\theta_0) = \frac{1}{N}\sum_{i=1}^{N}\frac{\partial^2}{\partial\theta^2}\log f(X_i|\theta)|_{\theta=\theta_0}$, by the LLN, as $N\rightarrow \infty$ we have:

$$L''_N(\theta_0) \rightarrow E_{\theta_0}(l''(X|\theta_0))\\
= \int [-\frac{[f'(X|\theta_0)]^2}{f^2(X|\theta_0)} + \frac{f''(X|\theta_0)}{f(X|\theta_0)}]f(X|\theta_0)\mathrm{d}X\\
= -\int [\frac{[f'(X|\theta_0)]^2}{f^2(X|\theta_0)}]f(X|\theta_0)\mathrm{d}X + \int f''(X|\theta_0)\mathrm{d}X\\
= -E_{\theta_0}[(l'(X|\theta_0))^2]+0 = -I(\theta_0)$$
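To see the identity $-E_{\theta_0}(l''(X|\theta_0)) = E_{\theta_0}[(l'(X|\theta_0))^2]$ numerically with a non-constant second derivative, the check below switches to a Poisson model $f(x|\theta)=\theta^x e^{-\theta}/x!$ (again, just an illustrative choice), for which $l'(x|\theta)=x/\theta - 1$, $l''(x|\theta)=-x/\theta^2$, and $I(\theta)=1/\theta$:

```python
import numpy as np

rng = np.random.default_rng(3)
theta_0 = 4.0
x = rng.poisson(lam=theta_0, size=1_000_000)

score = x / theta_0 - 1.0     # l'(X|theta_0)
neg_hess = x / theta_0**2     # -l''(X|theta_0)
print((score**2).mean(), neg_hess.mean(), 1.0 / theta_0)  # all close to I(theta_0) = 1/theta_0
```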

$$D_{\theta_0}(l'(X|\theta_0)) = E_{\theta_0}[(l'(X|\theta_0))^2] - [E_{\theta_0}(l'(X|\theta_0))]^2 = I(\theta_0) - 0 = I(\theta_0)$$

Putting these together, $\sqrt{N}(\hat{\theta}-\theta_0) \sim N(0, \frac{I(\theta_0)}{[-I(\theta_0)]^2}) = N(0, I(\theta_0)^{-1})$.
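Finally, a small simulation of this conclusion under the illustrative exponential model from earlier, where $\hat{\theta} = 1/\bar{X}$ and $I(\theta_0)^{-1} = \theta_0^2$:

```python
import numpy as np

rng = np.random.default_rng(4)
theta_0, N, reps = 2.0, 5_000, 2_000
x = rng.exponential(scale=1.0 / theta_0, size=(reps, N))
theta_hat = 1.0 / x.mean(axis=1)           # MLE in each of `reps` replications
z = np.sqrt(N) * (theta_hat - theta_0)     # should be approximately N(0, theta_0^2)
print(z.mean(), z.var(), theta_0**2)       # sample mean ~ 0, sample variance ~ theta_0^2
```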