Lecture 001

Statistics: using data to infer a probabilistic model (a distribution with unknown parameters)

Maximum Likelihood Estimate

Maximum Likelihood Estimate (MLE): \hat{\theta}_{ML}(k) = \arg\max_{\theta} p_{X;\theta}(k), the parameter value that makes the observed data most probable.

WARNING: we treat p_{X;\theta}(k) as a function of \theta (the parameter) instead of k (the observed data).

Example: the likelihood of observing X = k, where X \sim \text{Binomial}(100, p), is

Pr\{X = k\} = p_{X;p}(k) = \begin{cases} {100 \choose k}p^k(1 - p)^{100 - k} & \text{if } 0 \leq k \leq 100\\ 0 & \text{otherwise}\\ \end{cases}

To get the maximum likelihood estimate \hat{p}_{ML}(k) for p, we take the derivative:

\begin{align*} \frac{d}{dp}p_{X;p}(k) =& {100 \choose k} \left(kp^{k - 1}(1 - p)^{100 - k} + p^k(100 - k)(1 - p)^{99 - k} \cdot (-1)\right)\\ =& {100 \choose k} p^{k - 1}(1 - p)^{99 - k}(k - 100p)\\ \end{align*}

We get \frac{d}{dp}p_{X;p} = 0 when p = \frac{k}{100} (verify that \frac{d^2}{dp^2}p_{X;p} < 0 there, so it is a maximum). Therefore, the maximum likelihood estimate is:

\hat{p}_{ML}(k) = \frac{k}{100}
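As a quick sanity check (not from the lecture), a grid search over candidate values of p, with a made-up observation k = 37, recovers the closed-form estimate numerically:

```python
import math

# Hypothetical observation: k successes out of n = 100 trials.
k = 37

def binom_likelihood(p, k, n=100):
    # Binomial(n, p) PMF evaluated at k, viewed as a function of p.
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

# Search p over a fine grid in (0, 1) for the likelihood maximizer.
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=lambda p: binom_likelihood(p, k))
print(p_hat)  # 0.37, i.e. k/100
```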

Maximum Likelihood Estimate with i.i.d. data

Say we have n random variables X_{1:n} := X_1, X_2, ..., X_n; then the joint distribution is p_{X_{1:n};\theta}(k_{1:n}).

Example: we want to estimate the parameter \lambda from 30 days of data, where each day's count X_i is drawn i.i.d. from \text{Poisson}(\lambda). We observe X_1 = k_1, X_2 = k_2, ..., X_{30} = k_{30}.

\begin{align*} &p_{X_{1:30};\lambda}(k_{1:30})\\ =& p_{X_1;\lambda}(k_1) \cdot p_{X_2;\lambda}(k_2) \cdots p_{X_{30};\lambda}(k_{30}) \tag{by independence}\\ =& \frac{\lambda^{k_1+k_2+\cdots+k_{30}}e^{-30\lambda}}{k_1! \cdot k_2! \cdots k_{30}!}\\ \end{align*}

To maximize the above, we compute the derivative:

\frac{d}{d\lambda}p_{X_{1:30};\lambda}(k_{1:30}) = \frac{(k_1 + \cdots + k_{30})\lambda^{(k_1 + \cdots + k_{30}) - 1}e^{-30\lambda}+\lambda^{k_1 + \cdots + k_{30}}e^{-30\lambda}(-30)}{k_1! \cdot k_2! \cdots k_{30}!}

We get \frac{d}{d\lambda}p_{X_{1:30};\lambda} = 0 when \hat{\lambda}_{ML}(k_{1:30}) = \frac{k_1 + \cdots + k_{30}}{30} (verify that \frac{d^2}{d\lambda^2}p_{X_{1:30};\lambda} < 0 there).

This result makes sense: the estimate is the sample mean, and the mean of \text{Poisson}(\lambda) is \lambda.
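A small numerical sketch (with made-up daily counts) confirms that the grid maximizer of the joint Poisson likelihood is the sample mean:

```python
import math

# Hypothetical counts for 8 days (fewer than 30, just for illustration);
# sum = 32, so the sample mean is 4.0.
data = [3, 1, 4, 1, 5, 9, 2, 7]

def poisson_joint_lik(lam, ks):
    # Product of Poisson(lam) PMFs, by independence.
    out = 1.0
    for k in ks:
        out *= lam**k * math.exp(-lam) / math.factorial(k)
    return out

# Grid search over lambda in (0, 20].
grid = [i / 100 for i in range(1, 2001)]
lam_hat = max(grid, key=lambda l: poisson_joint_lik(l, data))
print(lam_hat)  # 4.0, the sample mean
```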

Log-likelihood

The log-likelihood is the natural logarithm of the likelihood, \ln p_{X;\theta}(k).

Maximizing the log-likelihood is equivalent to maximizing the likelihood since \ln(x) is strictly increasing.

Using the Poisson example above:

\begin{align*} \ln(p_{X_{1:30};\lambda}(k_{1:30})) =& \sum_{i = 1}^{30} \ln \frac{e^{-\lambda}\lambda^{k_i}}{k_i!}\\ =& -30\lambda + \left(\sum_{i = 1}^{30}k_i\right)\ln \lambda - \sum_{i = 1}^{30}\ln(k_i!)\\ \frac{d}{d\lambda} \ln(p_{X_{1:30};\lambda}(k_{1:30})) =& -30 + \left(\sum_{i = 1}^{30}k_i\right) \frac{1}{\lambda}\\ \end{align*}

Setting this to 0, we again get \hat{\lambda}_{ML}(k_{1:30}) = \frac{\sum_{i = 1}^{30}k_i}{30}, matching the earlier result.
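The equivalence can be checked numerically: on hypothetical Poisson counts, the likelihood and the log-likelihood have the same grid maximizer (the sample mean):

```python
import math

# Hypothetical counts; sum = 28, sample mean = 3.5.
data = [2, 4, 3, 7, 1, 4, 2, 5]

def lik(lam):
    # Joint Poisson likelihood (product of PMFs).
    out = 1.0
    for k in data:
        out *= lam**k * math.exp(-lam) / math.factorial(k)
    return out

def loglik(lam):
    # Log-likelihood in the closed form derived above (n = len(data)).
    return (-len(data) * lam + sum(data) * math.log(lam)
            - sum(math.log(math.factorial(k)) for k in data))

grid = [i / 100 for i in range(1, 1001)]
p1 = max(grid, key=lik)
p2 = max(grid, key=loglik)
print(p1, p2)  # 3.5 3.5 -- same maximizer, since ln is strictly increasing
```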

Maximum Likelihood Estimate with Continuous Random Variables

Exponential Maximum Likelihood

Example: maximum likelihood for n i.i.d. samples X_i \sim \text{Exponential}(\lambda)

\begin{align*} &f_{X_{1:n};\lambda}(t_{1:n})\\ =& \prod_{i = 1}^n f_{X_i;\lambda}(t_i) \tag{by independence}\\ =& \prod_{i = 1}^n \lambda e^{-\lambda t_i}\\ =& \lambda^n e^{-\lambda \sum_{i = 1}^n t_i}\\ \end{align*}

Differentiating the above, we get:

\begin{align*} &\frac{d}{d\lambda} \lambda^n e^{-\lambda \sum_{i = 1}^n t_i}\\ =& n\lambda^{n - 1}e^{-\lambda \sum_{i =1}^n t_i} + \lambda^n e^{-\lambda \sum_{i = 1}^n t_i} \cdot \left(- \sum_{i = 1}^n t_i\right)\\ =& \lambda^{n - 1}e^{-\lambda \sum_{i = 1}^n t_i} \left(n - \lambda \sum_{i = 1}^n t_i\right)\\ \end{align*}

Setting derivative to 0, we get:

\hat{\lambda}_{ML}(t_{1:n}) = \frac{n}{\sum_{i = 1}^n t_i}

This is intuitive because the mean of Exponential(\lambda) is \frac{1}{\lambda}.
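A sketch with hypothetical waiting times: the grid maximizer of the joint exponential density matches n / \sum t_i, the reciprocal of the sample mean:

```python
import math

# Hypothetical waiting times; n = 4, sum = 4.0, so lambda_hat should be 1.0.
times = [0.5, 1.5, 0.75, 1.25]

def joint_density(lam):
    # lambda^n * exp(-lambda * sum(t_i)), as derived above.
    return lam**len(times) * math.exp(-lam * sum(times))

grid = [i / 100 for i in range(1, 1001)]
lam_hat = max(grid, key=joint_density)
print(lam_hat)  # 1.0 = n / sum(times)
```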

Normal Maximum Likelihood

Let X_{1:n} be i.i.d. such that each X_i \sim \text{Normal}(\mu, \sigma^2). We find \hat{\sigma}_{ML}(t_{1:n}) using log-likelihood, assuming \mu is known.

\begin{align*} &\ln f_{X_{1:n};\sigma}(t_{1:n})\\ =& \ln (\prod_{i = 1}^n f_{X_i;\sigma}(t_i))\tag{by independence}\\ =& \sum_{i = 1}^n \ln f_{X_i;\sigma}(t_i) \tag{by property of $\ln$}\\ =& \sum_{i = 1}^n \ln \left(\frac{1}{\sqrt{2\pi} \sigma}e^{-\frac{(t_i - \mu)^2}{2\sigma^2}}\right)\\ =& \sum_{i = 1}^n \left(-\frac{(t_i - \mu)^2}{2 \sigma^2} - \ln \sigma - \ln \sqrt{2\pi}\right)\\ =& -\frac{1}{2\sigma^2} \sum_{i = 1}^n (t_i - \mu)^2 - n \ln \sigma - n \ln \sqrt{2\pi}\\ \end{align*}

Taking the derivative:

\begin{align*} &\frac{d}{d\sigma} \ln f_{X_{1:n};\sigma}(t_{1:n})\\ =& \frac{1}{\sigma^3} \sum_{i = 1}^n (t_i - \mu)^2 - \frac{n}{\sigma}\\ =& \frac{1}{\sigma} \left(\frac{1}{\sigma^2} \sum_{i = 1}^n (t_i - \mu)^2 - n\right)\\ \end{align*}

Setting derivative to 0, we get:

\hat{\sigma}_{ML}(t_{1:n}) = \sqrt{\frac{\sum_{i = 1}^n (t_i - \mu)^2}{n}}
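Again as a sketch (made-up data, known \mu): a grid maximizer of the Gaussian log-likelihood agrees with the closed form above:

```python
import math

# Hypothetical data with known mean mu = 0.
mu = 0.0
data = [1.0, -2.0, 0.5, -0.5, 2.0]
ss = sum((t - mu)**2 for t in data)  # sum of squared deviations

def loglik(sigma):
    # Log-likelihood in the closed form derived above.
    n = len(data)
    return (-ss / (2 * sigma**2) - n * math.log(sigma)
            - n * math.log(math.sqrt(2 * math.pi)))

closed_form = math.sqrt(ss / len(data))
grid = [i / 1000 for i in range(1, 5001)]
sigma_hat = max(grid, key=loglik)
print(round(sigma_hat, 2))  # 1.38, matching sqrt(ss / n) to grid precision
```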

Estimator as a Function of Data

For \text{Binomial}(n, p), given data k and fixed n, \hat{p}_{ML}(k) = \frac{k}{n}. But we can treat the data itself as random: plugging the random variable X in place of the observed k, \hat{p}_{ML}(X) = \frac{X}{n} becomes a random variable.

For \text{Binomial}(n, p) with fixed n, we have:

E[\hat{p}_{ML}(X)] = E\left[\frac{X}{n}\right] = \frac{E[X]}{n} = \frac{np}{n} = p

Here, the mean is equal to the parameter being estimated, so we say the estimator is "unbiased".

Var(\hat{p}_{ML}(X)) = Var\left(\frac{X}{n}\right) = \frac{Var(X)}{n^2} = \frac{np(1 - p)}{n^2} = \frac{p(1 - p)}{n}

Note the denominator is n, not \sqrt{n}: scaling X by \frac{1}{n} scales the variance by \frac{1}{n^2}.

Therefore \hat{\theta}_{ML}(X_1, ..., X_n) is itself a random variable; we study its probability distribution to assess how good the estimate is.
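A simulation sketch (hypothetical n, p, and trial count) of the sampling distribution of \hat{p}_{ML} = X/n: the empirical mean approaches p (unbiasedness) and the empirical variance approaches p(1-p)/n:

```python
import random

random.seed(0)
n, p, trials = 100, 0.3, 5000  # made-up parameters for illustration

estimates = []
for _ in range(trials):
    # One Binomial(n, p) draw as a sum of n Bernoulli(p) indicators.
    x = sum(random.random() < p for _ in range(n))
    estimates.append(x / n)

mean = sum(estimates) / trials
var = sum((e - mean)**2 for e in estimates) / trials
print(mean)  # close to p = 0.3
print(var)   # close to p(1-p)/n = 0.0021
```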
