Lecture 003

Bayesian Framework

Idea: treat the unknown parameter T not as a deterministic quantity, but as a random variable.

Likelihood Distribution

We define the likelihood by conditioning on the value of the underlying parameter T:

\begin{align*} &Pr\{X = k | T = t\}\\ =& p_{X | T = t}(k) \tag{p.m.f. for discrete $X$}\\ =& f_{X | T = t}(k) \tag{p.d.f. for continuous $X$}\\ \end{align*}

There is no conditioning in the classical framework, since there the parameter is treated as a deterministic value \theta.

Posterior Distribution

prior distribution: what we know about the underlying parameter T before seeing data

\begin{align*} &Pr\{T = t\}\\ =& p_T(t) \tag{p.m.f. for discrete $T$}\\ =& f_T(t) \tag{p.d.f. for continuous $T$}\\ \end{align*}

posterior distribution (posterior probability): probabilities computed after seeing X = k

\begin{align*} Pr\{T = t | X = k\} \equiv& \frac{Pr\{X = k | T = t\}Pr\{T = t\}}{Pr\{X = k\}}\\ p_{T|X=k}(t) \equiv& \frac{p_{X | T = t}(k) p_T(t)}{p_{X}(k)} \tag{p.m.f. for discrete $T$, discrete $X$}\\ p_{T|X=k}(t) \equiv& \frac{f_{X | T = t}(k) p_T(t)}{f_{X}(k)} \tag{p.m.f. for discrete $T$, continuous $X$}\\ f_{T|X=k}(t) \equiv& \frac{p_{X | T = t}(k) f_T(t)}{p_{X}(k)} \tag{p.d.f. for continuous $T$, discrete $X$}\\ f_{T|X=k}(t) \equiv& \frac{f_{X | T = t}(k) f_T(t)}{f_{X}(k)} \tag{p.d.f. for continuous $T$, continuous $X$}\\ \end{align*}


The idea is that we start with a prior distribution Pr\{T = t\} = p_T(t) that encodes what we believe about the underlying variable before seeing data. We then update that belief based on the new data X = k.
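As a minimal sketch of this update for discrete T and discrete X, consider a coin whose bias T takes one of three values and whose number of heads in n flips is X. The Binomial model, the prior values, and the function name `posterior` are illustrative assumptions, not part of the lecture.

```python
from math import comb

def posterior(prior, n, k):
    """Posterior p_{T|X=k}(t) for discrete T, assuming X | T = t ~ Binomial(n, t)."""
    # Numerator of Bayes' rule: p_{X|T=t}(k) * p_T(t) for every candidate t.
    joint = {t: comb(n, k) * t**k * (1 - t) ** (n - k) * p for t, p in prior.items()}
    # Denominator p_X(k), obtained by summing the joint over t.
    evidence = sum(joint.values())
    return {t: j / evidence for t, j in joint.items()}

# Hypothetical prior over the coin bias (assumed values for illustration).
prior = {0.2: 0.25, 0.5: 0.5, 0.8: 0.25}
post = posterior(prior, n=10, k=8)
for t in sorted(post):
    print(f"t = {t}: prior {prior[t]:.3f} -> posterior {post[t]:.3f}")
```

After observing 8 heads in 10 flips, most of the probability mass moves from t = 0.5 to t = 0.8, even though the prior favored 0.5.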


Maximum A Posteriori (MAP) Estimate:

\begin{align*} \hat{T}_{MAP}(k) \in& \arg\max_t Pr\{T = t | X = k\}\\ \in& \arg\max_t \frac{Pr\{X = k | T = t\}Pr\{T = t\}}{Pr\{X = k\}}\\ \in& \arg\max_t Pr\{X = k | T = t\}Pr\{T = t\}\\ \in& \arg\max_t p_{T, X}(t, k)\\ \end{align*}

Note that MAP is equivalent to maximizing the joint probability p_{T, X}(t, k) over t, because the denominator Pr\{X = k\} does not depend on t.

When p_T(t) is uniform, p_T(t) does not depend on t, so it drops out of the maximization. In that case the Maximum A Posteriori (MAP) estimate coincides with the Maximum Likelihood Estimate (MLE).

The difference between MLE and MAP is whether the prior is included in the objective being maximized.
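The effect of the prior can be illustrated with a small sketch. It assumes a hypothetical coin-flip model (X | T = t is Binomial(n, t)) and made-up prior values; the helper names are illustrative only.

```python
from math import comb

def likelihood(t, n, k):
    """p_{X|T=t}(k) under the assumed Binomial(n, t) model."""
    return comb(n, k) * t**k * (1 - t) ** (n - k)

def map_estimate(prior, n, k):
    """argmax over t of likelihood * prior."""
    return max(prior, key=lambda t: likelihood(t, n, k) * prior[t])

def ml_estimate(candidates, n, k):
    """argmax over t of the likelihood alone."""
    return max(candidates, key=lambda t: likelihood(t, n, k))

# Hypothetical priors: one strongly favoring t = 0.5, one uniform.
skewed = {0.2: 0.05, 0.5: 0.90, 0.8: 0.05}
uniform = {t: 1 / 3 for t in skewed}
n, k = 10, 7

ml = ml_estimate(skewed, n, k)
map_skewed = map_estimate(skewed, n, k)
map_uniform = map_estimate(uniform, n, k)
print("ML:", ml, "| MAP (skewed prior):", map_skewed, "| MAP (uniform prior):", map_uniform)
```

With the skewed prior the two estimates differ, while under the uniform prior the MAP estimate matches the ML estimate.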

Error Probability

Theorem (error probability of an estimator): among all estimators \hat{T}, the MAP estimator minimizes the error probability

Pr\{\hat{T}(X) \neq T\}

Notice both \hat{T}(X) and T are random variables.
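For a discrete model the error probability can be computed exactly, since Pr\{\hat{T}(X) = T\} = \sum_k p_{T, X}(\hat{T}(k), k). The sketch below compares the MAP rule with the ML rule under an assumed Binomial coin-flip model and a made-up prior; all names and numbers are illustrative assumptions.

```python
from math import comb

def likelihood(t, n, k):
    """p_{X|T=t}(k) under the assumed Binomial(n, t) model."""
    return comb(n, k) * t**k * (1 - t) ** (n - k)

def error_probability(estimator, prior, n):
    """Exact Pr{T_hat(X) != T} = 1 - sum_k p_{T,X}(T_hat(k), k)."""
    correct = 0.0
    for k in range(n + 1):
        t_hat = estimator(k)
        # Joint probability that T equals the estimate and X = k.
        correct += likelihood(t_hat, n, k) * prior[t_hat]
    return 1.0 - correct

prior = {0.2: 0.05, 0.5: 0.90, 0.8: 0.05}  # hypothetical prior
n = 10

def map_rule(k):
    return max(prior, key=lambda t: likelihood(t, n, k) * prior[t])

def ml_rule(k):
    return max(prior, key=lambda t: likelihood(t, n, k))

err_map = error_probability(map_rule, prior, n)
err_ml = error_probability(ml_rule, prior, n)
print(f"error probability: MAP {err_map:.4f}, ML {err_ml:.4f}")
```

Because the MAP rule picks the largest joint probability for every k, its sum of "correct" terms cannot be smaller than that of any other rule, which is the content of the theorem.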

\begin{align*} \hat{T}_{MAP}(k) \in& \arg\max_t Pr\{X = k | T = t\}Pr\{T = t\} \tag{likelihood $\times$ prior}\\ \hat{T}_{ML}(k) \in& \arg\max_t Pr\{X = k | T = t\} \tag{likelihood only}\\ \end{align*}

Example: Signal and Noise

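Given data X = T + W, where the signal T \sim \text{Normal}(\mu, \sigma^2) and the noise W \sim \text{Normal}(0, \tau^2) are independent, what is the MAP estimate of T after observing X = x?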

We maximize the posterior over t, which is equivalent to maximizing the joint density:

\begin{align*} f_{T, X}(t, x) =& f_{X | T = t}(x) \cdot f_T(t)\\ =& f_{t + \text{Normal}(0, \tau^2)}(x) \cdot f_{\text{Normal}(\mu, \sigma^2)}(t)\\ =& f_{\text{Normal}(t+0, 0+\tau^2)}(x) \cdot f_{\text{Normal}(\mu, \sigma^2)}(t)\\ \end{align*}

So we maximize \ln f_{\text{Normal}(t, \tau^2)}(x) + \ln f_{\text{Normal}(\mu, \sigma^2)}(t) over t
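The maximization can be sketched by writing out the two Gaussian log-densities, dropping additive constants that do not depend on t, and setting the derivative with respect to t to zero:

\begin{align*} \ln(\text{likelihood}) + \ln(\text{prior}) =& -\frac{(x - t)^2}{2\tau^2} - \frac{(t - \mu)^2}{2\sigma^2} + \text{const}\\ 0 =& \frac{x - t}{\tau^2} - \frac{t - \mu}{\sigma^2} \tag{derivative in $t$ set to zero}\\ t \left(\frac{1}{\sigma^2} + \frac{1}{\tau^2}\right) =& \frac{\mu}{\sigma^2} + \frac{x}{\tau^2}\\ \end{align*}

The second derivative is -\frac{1}{\tau^2} - \frac{1}{\sigma^2} < 0, so the objective is concave and this stationary point is the unique maximizer.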

Final Answer:

\hat{T}_{MAP}(x) = \frac{(1/\sigma^2)\mu + (1/\tau^2)x}{(1/\sigma^2) + (1/\tau^2)}

Observe Pr\{\text{Error}\} = Pr\{\frac{(1/\sigma^2)\mu + (1/\tau^2)X}{(1/\sigma^2) + (1/\tau^2)} \neq T\} = 1, since T is continuous and therefore equals any single estimated value with probability zero.

Note that \hat{T}_{ML}(x) = x, since the likelihood alone is maximized at t = x. In contrast, \hat{T}_{MAP}(x) is a weighted average of \mu (with weight \frac{1}{\sigma^2}) and x (with weight \frac{1}{\tau^2}). The larger the variance of a source (prior or noise), the smaller its weight. We can see that MAP incorporates both the prior and the data.
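The closed form can be sanity-checked numerically. The sketch below compares it with a brute-force grid search over the log of the joint density; the parameter values, the grid, and the function names are assumptions chosen for illustration.

```python
def gaussian_map(x, mu, sigma2, tau2):
    """Closed-form MAP estimate: precision-weighted average of mu and x."""
    w_prior, w_data = 1.0 / sigma2, 1.0 / tau2
    return (w_prior * mu + w_data * x) / (w_prior + w_data)

def log_joint(t, x, mu, sigma2, tau2):
    """Log of likelihood times prior, up to constants that do not depend on t."""
    return -((x - t) ** 2) / (2 * tau2) - ((t - mu) ** 2) / (2 * sigma2)

# Assumed example values: prior Normal(0, 4), noise variance 1, observation x = 5.
mu, sigma2, tau2, x = 0.0, 4.0, 1.0, 5.0

closed_form = gaussian_map(x, mu, sigma2, tau2)

# Brute-force maximization over a grid on [-10, 10] with step 0.001.
grid = [i / 1000 for i in range(-10000, 10001)]
grid_argmax = max(grid, key=lambda t: log_joint(t, x, mu, sigma2, tau2))

print("closed form:", closed_form, "| grid search:", grid_argmax)
```

Making the noise variance `tau2` very large pulls the estimate toward `mu`, and making it very small pulls the estimate toward `x`, which matches the weighting described above.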
