Lecture 003

Bayesian Framework

Idea: treat the unknown parameter T not as deterministic, but as a random variable.

Likelihood Distribution

We define the likelihood by conditioning on the underlying parameter T:

\begin{align*} &Pr\{X = k | T = t\}\\ =& p_{X | T = t}(k) \tag{p.m.f. for discrete $X$}\\ =& f_{X | T = t}(k) \tag{p.d.f. for continuous $X$}\\ \end{align*}

There is no conditioning in the classical framework, since there we treat the parameter as a deterministic \theta.

Posterior Distribution

prior distribution: what we know about the underlying parameter before seeing data

\begin{align*} &Pr\{T = t\}\\ =& p_T(t) \tag{p.m.f. for discrete $T$}\\ =& f_T(t) \tag{p.d.f. for continuous $T$}\\ \end{align*}

posterior distribution (posterior probability): the distribution of T computed after seeing the data X = k

\begin{align*} Pr\{T = t | X = k\} \equiv& \frac{Pr\{X = k | T = t\}Pr\{T = t\}}{Pr\{X = k\}}\\ p_{T|X=k}(t) \equiv& \frac{p_{X | T = t}(k) p_T(t)}{p_{X}(k)} \tag{p.m.f. for discrete $T$, discrete $X$}\\ p_{T|X=k}(t) \equiv& \frac{f_{X | T = t}(k) p_T(t)}{f_{X}(k)} \tag{p.m.f. for discrete $T$, continuous $X$}\\ f_{T|X=k}(t) \equiv& \frac{p_{X | T = t}(k) f_T(t)}{p_{X}(k)} \tag{p.d.f. for continuous $T$, discrete $X$}\\ f_{T|X=k}(t) \equiv& \frac{f_{X | T = t}(k) f_T(t)}{f_{X}(k)} \tag{p.d.f. for continuous $T$, continuous $X$}\\ \end{align*}


The idea is that we have a prior distribution Pr\{T = t\} = p_T(t) describing what we think the underlying parameter is before seeing data. We then update our estimate of the parameter based on new data X = k.
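As a concrete sketch of the discrete $T$, discrete $X$ row above (Python; the coin probabilities, prior, and n are made-up numbers for illustration):

```python
# Minimal sketch: T is a coin's unknown heads-probability with three
# candidate values, X is the number of heads in n flips.
from math import comb

ts = [0.3, 0.5, 0.7]                      # support of the prior p_T
prior = {0.3: 0.25, 0.5: 0.5, 0.7: 0.25}

def likelihood(k, t, n=10):
    """p_{X|T=t}(k): binomial p.m.f. of k heads in n flips."""
    return comb(n, k) * t**k * (1 - t)**(n - k)

def posterior(k, n=10):
    """p_{T|X=k}(t) = p_{X|T=t}(k) p_T(t) / p_X(k)."""
    joint = {t: likelihood(k, t, n) * prior[t] for t in ts}
    p_x = sum(joint.values())             # marginal p_X(k), the normalizer
    return {t: joint[t] / p_x for t in ts}

print(posterior(k=8))   # mass shifts toward t = 0.7 after seeing 8 heads
```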

Examples:

Maximum A Posteriori (MAP) Estimate:

\begin{align*} \hat{T}_{MAP}(k) \in& \arg\max_t Pr\{T = t | X = k\}\\ \in& \arg\max_t \frac{Pr\{X = k | T = t\}Pr\{T = t\}}{Pr\{X = k\}}\\ \in& \arg\max_t Pr\{X = k | T = t\}Pr\{T = t\}\\ \in& \arg\max_t p_{T, X}(t, k)\\ \end{align*}

Note that MAP is equivalent to maximizing the joint probability p_{T, X}(t, k) over t.

When the prior p_T(t) is uniform, it does not depend on t, so the Maximum A Posteriori (MAP) Estimate coincides with the Maximum Likelihood Estimate (MLE).

The difference between MLE and MAP is whether the prior is included in the objective being maximized.
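A sketch contrasting the two estimates on the same hypothetical coin setup: with a uniform prior the argmaxes coincide, while a prior concentrated on t = 0.3 can pull the MAP estimate away from the MLE:

```python
# MAP vs. MLE over a finite grid of candidate t values (hypothetical numbers).
from math import comb

ts = [0.3, 0.5, 0.7]

def likelihood(k, t, n=10):
    return comb(n, k) * t**k * (1 - t)**(n - k)

def map_estimate(k, prior, n=10):
    # arg max_t p_{X|T=t}(k) p_T(t); the marginal p_X(k) is constant in t
    return max(ts, key=lambda t: likelihood(k, t, n) * prior[t])

def mle(k, n=10):
    # arg max_t p_{X|T=t}(k)
    return max(ts, key=lambda t: likelihood(k, t, n))

uniform = {t: 1 / 3 for t in ts}
skewed = {0.3: 0.8, 0.5: 0.15, 0.7: 0.05}

print(mle(5), map_estimate(5, uniform))  # 0.5 and 0.5: identical under uniform
print(map_estimate(5, skewed))           # 0.3: the prior outweighs the data
```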

Error Probability

Theorem (error probability of an estimator): the MAP estimator minimizes the error probability among all estimators:

Pr\{\hat{T}(X) \neq T\}

Notice both \hat{T}(X) and T are random variables.

Under a uniform prior, the MAP estimator reduces to the ML estimator:

\begin{align*} \hat{T}_{MAP}(k) \in& \arg\max_t Pr\{X = k | T = t\}Pr\{T = t\}\\ \in& \arg\max_t Pr\{X = k | T = t\} \tag{prior constant in $t$}\\ \hat{T}_{ML}(k) \in& \arg\max_t Pr\{X = k | T = t\} \tag{the likelihood}\\ \end{align*}
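A Monte Carlo sketch of the theorem (Python; reuses the hypothetical coin setup, so the numbers are illustrative only): drawing T from the prior and X given T, the MAP rule's empirical error rate should come out no larger than the ML rule's:

```python
# Empirically compare Pr{T_hat(X) != T} for the MAP and ML decision rules.
import random
from math import comb

ts = [0.3, 0.5, 0.7]
prior = {0.3: 0.25, 0.5: 0.5, 0.7: 0.25}
n = 10

def likelihood(k, t):
    return comb(n, k) * t**k * (1 - t)**(n - k)

def map_rule(k):
    return max(ts, key=lambda t: likelihood(k, t) * prior[t])

def ml_rule(k):
    return max(ts, key=lambda t: likelihood(k, t))

random.seed(0)
trials = 100_000
errors = {"MAP": 0, "ML": 0}
for _ in range(trials):
    t = random.choices(ts, weights=[prior[v] for v in ts])[0]  # T ~ prior
    k = sum(random.random() < t for _ in range(n))             # X | T = t
    errors["MAP"] += map_rule(k) != t
    errors["ML"] += ml_rule(k) != t

print({name: e / trials for name, e in errors.items()})  # expect MAP <= ML
```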

Example: Signal and Noise

Given data X = T + W, where the prior is T ~ Normal(\mu, \sigma^2) and the noise W ~ Normal(0, \tau^2) is independent of T, what is the MAP estimate of T after observing X = x?

We maximize the joint density, which is equivalent to maximizing the posterior:

\begin{align*} f_{T, X}(t, x) =& f_{X | T = t}(x) \cdot f_T(t)\\ =& f_{t + \text{Normal}(0, \tau^2)}(x) \cdot f_{\text{Normal}(\mu, \sigma^2)}(t)\\ =& f_{\text{Normal}(t + 0, 0 + \tau^2)}(x) \cdot f_{\text{Normal}(\mu, \sigma^2)}(t)\\ \end{align*}

So we maximize \ln f_{\text{Normal}(t, \tau^2)}(x) + \ln f_{\text{Normal}(\mu, \sigma^2)}(t) over t.
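Dropping terms that do not depend on t and setting the derivative of the log joint to zero:

\begin{align*} \frac{d}{dt}\left(-\frac{(x - t)^2}{2\tau^2} - \frac{(t - \mu)^2}{2\sigma^2}\right) =& \frac{x - t}{\tau^2} - \frac{t - \mu}{\sigma^2} = 0\\ \left(\frac{1}{\sigma^2} + \frac{1}{\tau^2}\right) t =& \frac{1}{\sigma^2}\mu + \frac{1}{\tau^2}x\\ \end{align*}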

Final Answer:

\hat{T}_{MAP}(x) = \frac{(1/\sigma^2)\mu + (1/\tau^2)x}{(1/\sigma^2) + (1/\tau^2)}

Observe that Pr\{\text{Error}\} = Pr\{\hat{T}_{MAP}(X) \neq T\} = 1: since T is continuous, any point estimate equals T with probability 0.

Note that \hat{T}_{MAP}(x) is a weighted average between \mu (with weight \frac{1}{\sigma^2}) and x (with weight \frac{1}{\tau^2}). The more variable a source is, the smaller its weight. We can see that MAP incorporates both the prior and the data.
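A numerical check (Python with NumPy; \mu, \sigma, \tau, and x are made-up values) that the closed form matches a brute-force maximization of the log joint density:

```python
# Verify the precision-weighted average against grid search over t.
import numpy as np

mu, sigma = 0.0, 2.0   # prior: T ~ Normal(mu, sigma^2)
tau = 1.0              # noise: W ~ Normal(0, tau^2)
x = 3.0                # observed value of X = T + W

# Closed form: weights are the precisions 1/sigma^2 and 1/tau^2
t_map = (mu / sigma**2 + x / tau**2) / (1 / sigma**2 + 1 / tau**2)

# Brute force: maximize ln f_{X|T=t}(x) + ln f_T(t), constants dropped
grid = np.linspace(-10, 10, 200_001)
log_joint = -(x - grid) ** 2 / (2 * tau**2) - (grid - mu) ** 2 / (2 * sigma**2)
t_grid = grid[np.argmax(log_joint)]

print(t_map, t_grid)   # both ~ 2.4: closer to x because tau < sigma
```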
