# Lecture 003

## Bayesian Framework

Idea: treat the unknown parameter $T$ as a random variable rather than a deterministic quantity.

### Likelihood Distribution

We define the likelihood by conditioning on the underlying random variable $T$:

\begin{align*} &Pr\{X = k | T = t\}\\ =& p_{X | T = t}(k) \tag{p.m.f. for discrete $X$}\\ =& f_{X | T = t}(k) \tag{p.d.f. for continuous $X$}\\ \end{align*}

There is no conditioning in the classical framework, since there the parameter is treated as a deterministic $\theta$.

### Posterior Distribution

prior distribution: what we know about the underlying variable $T$ before seeing any data

\begin{align*} &Pr\{T = t\}\\ =& p_T(t) \tag{p.m.f. for discrete $T$}\\ =& f_T(t) \tag{p.d.f. for continuous $T$}\\ \end{align*}

posterior distribution (posterior probability): probabilities computed after seeing $X = k$

\begin{align*} Pr\{T = t | X = k\} \equiv& \frac{Pr\{X = k | T = t\}Pr\{T = t\}}{Pr\{X = k\}}\\ p_{T|X=k}(t) \equiv& \frac{p_{X | T = t}(k) p_T(t)}{p_{X}(k)} \tag{p.m.f. for discrete $T$, discrete $X$}\\ p_{T|X=k}(t) \equiv& \frac{f_{X | T = t}(k) p_T(t)}{f_{X}(k)} \tag{p.m.f. for discrete $T$, continuous $X$}\\ f_{T|X=k}(t) \equiv& \frac{p_{X | T = t}(k) f_T(t)}{p_{X}(k)} \tag{p.d.f. for continuous $T$, discrete $X$}\\ f_{T|X=k}(t) \equiv& \frac{f_{X | T = t}(k) f_T(t)}{f_{X}(k)} \tag{p.d.f. for continuous $T$, continuous $X$}\\ \end{align*}

// TODO

The idea is that we start with a prior distribution $Pr\{T = t\} = p_T(t)$ describing what we believe about the underlying variable before seeing data. We then update that belief based on the new data $X = k$.

Examples:

• $Pr\{T = "Good" | X = k\}$

• $Pr\{T = "Bad" | X = k\}$
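As a rough illustration of the posterior formula for discrete $T$ and discrete $X$, the sketch below computes $Pr\{T = t | X = k\}$ for $t \in \{\text{Good}, \text{Bad}\}$. The prior and likelihood numbers, and the helper name `posterior`, are hypothetical and chosen only for this example.

```python
def posterior(prior, likelihood, k):
    """Return Pr{T = t | X = k} for every t, using Bayes' rule.

    prior:      dict mapping t -> Pr{T = t}
    likelihood: dict mapping t -> dict mapping k -> Pr{X = k | T = t}
    """
    # Joint probability Pr{X = k | T = t} * Pr{T = t} for each candidate t.
    joint = {t: likelihood[t].get(k, 0.0) * prior[t] for t in prior}
    # Marginal Pr{X = k}, by the law of total probability.
    evidence = sum(joint.values())
    if evidence == 0:
        raise ValueError("observation has zero probability under the model")
    return {t: joint[t] / evidence for t in joint}


# Hypothetical numbers (not from the lecture): T is the state of a machine,
# X is the number of defects observed in a batch.
prior = {"Good": 0.9, "Bad": 0.1}
likelihood = {
    "Good": {0: 0.80, 1: 0.15, 2: 0.05},
    "Bad": {0: 0.20, 1: 0.30, 2: 0.50},
}

post = posterior(prior, likelihood, k=2)
print(post)
```

With these assumed numbers, observing $X = 2$ moves most of the probability mass to "Bad" even though the prior strongly favors "Good".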

Maximum A Posteriori (MAP) Estimate:

• given prior $p_T(t)$ or $Pr\{T = t\}$

• given the distribution of $X$ as a function of $T$ (the likelihood)

\begin{align*} \hat{T}_{MAP}(k) \in& \arg\max_t Pr\{T = t | X = k\}\\ =& \arg\max_t \frac{Pr\{X = k | T = t\}Pr\{T = t\}}{Pr\{X = k\}}\\ =& \arg\max_t Pr\{X = k | T = t\}Pr\{T = t\} \tag{$Pr\{X = k\}$ does not depend on $t$}\\ =& \arg\max_t p_{T, X}(t, k)\\ \end{align*}

Note that MAP is equivalent to maximizing the joint probability $p_{T, X}(t, k)$ over $t$.

When $p_T(t)$ is uniform, it does not depend on $t$, so it can be dropped from the maximization. The Maximum A Posteriori (MAP) estimate then coincides with the Maximum Likelihood Estimate (MLE).

The difference between MLE and MAP is whether the prior is included in the objective being maximized.
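The contrast between the two estimates can be sketched for a discrete $T$ as follows. The function names `map_estimate` and `ml_estimate` and all numbers are assumptions made for illustration; ties are broken by taking the first maximizer, which is one arbitrary choice among the elements of the $\arg\max$ set.

```python
def map_estimate(prior, likelihood, k):
    """Pick a t maximizing Pr{X = k | T = t} * Pr{T = t}."""
    return max(prior, key=lambda t: likelihood[t].get(k, 0.0) * prior[t])


def ml_estimate(likelihood, k):
    """Pick a t maximizing the likelihood Pr{X = k | T = t} alone."""
    return max(likelihood, key=lambda t: likelihood[t].get(k, 0.0))


# Hypothetical model (not from the lecture).
likelihood = {
    "Good": {0: 0.80, 1: 0.15, 2: 0.05},
    "Bad": {0: 0.20, 1: 0.30, 2: 0.50},
}
skewed_prior = {"Good": 0.9, "Bad": 0.1}
uniform_prior = {"Good": 0.5, "Bad": 0.5}

k = 1
map_skewed = map_estimate(skewed_prior, likelihood, k)
map_uniform = map_estimate(uniform_prior, likelihood, k)
ml = ml_estimate(likelihood, k)
print(map_skewed, map_uniform, ml)
```

For $k = 1$ under these assumed numbers, the skewed prior pulls the MAP estimate to "Good" while the likelihood alone favors "Bad"; with the uniform prior the MAP and ML estimates agree.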

### Error Probability

Error Probability of Estimator Theorem: among all estimators $\hat{T}$, the MAP estimator minimizes the error probability

\begin{align*} Pr\{\hat{T}(X) \neq T\} \end{align*}

Notice both $\hat{T}(X)$ and $T$ are random variables.

\begin{align*} \hat{T}_{MAP}(k) \in& \arg\max_t Pr\{X = k | T = t\}Pr\{T = t\}\\ =& \arg\max_t Pr\{X = k | T = t\} \tag{if $Pr\{T = t\}$ is uniform}\\ \hat{T}_{ML}(k) \in& \arg\max_t Pr\{X = k | T = t\} \tag{likelihood}\\ \end{align*}
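For a small discrete model, the claim that MAP minimizes $Pr\{\hat{T}(X) \neq T\}$ can be checked by brute force. The sketch below enumerates every deterministic estimator (every map from observations to values of $T$) and compares error probabilities. The model numbers and helper names are hypothetical; this is a sanity check on one example, not a proof.

```python
from itertools import product

# Hypothetical model (not from the lecture).
prior = {"Good": 0.9, "Bad": 0.1}
likelihood = {
    "Good": {0: 0.80, 1: 0.15, 2: 0.05},
    "Bad": {0: 0.20, 1: 0.30, 2: 0.50},
}
observations = [0, 1, 2]
states = list(prior)


def error_probability(rule):
    """Pr{rule(X) != T}, where rule is a dict mapping each k to a guess t."""
    # The estimator is correct exactly when T equals the guess made for the observed k.
    correct = sum(likelihood[rule[k]][k] * prior[rule[k]] for k in observations)
    return 1.0 - correct


# MAP maximizes the joint probability for each k; ML maximizes the likelihood only.
map_rule = {k: max(states, key=lambda t: likelihood[t][k] * prior[t]) for k in observations}
ml_rule = {k: max(states, key=lambda t: likelihood[t][k]) for k in observations}

# Every deterministic estimator: one guess per possible observation.
all_rules = [dict(zip(observations, choice))
             for choice in product(states, repeat=len(observations))]
best_error = min(error_probability(rule) for rule in all_rules)

print("MAP error:", error_probability(map_rule))
print("ML error: ", error_probability(ml_rule))
print("best over all rules:", best_error)
```

In this example the MAP rule attains the minimum over all eight rules, while the ML rule has a strictly larger error probability because it ignores the skewed prior.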

### Example: Signal and Noise

Given data $X = T + W$, with signal $T$ and noise $W$ independent, what is the MAP estimate of $T$ after observing $X = x$?

• Signal: $T \sim \text{Normal}(\mu, \sigma^2)$ where $\mu$ and $\sigma^2$ are known

• Noise: $W \sim \text{Normal}(0, \tau^2)$ where $\tau^2$ is known

We maximize the posterior, which is equivalent to maximizing the joint density:

\begin{align*} p_{T, X}(t, x) =& p_{X | T = t}(x) \cdot p_T(t)\\ =& p_{t + \text{Normal}(0, \tau^2)}(x) \cdot p_{\text{Normal}(\mu, \sigma^2)}(t)\\ =& p_{\text{Normal}(t, \tau^2)}(x) \cdot p_{\text{Normal}(\mu, \sigma^2)}(t)\\ \end{align*}

So we maximize $\ln p_{\text{Normal}(t, \tau^2)}(x) + \ln p_{\text{Normal}(\mu, \sigma^2)}(t)$ over $t$.

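Up to constants that do not depend on $t$, the log of the joint density is $-\frac{(x - t)^2}{2\tau^2} - \frac{(t - \mu)^2}{2\sigma^2}$. Setting its derivative with respect to $t$ to zero gives

\begin{align*} \hat{T}_{MAP}(x) =& \frac{(1/\sigma^2)\mu + (1/\tau^2)x}{(1/\sigma^2) + (1/\tau^2)}\\ \end{align*}

So $\hat{T}_{MAP}(x)$ is a weighted average of $\mu$ (with weight $\frac{1}{\sigma^2}$) and $x$ (with weight $\frac{1}{\tau^2}$). The larger the variance of a source, the smaller its weight. MAP therefore incorporates both the prior and the data.

Observe that $Pr\{\text{Error}\} = Pr\{\frac{(1/\sigma^2)\mu + (1/\tau^2)X}{(1/\sigma^2) + (1/\tau^2)} \neq T\} = 1$: because $T$ and $X$ are continuous, the estimate equals $T$ exactly with probability zero.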
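The weighted-average formula above can be sanity-checked numerically by maximizing the log joint density on a grid and comparing against the closed form. The parameter values, grid bounds, and function names below are arbitrary assumptions for illustration.

```python
import math


def log_normal_pdf(y, mean, var):
    """Log density of Normal(mean, var) evaluated at y."""
    return -0.5 * math.log(2 * math.pi * var) - (y - mean) ** 2 / (2 * var)


def map_closed_form(x, mu, sigma2, tau2):
    """Weighted average of mu and x with weights 1/sigma^2 and 1/tau^2."""
    w_prior, w_data = 1.0 / sigma2, 1.0 / tau2
    return (w_prior * mu + w_data * x) / (w_prior + w_data)


def map_grid_search(x, mu, sigma2, tau2, lo=-10.0, hi=10.0, steps=20001):
    """Maximize ln p(x | t) + ln p(t) over an evenly spaced grid of t values."""
    best_t, best_val = lo, -math.inf
    for i in range(steps):
        t = lo + (hi - lo) * i / (steps - 1)
        val = log_normal_pdf(x, t, tau2) + log_normal_pdf(t, mu, sigma2)
        if val > best_val:
            best_t, best_val = t, val
    return best_t


# Hypothetical parameters: a wide prior (sigma^2 = 4) and tighter noise (tau^2 = 1).
mu, sigma2, tau2, x = 0.0, 4.0, 1.0, 2.5
closed = map_closed_form(x, mu, sigma2, tau2)
grid = map_grid_search(x, mu, sigma2, tau2)
print(closed, grid)
```

Because the noise variance is smaller than the prior variance in this assumed setup, the estimate lands closer to the observation $x$ than to the prior mean $\mu$.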