Idea: treat the unknown parameter T not as deterministic, but as a random variable.
We define the likelihood by conditioning on the underlying random variable T: p_{X|T}(k \mid t) = Pr\{X = k \mid T = t\}
There is no conditioning in the classical framework, since there we treat the parameter as a deterministic \theta.
prior distribution: what we believe about the underlying variable before seeing data
posterior distribution (posterior probability): the distribution of T computed after seeing X = k
By Bayes' rule: Pr\{T = t \mid X = k\} = \frac{p_T(t) \, p_{X|T}(k \mid t)}{\sum_{t'} p_T(t') \, p_{X|T}(k \mid t')}
The idea is that we have a prior distribution Pr\{T = t\} = p_T(t) describing what we think the underlying variable is before seeing data. We then update our estimate of the underlying variable based on the new data X = k (a concrete sketch follows the examples below).
Examples:
Pr\{T = "Good" | X = k\}
Pr\{T = "Bad" | X = k\}
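As a concrete sketch of this update (my addition; all values here are hypothetical), suppose T is "Good" or "Bad" and, given T, X is Binomial with a success probability that depends on T:

```python
from math import comb

# Hypothetical setup (my values): T is "Good" or "Bad"; given T,
# X ~ Binomial(n, q_T) with per-trial success probability q_T.
prior = {"Good": 0.7, "Bad": 0.3}   # p_T(t): belief before seeing data
q = {"Good": 0.9, "Bad": 0.4}       # assumed per-trial success probabilities
n = 10                              # number of trials

def likelihood(k, t):
    """p_{X|T}(k | t): probability of observing X = k given T = t."""
    return comb(n, k) * q[t] ** k * (1 - q[t]) ** (n - k)

def posterior(k):
    """Pr{T = t | X = k} via Bayes' rule."""
    joint = {t: prior[t] * likelihood(k, t) for t in prior}
    evidence = sum(joint.values())  # sum over t' of p_T(t') p_{X|T}(k | t')
    return {t: joint[t] / evidence for t in joint}

print(posterior(k=5))  # X = 5 is much likelier under "Bad", so belief shifts there
```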
Maximum A Posteriori (MAP) Estimate:
given the prior p_T(t) = Pr\{T = t\}
given the distribution of X as a function of T, i.e. the likelihood p_{X|T}(k \mid t)
the MAP estimate is the t that maximizes the posterior: \hat{t}_{MAP} = \arg\max_t Pr\{T = t \mid X = k\}
Note that MAP is equivalent to maximizing the joint probability p_{T, X}(t, k) = p_T(t) \, p_{X|T}(k \mid t) over t, since the denominator p_X(k) does not depend on t.
When the prior p_T(t) is uniform, it does not depend on t. Therefore, the Maximum A Posteriori (MAP) estimate reduces to the Maximum Likelihood Estimate (MLE).
The difference between MLE and MAP is whether we include the prior in the objective we maximize.
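A small sketch of that difference (my addition; the data and the non-uniform prior are made up for illustration):

```python
from math import comb

n, k = 10, 9                            # hypothetical data: 9 successes in 10 trials
ts = [i / 100 for i in range(1, 100)]   # grid of candidate parameter values t

def likelihood(t):
    """p_{X|T}(k | t) for X ~ Binomial(n, t)."""
    return comb(n, k) * t ** k * (1 - t) ** (n - k)

def prior(t):
    """Assumed prior favoring moderate t (unnormalized; the
    normalizing constant does not change the argmax)."""
    return t * (1 - t)

t_mle = max(ts, key=likelihood)                          # argmax of likelihood alone
t_map = max(ts, key=lambda t: prior(t) * likelihood(t))  # argmax of prior * likelihood

print(t_mle, t_map)  # 0.9 vs 0.83: the prior pulls the MAP estimate toward the middle
```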
Error Probability of Estimator Theorem: the MAP estimator minimizes the error probability Pr\{\hat{T}(X) \neq T\} among all estimators.
Notice that both \hat{T}(X) and T are random variables.
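Why (a sketch of the standard argument, my addition): for any estimator \hat{T},
Pr\{\hat{T}(X) = T\} = \sum_k Pr\{X = k\} \, Pr\{T = \hat{T}(k) \mid X = k\}
and each term is maximized by choosing \hat{T}(k) to be a mode of the posterior Pr\{T = t \mid X = k\}, which is exactly the MAP rule.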
Given data X = T + W, what is the MAP estimate of T after observing X = x?
Signal: T \sim \text{Normal}(\mu, \sigma^2), where \mu and \sigma^2 are known
Noise: W \sim \text{Normal}(0, \tau^2), independent of T, where \tau^2 is known
We maximize the posterior. Since X \mid T = t \sim \text{Normal}(t, \tau^2), this is equivalent to maximizing \ln p_{\text{Normal}(t, \tau^2)}(x) + \ln p_{\text{Normal}(\mu, \sigma^2)}(t) over t.
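Filling in the algebra (my addition): dropping constants that do not depend on t, the objective is
-\frac{(x - t)^2}{2\tau^2} - \frac{(t - \mu)^2}{2\sigma^2}
Setting the derivative with respect to t to zero gives \frac{x - t}{\tau^2} = \frac{t - \mu}{\sigma^2}, i.e. t = \frac{(1/\sigma^2)\mu + (1/\tau^2)x}{(1/\sigma^2) + (1/\tau^2)}.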
Final Answer: \hat{T}_{MAP}(x) = \frac{(1/\sigma^2)\mu + (1/\tau^2)x}{(1/\sigma^2) + (1/\tau^2)}
Observe Pr\{\text{Error}\} = Pr\{\hat{T}_{MAP}(X) \neq T\} = 1, since T is continuous and a point estimate equals it with probability 0.
So \hat{T}_{MAP}(x) is a weighted average of \mu (with weight \frac{1}{\sigma^2}) and x (with weight \frac{1}{\tau^2}). The more variable a source is, the smaller its weight. We can see that MAP incorporates both the prior and the data.
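A quick numeric check (my addition; the parameter values are arbitrary examples) that the closed form matches a brute-force maximization of the log posterior:

```python
import numpy as np

# Example values (my choice): prior T ~ Normal(mu, sigma2), noise W ~ Normal(0, tau2).
mu, sigma2 = 2.0, 4.0
tau2 = 1.0
x = 5.0          # observed X = x

# Closed form: precision-weighted average of prior mean mu and observation x.
t_closed = ((mu / sigma2) + (x / tau2)) / ((1 / sigma2) + (1 / tau2))

# Brute force: maximize ln p_{Normal(t, tau2)}(x) + ln p_{Normal(mu, sigma2)}(t)
# over a fine grid of t, with constants dropped.
ts = np.linspace(-10, 10, 200001)
log_post = -(x - ts) ** 2 / (2 * tau2) - (ts - mu) ** 2 / (2 * sigma2)
t_grid = ts[np.argmax(log_post)]

print(t_closed, t_grid)  # both 4.4
```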