Assuming we have training data (feature vectors and labels):
We assume the pairs (X_i, Y_i) are drawn i.i.d. from the joint distribution.
We need a function h(X) that minimizes the Generalization Error R(h) = Pr\{h(X) \neq Y\}.
Assume we know the joint distribution f_{X, Y}(x, y) = f_{X | Y = y}(x) f_{Y}(y) (equivalently, the likelihoods f_{X | Y = y}(x) and the priors Pr\{Y = 1\}, Pr\{Y = 0\}); then binary classification is equivalent to hypothesis testing with the MAP decision rule.
Bayes Classifier: given that we know the joint p.d.f., we can compare the posteriors Pr\{Y = 0 | X = x\} and Pr\{Y = 1 | X = x\} (equivalently, f_{X | Y = 0}(x) Pr\{Y = 0\} and f_{X | Y = 1}(x) Pr\{Y = 1\}) and output the label with the larger value.
We can write the regression function m: \mathbb{R}^d \to \mathbb{R} as m(x) = E[Y | X = x] = Pr\{Y = 1 | X = x\}; the Bayes Classifier outputs 1 exactly when m(x) \geq 1/2.
Since the MAP rule minimizes the probability of error, the Bayes Classifier is optimal: no classifier achieves a lower Generalization Error.
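In symbols, one standard way to write this rule (same notation as above; h^* denotes the Bayes Classifier):

h^*(x) = \arg\max_{y \in \{0, 1\}} Pr\{Y = y | X = x\} = \arg\max_{y \in \{0, 1\}} f_{X | Y = y}(x) Pr\{Y = y\}

where the second equality follows from Bayes' rule, since the denominator f_X(x) does not depend on y.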
Use data to estimate the true distribution:
training data: (X_1, Y_1), (X_2, Y_2), ..., (X_n, Y_n)
testing data: (T_1, L_1), (T_2, L_2), ..., (T_m, L_m)
Training: we use the estimated distribution as a classifier.
Training Error (empirical risk): the fraction of training data that gets classified to the wrong label under classifier h.
This is random because the training data is random.
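Written out, the empirical risk on the training set is (a standard form, using the notation above):

\hat{R}_n(h) = \frac{1}{n} \sum_{i = 1}^{n} \mathbb{1}\{h(X_i) \neq Y_i\}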
Testing Error: since we need a way to estimate the Generalization Error, we use a test set that is independent of the training set, and hope the test error is a good estimate of the Generalization Error.
This is random because the test data is random. A good test set will approximate the Generalization Error well.
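A minimal Python sketch of both error estimates (the toy data and the classifier h here are hypothetical, purely for illustration):

```python
import numpy as np

def error_rate(h, X, y):
    """Fraction of points whose predicted label differs from the true label."""
    predictions = np.array([h(x) for x in X])
    return np.mean(predictions != y)

# Hypothetical data and classifier, for illustration only.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(100, 2)), rng.integers(0, 2, size=100)
X_test, y_test = rng.normal(size=(50, 2)), rng.integers(0, 2, size=50)

h = lambda x: int(x.sum() > 0)  # a toy classifier

print("training error:", error_rate(h, X_train, y_train))
print("testing error: ", error_rate(h, X_test, y_test))
```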
We can estimate the parameters of the distribution from data (we still need to pick a distribution family first) and apply the MAP rule to the fitted distribution.
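For instance, a minimal sketch that models each class-conditional as a one-dimensional Gaussian (the Gaussian family is an assumption made here for illustration, not something the notes prescribe):

```python
import numpy as np
from scipy.stats import norm

def fit_gaussian_map(X, y):
    """Fit per-class Gaussian likelihoods and priors, return a MAP classifier."""
    params = {}
    for label in (0, 1):
        Xc = X[y == label]
        # (estimated mean, estimated std, estimated prior Pr{Y = label})
        params[label] = (Xc.mean(), Xc.std(ddof=1), len(Xc) / len(X))

    def h(x):
        # MAP rule: pick the label maximizing likelihood * prior.
        scores = {lbl: norm.pdf(x, mu, sd) * p for lbl, (mu, sd, p) in params.items()}
        return max(scores, key=scores.get)

    return h

# Usage on hypothetical 1-d data:
X = np.array([-1.2, -0.8, -1.0, 0.9, 1.1, 1.3])
y = np.array([0, 0, 0, 1, 1, 1])
h = fit_gaussian_map(X, y)
print(h(0.5))  # predicted label
```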
Linear classifier: h(x) = 1 if w^T x + b \geq 0 and h(x) = 0 otherwise, where w \in \mathbb{R}^d is the weight vector, w^T x is a weighted sum of the features, and b \in \mathbb{R} is the bias.
Support Vector Machine (SVM): learns a linear classifier that maximizes the separation (margin) between data points of different labels.
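A minimal sketch using scikit-learn's LinearSVC (assuming scikit-learn is available; the toy data is hypothetical):

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy linearly separable data, for illustration only.
X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])

clf = LinearSVC(C=1.0).fit(X, y)  # fits a (soft-margin) linear classifier
w, b = clf.coef_[0], clf.intercept_[0]

print("w =", w, "b =", b)
print("prediction for [0.1, 0.2]:", clf.predict([[0.1, 0.2]])[0])
```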