Lecture 006

Classification

Problem Formulation

Assume we have training data (feature vectors and labels):

\{(X_1, Y_1), (X_2, Y_2), (X_3, Y_3), ..., (X_n, Y_n)\}

We assume the pairs are i.i.d. draws from a common joint distribution:

(\forall i)(f_{X_i, Y_i}(x_i, y_i) \sim f_{X, Y}(x, y))

We want a classifier h(X) that minimizes the Generalization Error:

R(h) = Pr\{h(X) \neq Y\}

Bayes Classifier Given Distribution

Assume we know the joint distribution f_{X, Y}(x, y) = f_{X | Y = y}(x)f_{Y}(y) (or, equivalently, the likelihood f_{X | Y}(x) and the priors Pr\{Y = 1\}, Pr\{Y = 0\}). Then binary classification is equivalent to hypothesis testing using the MAP decision rule.

H_0: Y = 0, H_1: Y = 1

Bayes Classifier: since we know the joint p.d.f., we can compare the posteriors Pr\{Y = 0 | X = x\} and Pr\{Y = 1 | X = x\} (equivalently, the likelihoods Pr\{X = x | Y = 0\} and Pr\{X = x | Y = 1\} weighted by the priors).

h^*(x) = \begin{cases} 1 & \text{if } Pr\{Y = 1 | X = x\} > \frac{1}{2}\\ 0 & \text{otherwise} \end{cases}

We can write m: \mathbb{R}^d \to \mathbb{R} as a regression function:

m(x) = Pr\{Y = 1 | X = x\}

Since the MAP rule minimizes the Generalization Error, the Bayes Classifier is the optimal classifier.
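To make this concrete, here is a minimal sketch in Python, assuming a toy model (not from the lecture) with Gaussian likelihoods and known priors; it computes m(x) = Pr\{Y = 1 | X = x\} by Bayes' rule and applies the MAP threshold at 1/2.

```python
import numpy as np
from scipy.stats import norm

# Assumed toy model (not from the notes): X | Y=0 ~ N(0, 1), X | Y=1 ~ N(2, 1),
# with priors Pr{Y=0} = 0.7 and Pr{Y=1} = 0.3.
prior = {0: 0.7, 1: 0.3}
likelihood = {0: norm(loc=0.0, scale=1.0), 1: norm(loc=2.0, scale=1.0)}

def posterior_y1(x):
    """m(x) = Pr{Y = 1 | X = x} via Bayes' rule."""
    joint0 = likelihood[0].pdf(x) * prior[0]
    joint1 = likelihood[1].pdf(x) * prior[1]
    return joint1 / (joint0 + joint1)

def bayes_classifier(x):
    """h*(x) = 1 iff Pr{Y = 1 | X = x} > 1/2 (the MAP rule)."""
    return int(posterior_y1(x) > 0.5)

print(bayes_classifier(0.5), bayes_classifier(2.5))  # 0 1
```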

MAP Classifier Given Distribution Type

Use the data to estimate the true distribution:

Training: we use the estimated distribution as a classifier.

Training Error (empirical risk): the fraction of training data that the classifier h labels incorrectly.

\widehat{R}(h) = \frac{1}{n} \sum_{i = 1}^n \mathbb{1}_{\{h(X_i) \neq Y_i\}}

This is random because the training data is random.

Testing Error: to estimate the Generalization Error, we need a test set \{(T_1, L_1), ..., (T_m, L_m)\} that is independent of the training set. We hope the test error is an estimate of the Generalization Error.

\widetilde{R}(h) = \frac{1}{m} \sum_{i = 1}^m \mathbb{1}_{\{h(T_i) \neq L_i\}}

This is random because the test data is random. A good test set will approximate the Generalization Error.
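As a sketch (the classifier and data below are hypothetical placeholders, not from the lecture), both \widehat{R}(h) and \widetilde{R}(h) are the same misclassification fraction, just computed on different data sets:

```python
import numpy as np

def empirical_risk(h, X, Y):
    """Fraction of points (X_i, Y_i) that h misclassifies."""
    predictions = np.array([h(x) for x in X])
    return np.mean(predictions != Y)

# Hypothetical classifier and data, only to show the computation.
h = lambda x: int(x > 1.0)
X_train, Y_train = np.array([0.2, 1.5, 2.1, -0.3]), np.array([0, 1, 1, 0])
X_test,  Y_test  = np.array([0.9, 1.8]),            np.array([1, 1])

train_error = empirical_risk(h, X_train, Y_train)   # \widehat{R}(h)
test_error  = empirical_risk(h, X_test,  Y_test)    # \widetilde{R}(h)
```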

We can estimate the parameters of the distribution from the data (after first picking a distribution type) and then apply the MAP rule to the estimated distribution.
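For example, a minimal sketch assuming we pick a 1-D Gaussian distribution type for each class (an assumption, not specified in the notes): estimate the priors and the Gaussian parameters from the training data, then classify by MAP.

```python
import numpy as np
from scipy.stats import norm

def fit_gaussian_map(X, Y):
    """Estimate Pr{Y = y} and a Gaussian f_{X|Y=y} per class, then classify by MAP."""
    params = {}
    for y in (0, 1):
        X_y = X[Y == y]
        params[y] = (len(X_y) / len(X),               # estimated prior
                     X_y.mean(), X_y.std(ddof=1))     # estimated mean and std
    def h(x):
        scores = {y: p * norm(mu, sigma).pdf(x) for y, (p, mu, sigma) in params.items()}
        return max(scores, key=scores.get)            # argmax_y Pr{Y=y} f_{X|Y=y}(x)
    return h

# Hypothetical training data.
X = np.array([0.1, -0.4, 0.3, 2.2, 1.9, 2.5])
Y = np.array([0, 0, 0, 1, 1, 1])
h_hat = fit_gaussian_map(X, Y)
print(h_hat(0.0), h_hat(2.0))  # 0 1
```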

Linear Classifier

Linear classifier: w \in \mathbb{R}^d is the weight vector, w^Tx is a weighted sum of the features, and b \in \mathbb{R} is the bias.

h(x) = \begin{cases} 0 & \text{if } w^Tx + b \leq 0\\ 1 & \text{otherwise} \end{cases}
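A minimal sketch of this decision rule, with a hypothetical weight vector and bias:

```python
import numpy as np

def linear_classifier(x, w, b):
    """h(x) = 0 if w^T x + b <= 0, else 1."""
    return 0 if w @ x + b <= 0 else 1

# Hypothetical parameters in d = 2 dimensions.
w = np.array([1.0, -2.0])
b = 0.5
print(linear_classifier(np.array([3.0, 1.0]), w, b))  # w^T x + b = 1.5  -> 1
print(linear_classifier(np.array([0.0, 1.0]), w, b))  # w^T x + b = -1.5 -> 0
```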

Support Vector Machine (SVM): learns a linear classifier that maximizes the separation (margin) between data points of different labels.
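As a usage sketch (assuming scikit-learn is available; the toy data below is hypothetical), a linear SVM can be fit to training data and the learned w and b inspected:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Hypothetical linearly separable toy data.
X = np.array([[0.0, 0.0], [0.5, 0.5], [3.0, 3.0], [3.5, 2.5]])
y = np.array([0, 0, 1, 1])

svm = LinearSVC(C=1.0)   # learns a linear decision boundary w^T x + b = 0
svm.fit(X, y)

w, b = svm.coef_[0], svm.intercept_[0]
print(w, b, svm.predict([[2.0, 2.0]]))
```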
