Lecture 006

Classification

Problem Formulation

Assume we have training data (feature vectors and labels):

\{(X_1, Y_1), (X_2, Y_2), (X_3, Y_3), ..., (X_n, Y_n)\}

We assume the pairs are i.i.d. draws from a common joint distribution:

(\forall i)(f_{X_i, Y_i}(x_i, y_i) \sim f_{X, Y}(x, y))

We want a classifier h(X) that minimizes the Generalization Error:

R(h) = Pr\{h(X) \neq Y\}

Bayes Classifier Given Distribution

Assume we know the joint distribution f_{X, Y}(x, y) = f_{X | Y = y}(x)f_{Y}(y) (or, equivalently, the likelihood f_{X | Y}(x) and the priors Pr\{Y = 1\}, Pr\{Y = 0\}). Then binary classification is equivalent to hypothesis testing using the MAP decision rule.

H_0: Y = 0, H_1: Y = 1

Bayes Classifier: since we know the joint p.d.f., we can compare the posteriors Pr\{Y = 0 | X = x\} and Pr\{Y = 1 | X = x\} (equivalently, the likelihoods Pr\{X = x | Y = 0\} and Pr\{X = x | Y = 1\} weighted by the priors).

h^*(x) = \begin{cases} 1 & \text{if } Pr\{Y = 1 | X = x\} > \frac{1}{2}\\ 0 & \text{otherwise} \end{cases}

We can write m: \mathbb{R}^d \to \mathbb{R} as a regression function:

m(x) = Pr\{Y = 1 | X = x\}

Since the MAP rule minimizes the Generalization Error, the Bayes Classifier is the optimal classifier.
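To make this concrete, here is a minimal sketch in Python, assuming a toy model (not from the lecture) with Gaussian likelihoods and known priors; it computes m(x) = Pr\{Y = 1 | X = x\} by Bayes' rule and applies the MAP threshold at 1/2.

```python
import numpy as np
from scipy.stats import norm

# Assumed toy model (not from the notes): X | Y=0 ~ N(0, 1), X | Y=1 ~ N(2, 1),
# with priors Pr{Y=0} = 0.7 and Pr{Y=1} = 0.3.
prior = {0: 0.7, 1: 0.3}
likelihood = {0: norm(loc=0.0, scale=1.0), 1: norm(loc=2.0, scale=1.0)}

def posterior_y1(x):
    """m(x) = Pr{Y = 1 | X = x} via Bayes' rule."""
    joint0 = likelihood[0].pdf(x) * prior[0]
    joint1 = likelihood[1].pdf(x) * prior[1]
    return joint1 / (joint0 + joint1)

def bayes_classifier(x):
    """h*(x) = 1 iff Pr{Y = 1 | X = x} > 1/2 (the MAP rule)."""
    return int(posterior_y1(x) > 0.5)

print(bayes_classifier(0.5), bayes_classifier(2.5))  # 0 1
```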

MAP Classifier Given Distribution Type

Use the data to estimate the true distribution:

Training: we use the estimated distribution as a classifier.

Training Error (empirical risk): the fraction of training data that the classifier h labels incorrectly.

\widehat{R}(h) = \frac{1}{n} \sum_{i = 1}^n \mathbb{1}_{\{h(X_i) \neq Y_i\}}

This is random because the training data is random.

Testing Error: to estimate the Generalization Error, we need a test set \{(T_1, L_1), ..., (T_m, L_m)\} that is independent of the training set. We hope the test error is an estimate of the Generalization Error.

\widetilde{R}(h) = \frac{1}{m} \sum_{i = 1}^m \mathbb{1}_{\{h(T_i) \neq L_i\}}

This is random because the test data is random. A good test set will approximate the Generalization Error.
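As a sketch (the classifier and data below are hypothetical placeholders, not from the lecture), both \widehat{R}(h) and \widetilde{R}(h) are the same misclassification fraction, just computed on different data sets:

```python
import numpy as np

def empirical_risk(h, X, Y):
    """Fraction of points (X_i, Y_i) that h misclassifies."""
    predictions = np.array([h(x) for x in X])
    return np.mean(predictions != Y)

# Hypothetical classifier and data, only to show the computation.
h = lambda x: int(x > 1.0)
X_train, Y_train = np.array([0.2, 1.5, 2.1, -0.3]), np.array([0, 1, 1, 0])
X_test,  Y_test  = np.array([0.9, 1.8]),            np.array([1, 1])

train_error = empirical_risk(h, X_train, Y_train)   # \widehat{R}(h)
test_error  = empirical_risk(h, X_test,  Y_test)    # \widetilde{R}(h)
```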

We can estimate the parameters of the distribution from the data (after first picking a distribution type) and then apply the MAP rule to the estimated distribution.
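For example, a minimal sketch assuming we pick a 1-D Gaussian distribution type for each class (an assumption, not specified in the notes): estimate the priors and the Gaussian parameters from the training data, then classify by MAP.

```python
import numpy as np
from scipy.stats import norm

def fit_gaussian_map(X, Y):
    """Estimate Pr{Y = y} and a Gaussian f_{X|Y=y} per class, then classify by MAP."""
    params = {}
    for y in (0, 1):
        X_y = X[Y == y]
        params[y] = (len(X_y) / len(X),               # estimated prior
                     X_y.mean(), X_y.std(ddof=1))     # estimated mean and std
    def h(x):
        scores = {y: p * norm(mu, sigma).pdf(x) for y, (p, mu, sigma) in params.items()}
        return max(scores, key=scores.get)            # argmax_y Pr{Y=y} f_{X|Y=y}(x)
    return h

# Hypothetical training data.
X = np.array([0.1, -0.4, 0.3, 2.2, 1.9, 2.5])
Y = np.array([0, 0, 0, 1, 1, 1])
h_hat = fit_gaussian_map(X, Y)
print(h_hat(0.0), h_hat(2.0))  # 0 1
```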

Linear Classifier

Linear classifier: w \in \mathbb{R}^d is the weight vector, w^Tx is a weighted sum of the features, and b \in \mathbb{R} is the bias.

h(x) = \begin{cases} 0 & \text{if } w^Tx + b \leq 0\\ 1 & \text{otherwise} \end{cases}
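A minimal sketch of this decision rule, with a hypothetical weight vector and bias:

```python
import numpy as np

def linear_classifier(x, w, b):
    """h(x) = 0 if w^T x + b <= 0, else 1."""
    return 0 if w @ x + b <= 0 else 1

# Hypothetical parameters in d = 2 dimensions.
w = np.array([1.0, -2.0])
b = 0.5
print(linear_classifier(np.array([3.0, 1.0]), w, b))  # w^T x + b = 1.5  -> 1
print(linear_classifier(np.array([0.0, 1.0]), w, b))  # w^T x + b = -1.5 -> 0
```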

Support Vector Machine (SVM): learns a linear classifier that maximizes the separation (margin) between data points of different labels.
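As a usage sketch (assuming scikit-learn is available; the toy data below is hypothetical), a linear SVM can be fit to training data and the learned w and b inspected:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Hypothetical linearly separable toy data.
X = np.array([[0.0, 0.0], [0.5, 0.5], [3.0, 3.0], [3.5, 2.5]])
y = np.array([0, 0, 1, 1])

svm = LinearSVC(C=1.0)   # learns a linear decision boundary w^T x + b = 0
svm.fit(X, y)

w, b = svm.coef_[0], svm.intercept_[0]
print(w, b, svm.predict([[2.0, 2.0]]))
```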
