# Lecture 006

## Classification

### Problem Formation

Assume we have training data (feature vector and label):

\{(X_1, Y_1), (X_2, Y_2), (X_3, Y_3), ..., (X_n, Y_n)\}

We assume the pairs are i.i.d., i.e. each pair is drawn from the same joint distribution:

(\forall i)(f_{X_i, Y_i}(x_i, y_i) = f_{X, Y}(x_i, y_i))

We want a classifier $h(X)$ that minimizes the Generalization Error:

R(h) = Pr\{h(X) \neq Y\}

### Bayes Classifier Given Distribution

Assume we know the joint distribution $f_{X, Y}(x, y) = f_{X | Y = y}(x)f_{Y}(y)$ (or the likelihood $f_{X | Y}(x)$ and the prior $Pr\{Y = 1\}, Pr\{Y = 0\}$). Then binary classification is equivalent to hypothesis testing with the MAP decision rule.

H_0: Y = 0, H_1: Y = 1

Bayes Classifier: given the joint p.d.f., we can compare the posterior probabilities $Pr\{Y = 0 | X = x\}$ and $Pr\{Y = 1 | X = x\}$.

h^*(x) = \begin{cases} 1 & \text{if } Pr\{Y = 1 | X = x\} > \frac{1}{2}\\ 0 & \text{otherwise} \end{cases}

We can define the regression function $m: \mathbb{R}^d \to \mathbb{R}$ as the posterior:

m(x) = Pr\{Y = 1 | X = x\}

Since MAP minimizes the Generalization Error, the Bayes Classifier is the optimal classifier.
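As a concrete sketch, suppose the class-conditional distributions are one-dimensional Gaussians with known means and a shared variance (a hypothetical setup, not from the lecture): then the Bayes Classifier just thresholds the posterior $m(x) = Pr\{Y = 1 | X = x\}$ at $\frac{1}{2}$.

```python
import math

# Hypothetical model: X | Y=y ~ N(mu_y, sigma^2), with known priors.
MU = {0: -1.0, 1: 1.0}
SIGMA = 1.0
PRIOR = {0: 0.5, 1: 0.5}

def gaussian_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def posterior_y1(x):
    # m(x) = Pr{Y = 1 | X = x} via Bayes' rule
    num = gaussian_pdf(x, MU[1], SIGMA) * PRIOR[1]
    den = num + gaussian_pdf(x, MU[0], SIGMA) * PRIOR[0]
    return num / den

def bayes_classifier(x):
    # h*(x) = 1 iff Pr{Y = 1 | X = x} > 1/2
    return 1 if posterior_y1(x) > 0.5 else 0
```

With equal priors and symmetric means, the decision boundary falls at $x = 0$, where the posterior is exactly $\frac{1}{2}$.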

### MAP Classifier Given Distribution Type

Use data to estimate true distribution:

• training data: $(X_1, Y_1), (X_2, Y_2), ..., (X_n, Y_n)$

• testing data: $(T_1, L_1), (T_2, L_2), ..., (T_m, L_m)$

Training: we estimate the distribution from the training data and use the estimated distribution as a classifier.

Training Error (empirical risk): the fraction of training data that gets classified to the wrong label under classifier $h$.

\widehat{R}(h) = \frac{1}{n} \sum_{i = 1}^n \mathbb{1}_{\{h(X_i) \neq Y_i\}}

This is random because the training data is random.
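The empirical risk $\widehat{R}(h)$ is just a count of mistakes over the training set. A minimal sketch, with a hypothetical threshold classifier and toy data:

```python
def empirical_risk(h, xs, ys):
    # \hat{R}(h): fraction of points where h's prediction disagrees with the label
    return sum(h(x) != y for x, y in zip(xs, ys)) / len(xs)

# Hypothetical classifier and training data: threshold at 0
h = lambda x: 1 if x > 0 else 0
xs = [-2.0, -1.0, 0.5, 1.5]
ys = [0, 1, 1, 1]
print(empirical_risk(h, xs, ys))  # 0.25: one of the four points is misclassified
```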

Testing Error: since we need a way to estimate the Generalization Error, we use a test set that is independent of the training set, and hope that the test error approximates the Generalization Error.

\widetilde{R}(h) = \frac{1}{m} \sum_{i = 1}^m \mathbb{1}_{\{h(T_i) \neq L_i\}}

This is random because the test data is random. A good test set will approximate the Generalization Error.
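To see the test error concentrating around the Generalization Error, we can simulate a distribution where the true error is computable by hand. In this hypothetical setup, $Y \sim \text{Bernoulli}(\frac{1}{2})$ and $X \mid Y = y \sim \text{Uniform}(y - 1, y + 1)$; for the threshold rule below, the true error works out to $0.25$.

```python
import random

random.seed(0)

# Hypothetical distribution: Y ~ Bernoulli(1/2), X | Y=y ~ Uniform(y-1, y+1)
def sample(n):
    data = []
    for _ in range(n):
        y = random.randint(0, 1)
        x = random.uniform(y - 1, y + 1)
        data.append((x, y))
    return data

def h(x):
    # classifier under test: threshold at 1/2
    return 1 if x > 0.5 else 0

def error_rate(h, data):
    return sum(h(x) != y for x, y in data) / len(data)

test = sample(10_000)
print(error_rate(h, test))  # close to the true generalization error 0.25
```

On a large independent test set the estimate $\widetilde{R}(h)$ lands near $R(h)$ by the law of large numbers.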

We can estimate the parameters of the distribution from the data (we still need to pick a distribution type first) and then apply the MAP rule to the estimated distribution.
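A plug-in sketch of this idea, under the assumed model $X \mid Y = y \sim N(\mu_y, 1)$ with equal priors (hypothetical numbers): estimate each class mean from the training data, then the MAP rule reduces to picking the closer estimated mean.

```python
# Assumed model: X | Y=y ~ N(mu_y, 1), equal priors.
def fit_means(data):
    # maximum-likelihood estimate of each class mean
    means = {}
    for label in (0, 1):
        xs = [x for x, y in data if y == label]
        means[label] = sum(xs) / len(xs)
    return means

def map_classify(x, means):
    # with equal priors and equal variances, MAP picks the closer mean
    return min((0, 1), key=lambda y: abs(x - means[y]))

train = [(-1.2, 0), (-0.8, 0), (-1.0, 0), (0.9, 1), (1.1, 1), (1.0, 1)]
means = fit_means(train)          # estimated means near -1.0 and 1.0
print(map_classify(0.4, means))   # 1: closer to the estimated mu_1
```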

### Linear Classifier

Linear classifier: $w \in \mathbb{R}^d$ is the weight vector, $w^Tx$ is a weighted sum of the features, and $b \in \mathbb{R}$ is the bias.

h(x) = \begin{cases} 0 & \text{if } w^Tx + b \leq 0\\ 1 & \text{otherwise}\\ \end{cases}
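The rule above translates directly into code; a minimal sketch with hypothetical weights:

```python
def linear_classifier(w, b):
    # h(x) = 0 if w^T x + b <= 0, else 1
    def h(x):
        s = sum(wi * xi for wi, xi in zip(w, x)) + b
        return 1 if s > 0 else 0
    return h

h = linear_classifier(w=[1.0, -2.0], b=0.5)
print(h([2.0, 0.0]))  # 1: 1*2 - 2*0 + 0.5 = 2.5 > 0
print(h([0.0, 1.0]))  # 0: 1*0 - 2*1 + 0.5 = -1.5 <= 0
```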

Support Vector Machine (SVM): learns a linear classifier that maximizes the separation between data points of different labels.
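One way to train such a classifier is subgradient descent on the regularized hinge loss (a Pegasos-style soft-margin sketch, not the lecture's exact formulation; the toy data, step-size schedule, and $\lambda$ below are assumptions). Note the conventional $\pm 1$ labels rather than the $\{0, 1\}$ labels used above.

```python
import random

random.seed(1)

def train_linear_svm(data, lam=0.01, epochs=200):
    # Soft-margin SVM via subgradient descent on the hinge loss.
    d = len(data[0][0])
    w, b = [0.0] * d, 0.0
    t = 0
    for _ in range(epochs):
        random.shuffle(data)
        for x, y in data:                 # labels y in {-1, +1}
            t += 1
            eta = 1.0 / (lam * t)         # decaying step size
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            w = [wi * (1 - eta * lam) for wi in w]   # regularization shrinkage
            if margin < 1:                # point inside the margin: push it out
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
                b += eta * y
    return w, b

# Linearly separable toy data (hypothetical)
data = [([-2.0, -1.0], -1), ([-1.5, -0.5], -1), ([1.5, 1.0], 1), ([2.0, 0.5], 1)]
w, b = train_linear_svm(list(data))
preds = [1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1 for x, _ in data]
print(preds)
```

On separable data like this, the learned hyperplane classifies all four points correctly; the hinge loss term is what pushes the boundary away from the points to enlarge the margin.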
