Midterm

Edit Distance

Minimum Edit Distance: minimum number of editing operations (insert, delete, subsitute) needed to transform one into the other.

To fill an entry: either +1 from down or right, or +2/+0 from diagonal depending whether characters match

To fill an entry: either +1 from down or right, or +2/+0 from diagonal depending whether characters match

If we remember our choice, we can build backtrace to find the actual edits.

Tokenization

Heaps(Herdan's) Law: V = kN^\beta where V is the number of unique words in a corpus of size N, and k and \beta are constants. Typically, \beta \approx 0.5.

Morpheme: the smallest meaning-bearing unit in a language. For example, "cats" consists of two morphemes: "cat" (the root) and "-s" (the plural suffix).

Unicode:

Bytes Byte pattern (binary) Meaning
1 0xxxxxxx ASCII (U+0000–U+007F)
2 110xxxxx 10xxxxxx U+0080–U+07FF
3 1110xxxx 10xxxxxx 10xxxxxx U+0800–U+FFFF
4 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx U+10000–U+10FFFF

Byte Pair Encoding (BPE), Unigram language modeling tokenization:

Corpora:

N-Grams

Probability Cheat Sheet:

Markov Assumption:P(\text{egg} | \text{I do not like green}) = \frac{\text{count}(\text{I do not like green egg})}{\text{count}(\text{I do not like green})} \simeq P(\text{egg} | \text{like green})

Unigram: P(w_1, ..., w_n) \approx \prod_{i=1}^n P(w_i) (assumes each word is independent)

Bigram: P(w_1, ..., w_n) \approx \prod_{i=1}^n P(w_i | w_{i-1}) (assumes each word depends only on the previous word)

N-gram: P(w_1, ..., w_n) \approx \prod_{i=1}^n P(w_i | w_{i-(n-1)}, ..., w_{i-1}) (assumes each word depends on the previous n-1 words)

Small n-gram fail because language has long-range dependencies

Bigram estimation of sentence probabilities: P(w_1, ..., w_n) \approx P(w_1 | <s>) \prod_{i=2}^n P(w_i | w_{i-1}) P(</s> | w_n) where <s> and </s> are start and end tokens.

Perplexity

Extrinsic evaluation: evaluate on a downstream task (e.g. machine translation, question answering, - time consuming)

Intrinsic evaluation: good model assigns a higher probability to the word that actually occurs.

Perplexity: ppl(w_1, ..., w_n) = 2^{-\frac{1}{n} \sum_{i=1}^n \log_2 P(w_i | w_{i-(n-1)}, ..., w_{i-1})}

Prevent Zero Probabilities

Bayes' Rule

P(A | B) = \frac{P(B | A) P(A)}{P(B)}

Likelihood: P(B | A), how likely we are to see the data if the hypothesis is true.

Prior: P(A), how likely we think the hypothesis is before seeing the data.

Marginal Likelihood: P(B), how likely we are to see the data under all possible hypotheses.

Since we only care about making \arg\max prediction, P(f) is constant across all classes, so we can ignore it:

\begin{align*} \text{classify}(f) &= \arg\max_{c \in C} P(c | f)\\ &= \arg\max_{c \in C} \frac{P(f, c)}{P(f)}\\ &= \arg\max_{c \in C} P(f, c) \tag{$P(f)$ constant}\\ &= \arg\max_{c \in C} p(c, f_1, ..., f_n)\\ &= \arg\max_{c \in C} p(f_1 | f_2, ..., f_n, c) ... p(f_{n-1} | f_n, c) p(f_n | c) p(c) \tag{chain rule}\\ &\approx \arg\max_{c \in C} p(f_1 | c) ... p(f_{n-1} | c) p(f_n | c) p(c) \tag{Assume $f_i$ are independent given $c$}\\ &= \arg\max_{c \in C} p(c) \prod_{i=1}^n p(f_i | c) \tag{Naive Bayes} \end{align*}

Based on above, if a word w never appear in one class c, then all document with w will have zero probability by classified as c. Smoothing (at inference): P(w | c) = \frac{count(w, c) + \alpha}{(\sum_{w' \in V} count(w', c)) + \alpha V} where \alpha is a smoothing parameter (usually 1) and V is the vocabulary size.

\text{classify}(f) = \arg\max_{c \in C} (\log p(c) + \sum_{i=1}^{|f|} \log p(f_i | c))

Log probabilities are for numerical stability.

Evaluation

TP: correctly predicted positive cases TN: correctly predicted negative cases FP: incorrectly predicted positive cases (originally negative) FN: incorrectly predicted negative cases (originally positive)

Accuracy: \frac{TP + TN}{TP + TN + FP + FN} (minimize miss-classification) Precision: \frac{TP}{TP + FP} (how many correct that is predicted as positive) Recall: \frac{TP}{TP + FN} (how many correct that is actually positive) F1 Score: 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}

The majority class (“not woke”) performs well, the minority class (“woke”) performs poorly. Then: Micro F1 will be high (dominated by majority class). Macro F1 will be lower (because minority class F1 pulls down the average).

Micro and Macro average precision

Micro and Macro average precision

micro F1 does not penalize minority misclassifications more; it effectively weights by class frequency; macro F1 would only be higher if the minority class performed very well.

Logistic Regression

Naive Bayes: \hat{c} = \arg\max_{c \in C} P(f, c) (generative model)

Logistic Regression: \hat{c} = \arg\max_{c \in C} P(c | f) (discriminative model)

Logistic Regression:

(Discrete) Cross-Entropy Loss (derived from Shannon Entropy):

L_{CE}(\hat{y}, y) = -\sum_{i=1}^{|C|} y_i \log \hat{y}_i

where y is true probability distribution (one-hot vector) and \hat{y} is predicted probability distribution of C classes.

w_{t+1} = w_t - \eta \frac{\partial}{\partial w} L_{CE}(\hat{y}, y)

Embedding

Meaning of word defined by its context (distributional hypothesis)

Meaning of a word is a function of

  1. global distribution of the type
  2. immediate context of the token
\cos(\vec{v}_1, \vec{v}_2) = \frac{\vec{v}_1 \cdot \vec{v}_2}{\|\vec{v}_1\| \|\vec{v}_2\|}

Dense Vector: not produced by count

Skip-gram:

Continuous Bag of Words (CBOW):

Feed-Forward Neural Networks

LR cannot model non-trivial interactions between features. (XOR is non-linear separable function) One hidden layer can solve XOR problem.

\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^K e^{z_j}}

where K is the number of classes and z_i is the input to the softmax function for class i.

Sigmoid: binary classification. Softmax: multi-class.

FFNN:

Regularization:

LLM

Encoder: bidirectional, for classification Encoder-Decoder: machine translation, speech recognition Decoder-Only: unidirectional, for generation (Causal LLM, AR)

Decoding:

Finetuning

  1. continued pretraining: adapt parameter to new data
  2. LoRA: add new parameter
  3. MLM (mask some tokens and predict them): train classification head
  4. Instruction Tuning

Sentence log likelihood: LL(w_{1:n}) = \log \prod_{i=1}^n P(w_i | w_{1:i-1}) = \sum_{i=1}^n \log P(w_i | w_{1:i-1})

Perplexity: ppl(w_{1:n}) = \sqrt[n]{\prod_{i=1}^n \frac{1}{P(w_i | w_{1:i-1})}} = P(w_{1:n})^{-\frac{1}{n}} (inverse probability \theta assign to test set, normalized by test set length, sensitive to length/tokenization)

Other concern: hallucination, privacy, bias, copyright, energy consumption, adult content, deduplication, dynamic data drift

Transformers

Self-Attention

Self-Attention

My Own Interpretation

My Own Interpretation

t_i^1 = LayerNorm(x)
t_i^2 = MultiHeadAttention(t_i^1, [x_1, ..., x_n])
t_i^3 = t_i^2 + x_i
t_i^4 = LayerNorm(t_i^3)
t_i^5 = FeedForward(t_i^4)
t_i^6 = t_i^5 + t_i^3

Post Training

Post Training

Preference Tuning:

Encoder LLMs

ELMo: context-dependent embedding for each word interpolates representations for that word from each layer of stacked bidirectional LSTM.

BERT (Bidirectional Encoder Representations from Transformers): pretrain with masked language modeling (MLM) and next sentence prediction (NSP) objectives.

Parameter-efficient fine-tuning:

Evaluation:

Retrieval

Goal: assign score to each document (can be approximated as cosine similarity between query and document)

Sparse Retrival

TF-IDF (Term Frequency-Inverse Document Frequency): don't let "the" dominate the similarity score

Note that a document contain no query term will have 0 score. In practice, we store a mapping from term to documents (instead of document to term), and pre-computed TF.

When we increase rank, precision jumps while recall increases.

Mean Average Precision: AP(q) = \frac{1}{R_r} \sum_{d \in R_r} \text{Precision}_r(d, q) where R_r is the set of relevant documents at or above rank r. MAP = \frac{1}{|Q|} \sum_{q \in Q} AP(q) where Q is the set of queries.

Dense Retrival

Single Encoder: \text{score}(q, d) = \text{softmax}(\text{Linear}(\text{BERT}(q; [SEP]; d)[CLS]))

Bi-Encoder: \text{score}(q, d) = \text{cosine}(\text{BERT}_q(q)[CLS], \text{BERT}_d(d)[CLS]) (cheaper, less accurate)

ColBERT: \text{score}(q, d) = \sum_{i=1}^{|q|} \max_{j=1}^{|d|} \text{cosine}(\text{BERT}_q(q_i), \text{BERT}_d(d_j)) (precompute each word vector in document)

Faiss: approximate nearest neighbor search in vector space.

RAG (Retrieval-Augmented Generation): retrieve k document (by sparse/dense/LLM function calling agent), feed it to prompt

Comparison

Aspect N-gram LM Feed-Forward LM Transformer-based Masked LM
Modeling Assumption ( (n-1) )-order Markov Fixed window neural approximation Global self-attention over sequence
Context Length Fixed, short Fixed, moderate Variable, up to model limit
Long-Range Dependencies ❌ None beyond window ❌ None beyond window ✅ Modeled via attention
Representation Type Discrete counts Dense embeddings Deep contextual embeddings
Parameter Sharing Minimal Moderate Extensive (layered attention blocks)
Intrinsic Metric (Perplexity / CE) High Moderate Very low (masked CE)
Data Efficiency (Small Data) Strong (with smoothing) Moderate Weak without pretraining
Scaling with Data Plateaus early Improves steadily Follows scaling laws; improves predictably
Transfer Learning ❌ None Limited ✅ Strong fine-tuning capability
Zero-shot Generalization Limited (better in autoregressive LLMs than MLMs)
Robustness to OOV ❌ Severe Mitigated via embeddings Subword tokenization mitigates strongly
Training Complexity Counting Backpropagation Large-scale distributed optimization
Optimization Stability Deterministic Stable Sensitive to LR schedules & scale
Parallelization (Training) Trivial Moderate Highly parallelizable
Inference Parallelism Trivial Per-token sequential Fully parallel across tokens (MLM)
Inference Latency Very low Low Moderate–High
Memory Footprint (Parameters) Large tables Moderate Very large (100M+)
Memory Footprint (Runtime) Small Moderate High (activations + attention maps)
Time Complexity (Sequence length L) ( \mathcal{O}(1) ) per token ( \mathcal{O}(1) ) per token ( \mathcal{O}(L^2) ) attention
Hardware Requirements CPU sufficient CPU/GPU GPU/TPU required
Interpretability High (explicit counts) Moderate Low (distributed representations)
Calibration Often poorly calibrated Better Often miscalibrated without tuning
Domain Adaptation Cost Low (re-count) Moderate retraining Fine-tuning required
Catastrophic Forgetting Not applicable Moderate Significant without care
Deployment Simplicity Very simple Moderate Complex (tokenizer, large weights)
Energy Consumption Minimal Moderate Very high
Carbon Footprint (Training) Negligible Moderate Substantial
Regulatory / Explainability Risk Low Medium High
Best Use Cases Embedded systems, ASR fallback Legacy MT, moderate NLP Search, QA, classification, semantic tasks
Failure Mode Data sparsity Context truncation Compute bottlenecks & memory limits

Table of Content