Midterm

Edit Distance

Minimum Edit Distance: minimum number of editing operations (insert, delete, subsitute) needed to transform one into the other.

$D(i, j)$ = minimum edit distance between the first $i$ characters of string 1 and the first $j$ characters of string 2
Initialization: $D(0, j) = j$ and $D(i, 0) = i$
Recurrence: $D(i, j) = min(D(i-1, j) + 1, D(i, j-1) + 1, D(i-1, j-1) + cost)$ where $cost = 0$ if $s_1[i] = s_2[j]$ else $cost = 2$
Time $O(|s_1||s_2|)$ , space $O(|s_1||s_2|)$ , backtrace $O(|s_1| + |s_2|)$

To fill an entry: either +1 from down or right, or +2/+0 from diagonal depending whether characters match

If we remember our choice, we can build backtrace to find the actual edits.

Tokenization

Heaps(Herdan's) Law: $V = kN^\beta$ where $V$ is the number of unique words in a corpus of size $N$ , and $k$ and $\beta$ are constants. Typically, $\beta \approx 0.5$ .

Morpheme: the smallest meaning-bearing unit in a language. For example, "cats" consists of two morphemes: "cat" (the root) and "-s" (the plural suffix).

Root: e.g., "work" in "working"
Affix: e.g., "-ed" in "worked"
Inflectional morphemes: modify a word's tense, number, etc. without changing its core meaning (e.g., "-ed" for past tense).
Derivational morphemes: create a new word with a new meaning (e.g., "careful" from "care").
Clitics: syntactically behave like words but are phonologically attached to another word (e.g., "'ve" in "I've").
Morphological typology: classifies languages based on their morphological structure, such as agglutinative (clear morpheme boundaries) vs. fusional (conflated morphemes).

Unicode:

ASCII: 7-bit encoding, 128 characters (95 printable) e.g. 0x00 to 0x7F
Unicode code points: e.g. U+0061 for 'a'. First 127 code points match ASCII.
UTF-8: directly store code point is long, so uses variable-length encoding:
- 1 byte for ASCII (0-127)
- 2 bytes for most European/Middle Eastern/African scripts
- 3 bytes for most CJK characters
- 4 bytes for rarer CJK characters and emojis/symbols
- fewer bytes for "common" characters
- Python len counts Unicode code points, not bytes.
- friendly for file corruption (can find character boundaries)

Bytes	Byte pattern (binary)	Meaning
1	`0xxxxxxx`	ASCII (U+0000–U+007F)
2	`110xxxxx 10xxxxxx`	U+0080–U+07FF
3	`1110xxxx 10xxxxxx 10xxxxxx`	U+0800–U+FFFF
4	`11110xxx 10xxxxxx 10xxxxxx 10xxxxxx`	U+10000–U+10FFFF

Byte Pair Encoding (BPE), Unigram language modeling tokenization:

Other methods:
- white-space: not general to language
- unicode / characters: too small as token
- morphemes: hard to identify and define
Method: iteratively merge frequent pairs of tokens
- choose most frequent pair of token
- make the two into a new token
- replace all occurrences of the pair with the new token
Addition:
- no merge across word boundaries
Encoder: greedily merge in the order we leaned

Corpora:

dimensions: variety (e.g. African English), genre (e.g. news), demographics (e.g. age)

N-Grams

Probability Cheat Sheet:

$Pr(A | B) = \frac{Pr(A, B)}{Pr(B)}$

Markov Assumption: $P(\text{egg} | \text{I do not like green}) = \frac{\text{count}(\text{I do not like green egg})}{\text{count}(\text{I do not like green})} \simeq P(\text{egg} | \text{like green})$

Unigram: $P(w_1, ..., w_n) \approx \prod_{i=1}^n P(w_i)$ (assumes each word is independent)

Bigram: $P(w_1, ..., w_n) \approx \prod_{i=1}^n P(w_i | w_{i-1})$ (assumes each word depends only on the previous word)

N-gram: $P(w_1, ..., w_n) \approx \prod_{i=1}^n P(w_i | w_{i-(n-1)}, ..., w_{i-1})$ (assumes each word depends on the previous n-1 words)

Small n-gram fail because language has long-range dependencies

Bigram estimation of sentence probabilities: $P(w_1, ..., w_n) \approx P(w_1 | <s>) \prod_{i=2}^n P(w_i | w_{i-1}) P(</s> | w_n)$ where $<s>$ and $</s>$ are start and end tokens.

Perplexity

Extrinsic evaluation: evaluate on a downstream task (e.g. machine translation, question answering, - time consuming)

Intrinsic evaluation: good model assigns a higher probability to the word that actually occurs.

Perplexity: $ppl(w_1, ..., w_n) = 2^{-\frac{1}{n} \sum_{i=1}^n \log_2 P(w_i | w_{i-(n-1)}, ..., w_{i-1})}$

lower perplexity = better model, minimizing perplexity is maximizing probability
$ppl$ is average bits required to encode each word under this model (exponent is entropy: average negative log likelihood per word)

Prevent Zero Probabilities

Smoothing: This prevents zero probabilities for unseen bigrams. This introduce bias while correct variance of finite dataset.
- $P(w_i | w_{i-1}) = \frac{count(w_{i-1}, w_i) + \alpha}{count(w_{i-1}) + \alpha V}$ where $\alpha$ is a smoothing parameter (usually 1) and $V$ is the vocabulary size.
Backoff: use bigram or unigram
Interpolation: mix bigram, unigram, trigram stats (better) $P(w_i | w_{i-1}, w_{i-2}) = \lambda_1 P(w_i | w_{i-1}, w_{i-2}) + \lambda_2 P(w_i | w_{i-1}) + \lambda_3 P(w_i)$ where $\lambda_1 + \lambda_2 + \lambda_3 = 1$ .
Extended interpolated Kneser-Ney smoothing: SOTA
Unseen words: choose fix vocab size $V$ , and replace unknown words as <UNK> in training set and predict <UNK>. (or use sub-word solutions like BPE)

Bayes' Rule

$P(A | B) = \frac{P(B | A) P(A)}{P(B)}$

Likelihood: $P(B | A)$ , how likely we are to see the data if the hypothesis is true.

Prior: $P(A)$ , how likely we think the hypothesis is before seeing the data.

Marginal Likelihood: $P(B)$ , how likely we are to see the data under all possible hypotheses.

Since we only care about making $\arg\max$ prediction, $P(f)$ is constant across all classes, so we can ignore it:

$\begin{align*} \text{classify}(f) &= \arg\max_{c \in C} P(c | f)\\ &= \arg\max_{c \in C} \frac{P(f, c)}{P(f)}\\ &= \arg\max_{c \in C} P(f, c) \tag{$P(f)$ constant}\\ &= \arg\max_{c \in C} p(c, f_1, ..., f_n)\\ &= \arg\max_{c \in C} p(f_1 | f_2, ..., f_n, c) ... p(f_{n-1} | f_n, c) p(f_n | c) p(c) \tag{chain rule}\\ &\approx \arg\max_{c \in C} p(f_1 | c) ... p(f_{n-1} | c) p(f_n | c) p(c) \tag{Assume $f_i$ are independent given $c$}\\ &= \arg\max_{c \in C} p(c) \prod_{i=1}^n p(f_i | c) \tag{Naive Bayes} \end{align*}$

Based on above, if a word $w$ never appear in one class $c$ , then all document with $w$ will have zero probability by classified as $c$ . Smoothing (at inference): $P(w | c) = \frac{count(w, c) + \alpha}{(\sum_{w' \in V} count(w', c)) + \alpha V}$ where $\alpha$ is a smoothing parameter (usually 1) and $V$ is the vocabulary size.

$\text{classify}(f) = \arg\max_{c \in C} (\log p(c) + \sum_{i=1}^{|f|} \log p(f_i | c))$

Log probabilities are for numerical stability.

Evaluation

TP: correctly predicted positive cases TN: correctly predicted negative cases FP: incorrectly predicted positive cases (originally negative) FN: incorrectly predicted negative cases (originally positive)

Accuracy: $\frac{TP + TN}{TP + TN + FP + FN}$ (minimize miss-classification) Precision: $\frac{TP}{TP + FP}$ (how many correct that is predicted as positive) Recall: $\frac{TP}{TP + FN}$ (how many correct that is actually positive) F1 Score: $2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}$

The majority class (“not woke”) performs well, the minority class (“woke”) performs poorly. Then: Micro F1 will be high (dominated by majority class). Macro F1 will be lower (because minority class F1 pulls down the average).

micro F1 does not penalize minority misclassifications more; it effectively weights by class frequency; macro F1 would only be higher if the minority class performed very well.

Logistic Regression

Naive Bayes: $\hat{c} = \arg\max_{c \in C} P(f, c)$ (generative model)

Logistic Regression: $\hat{c} = \arg\max_{c \in C} P(c | f)$ (discriminative model)

Logistic Regression:

classification: sigmoid ( $\hat{y} = \frac{1}{1 + e^{-z}}$ where $z = w^T x + b$ ) (logistic function, to convert to probability)
objective: cross-entropy loss
optimization: gradient descent (move opposite of gradient of loss function)

(Discrete) Cross-Entropy Loss (derived from Shannon Entropy):

$L_{CE}(\hat{y}, y) = -\sum_{i=1}^{|C|} y_i \log \hat{y}_i$

where $y$ is true probability distribution (one-hot vector) and $\hat{y}$ is predicted probability distribution of $C$ classes.

$w_{t+1} = w_t - \eta \frac{\partial}{\partial w} L_{CE}(\hat{y}, y)$

Embedding

Meaning of word defined by its context (distributional hypothesis)

The smaller the context (±1 − 3) , the more syntactic (First-order co-occurrence, syntagmatic) the representation ("wrote" to "book")
The larger the context (±4 − 10), the more semantic (Second-order co-occurrence, paradigmatic) the representation ("wrote" to "said")

Meaning of a word is a function of

global distribution of the type
immediate context of the token

$\cos(\vec{v}_1, \vec{v}_2) = \frac{\vec{v}_1 \cdot \vec{v}_2}{\|\vec{v}_1\| \|\vec{v}_2\|}$

Dense Vector: not produced by count

advantage: (1) capture synonymy, (2) generalize better, (3) easier to use
obtain: (1) Single Value Decomposition (SVD) (2) Brown clustering (3) Neural Language Model (Word2Vec, GloVe)

Skip-gram:

Objective: predict context given center word (input: one-hot, output: context probability)
Negative Sample: random noise
Advantage: faster than SVD in training, capture syntatic than CBOW

Continuous Bag of Words (CBOW):

Objective: predict center word given context
Advantage: faster than skip-gram in training, capture semantic than skip-gram

Feed-Forward Neural Networks

LR cannot model non-trivial interactions between features. (XOR is non-linear separable function) One hidden layer can solve XOR problem.

$\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^K e^{z_j}}$

where $K$ is the number of classes and $z_i$ is the input to the softmax function for class $i$ .

Sigmoid: binary classification. Softmax: multi-class.

FFNN:

$z = \sigma(\theta^{(x \to z)} x + b)$ where $\sigma$ is an activation function (e.g., ReLU, sigmoid)
$P(y | z) = \text{softmax}(\theta^{(z \to y)} z + b)$
this makes $P(y | x)$ non-linear function of $x$
Choices of activation function:
- Hyperbolic tangent (tanh): avoid saturation, 0 centered
- Rectified linear (ReLU): fast, avoid saturation
- Leaky ReLU, GELU

Regularization:

L1 Regularization: $L_1 = \lambda \sum_{i} |w_i|$ (encourage sparsity, or zero weights)
L2 Regularization: $L_2 = \lambda \sum_{i} w_i^2$ (encourage small magnitude)
Dropout: set weight to 0 with probability $p$ in training, normalize by $1/p$ in inference

LLM

Encoder: bidirectional, for classification Encoder-Decoder: machine translation, speech recognition Decoder-Only: unidirectional, for generation (Causal LLM, AR)

Decoding:

Greedy: not used, generic and repetitive, deterministic
Sampling: low-probability add up
Temperature: $y = \text{softmax}(z / T)$ , $T < 1$ makes distribution sharper.
Beam Search: keep top $k$ candidates at each step, output the one with highest overall probability
Nucleus Sampling: choose smallest set of tokens whose cumulative probability exceeds $p$ (e.g. 0.9), then sample from that set.
Top-k Sampling: choose top $k$ tokens with highest probability, then sample from that set.

Finetuning

continued pretraining: adapt parameter to new data
LoRA: add new parameter
MLM (mask some tokens and predict them): train classification head
Instruction Tuning

Sentence log likelihood: $LL(w_{1:n}) = \log \prod_{i=1}^n P(w_i | w_{1:i-1}) = \sum_{i=1}^n \log P(w_i | w_{1:i-1})$

Perplexity: $ppl(w_{1:n}) = \sqrt[n]{\prod_{i=1}^n \frac{1}{P(w_i | w_{1:i-1})}} = P(w_{1:n})^{-\frac{1}{n}}$ (inverse probability $\theta$ assign to test set, normalized by test set length, sensitive to length/tokenization)

Other concern: hallucination, privacy, bias, copyright, energy consumption, adult content, deduplication, dynamic data drift

Transformers

t_i^1 = LayerNorm(x)
t_i^2 = MultiHeadAttention(t_i^1, [x_1, ..., x_n])
t_i^3 = t_i^2 + x_i
t_i^4 = LayerNorm(t_i^3)
t_i^5 = FeedForward(t_i^4)
t_i^6 = t_i^5 + t_i^3

Post Training

Instruction Tuning: finetuning w/ in-context learning
Direct Preference Optimization (DPO)
RLHF
Test-Time Compute: in-context (prompt engineering, few-shot, CoT prompt)

Preference Tuning:

Reward Model: given ranking data, train classification head $r$ so that $r(x_1) > r(x_2)$ if $x_1$ is preferred over $x_2$ . ( $-\log \sigma(r(x_1) - r(x_2))$ )
Preference-Based Alignment: adding reward-based objective
Direct Preference Optimization: directly optimize for preference without reward model, by maximizing $\log \sigma(r(x_1) - r(x_2))$ where $r(x) = \log P(x) + \alpha \log \frac{P_\theta(x)}{P(x)}$ and $P$ is the original LM and $P_\theta$ is the finetuned LM. (if $\alpha = 0$ , then just maximize likelihood of preferred response; if $\alpha = \infty$ , then just maximize KL divergence between finetuned and original LM)

Encoder LLMs

ELMo: context-dependent embedding for each word interpolates representations for that word from each layer of stacked bidirectional LSTM.

BERT (Bidirectional Encoder Representations from Transformers): pretrain with masked language modeling (MLM) and next sentence prediction (NSP) objectives.

classification: use [CLS] token (1st token output whether the next sentence is a continuation of the previous sentence).
sentence-pair classification: use [SEP] token (cross-encoder) to separate sentences.
sequence labeling: use final tokens (or pool from multiple tokens).
reading comprehension (span prediction): predict start and end token of answer span using a portion of tokens.

Parameter-efficient fine-tuning:

Masking (Zhao et al. 2020): Simply learn a binary mask over all parameters!
Di Pruning (Guo et al. 2021): Learn a sparse additive mask over all parameters.
BitFit (Zaken et al. 2022): Fine-tune just the bias vectors (or a subset of them).
LoRA (Hu et al. 2021): Learn small additive weight.

Evaluation:

(Super)GLUE (General Language Understanding Evaluation): multi-task benchmark
SQuAD (Stanford Question Answering Dataset): reading comprehension
wikipedia, fiction is good data, web crawl is bad. task-aligned data is best.

Retrieval

Goal: assign score to each document (can be approximated as cosine similarity between query and document)

Sparse retrieval: represent query and document as sparse vectors (e.g. TF-IDF, or simple word count)
Dense retrieval: represent query and document as dense vectors (e.g. BERT)

Sparse Retrival

TF-IDF (Term Frequency-Inverse Document Frequency): don't let "the" dominate the similarity score

$tf(t, d) = \frac{f_{t, d}}{\sum_{t' \in d} f_{t', d}}$ where $f_{t, d}$ is the frequency of term $t$ in document $d$ (or $tf(t,d) = \log(f_{t, d} + 1)$ , with log to prevent domination by very frequent terms)
$idf(t, D) = \log \frac{N}{|\{d \in D : t \in d\}|}$ where $N$ is the total number of documents and $|\{d \in D : t \in d\}|$ is the number of documents containing term $t$
$tfidf(t, d, D) = tf(t, d) \cdot idf(t, D)$

Note that a document contain no query term will have 0 score. In practice, we store a mapping from term to documents (instead of document to term), and pre-computed TF.

When we increase rank, precision jumps while recall increases.

Mean Average Precision: $AP(q) = \frac{1}{R_r} \sum_{d \in R_r} \text{Precision}_r(d, q)$ where $R_r$ is the set of relevant documents at or above rank $r$ . $MAP = \frac{1}{|Q|} \sum_{q \in Q} AP(q)$ where $Q$ is the set of queries.

Dense Retrival

Single Encoder: $\text{score}(q, d) = \text{softmax}(\text{Linear}(\text{BERT}(q; [SEP]; d)[CLS]))$

Bi-Encoder: $\text{score}(q, d) = \text{cosine}(\text{BERT}_q(q)[CLS], \text{BERT}_d(d)[CLS])$ (cheaper, less accurate)

ColBERT: $\text{score}(q, d) = \sum_{i=1}^{|q|} \max_{j=1}^{|d|} \text{cosine}(\text{BERT}_q(q_i), \text{BERT}_d(d_j))$ (precompute each word vector in document)

Faiss: approximate nearest neighbor search in vector space.

RAG (Retrieval-Augmented Generation): retrieve $k$ document (by sparse/dense/LLM function calling agent), feed it to prompt

Comparison

Aspect	N-gram LM	Feed-Forward LM	Transformer-based Masked LM
Modeling Assumption	( (n-1) )-order Markov	Fixed window neural approximation	Global self-attention over sequence
Context Length	Fixed, short	Fixed, moderate	Variable, up to model limit
Long-Range Dependencies	❌ None beyond window	❌ None beyond window	✅ Modeled via attention
Representation Type	Discrete counts	Dense embeddings	Deep contextual embeddings
Parameter Sharing	Minimal	Moderate	Extensive (layered attention blocks)
Intrinsic Metric (Perplexity / CE)	High	Moderate	Very low (masked CE)
Data Efficiency (Small Data)	Strong (with smoothing)	Moderate	Weak without pretraining
Scaling with Data	Plateaus early	Improves steadily	Follows scaling laws; improves predictably
Transfer Learning	❌ None	Limited	✅ Strong fine-tuning capability
Zero-shot Generalization	❌	❌	Limited (better in autoregressive LLMs than MLMs)
Robustness to OOV	❌ Severe	Mitigated via embeddings	Subword tokenization mitigates strongly
Training Complexity	Counting	Backpropagation	Large-scale distributed optimization
Optimization Stability	Deterministic	Stable	Sensitive to LR schedules & scale
Parallelization (Training)	Trivial	Moderate	Highly parallelizable
Inference Parallelism	Trivial	Per-token sequential	Fully parallel across tokens (MLM)
Inference Latency	Very low	Low	Moderate–High
Memory Footprint (Parameters)	Large tables	Moderate	Very large (100M+)
Memory Footprint (Runtime)	Small	Moderate	High (activations + attention maps)
Time Complexity (Sequence length L)	( \mathcal{O}(1) ) per token	( \mathcal{O}(1) ) per token	( \mathcal{O}(L^2) ) attention
Hardware Requirements	CPU sufficient	CPU/GPU	GPU/TPU required
Interpretability	High (explicit counts)	Moderate	Low (distributed representations)
Calibration	Often poorly calibrated	Better	Often miscalibrated without tuning
Domain Adaptation Cost	Low (re-count)	Moderate retraining	Fine-tuning required
Catastrophic Forgetting	Not applicable	Moderate	Significant without care
Deployment Simplicity	Very simple	Moderate	Complex (tokenizer, large weights)
Energy Consumption	Minimal	Moderate	Very high
Carbon Footprint (Training)	Negligible	Moderate	Substantial
Regulatory / Explainability Risk	Low	Medium	High
Best Use Cases	Embedded systems, ASR fallback	Legacy MT, moderate NLP	Search, QA, classification, semantic tasks
Failure Mode	Data sparsity	Context truncation	Compute bottlenecks & memory limits

Table of Content