Minimum Edit Distance: minimum number of editing operations (insert, delete, subsitute) needed to transform one into the other.
D(i, j) = minimum edit distance between the first i characters of string 1 and the first j characters of string 2
Initialization: D(0, j) = j and D(i, 0) = i
Recurrence: D(i, j) = min(D(i-1, j) + 1, D(i, j-1) + 1, D(i-1, j-1) + cost) where cost = 0 if s_1[i] = s_2[j] else cost = 2
Time O(|s_1||s_2|), space O(|s_1||s_2|), backtrace O(|s_1| + |s_2|)

If we remember our choice, we can build backtrace to find the actual edits.
Heaps(Herdan's) Law: V = kN^\beta where V is the number of unique words in a corpus of size N, and k and \beta are constants. Typically, \beta \approx 0.5.
Morpheme: the smallest meaning-bearing unit in a language. For example, "cats" consists of two morphemes: "cat" (the root) and "-s" (the plural suffix).
Root: e.g., "work" in "working"
Affix: e.g., "-ed" in "worked"
Inflectional morphemes: modify a word's tense, number, etc. without changing its core meaning (e.g., "-ed" for past tense).
Derivational morphemes: create a new word with a new meaning (e.g., "careful" from "care").
Clitics: syntactically behave like words but are phonologically attached to another word (e.g., "'ve" in "I've").
Morphological typology: classifies languages based on their morphological structure, such as agglutinative (clear morpheme boundaries) vs. fusional (conflated morphemes).
Unicode:
ASCII: 7-bit encoding, 128 characters (95 printable) e.g. 0x00 to 0x7F
Unicode code points: e.g. U+0061 for 'a'. First 127 code points match ASCII.
UTF-8: directly store code point is long, so uses variable-length encoding:
len counts Unicode code points, not bytes.| Bytes | Byte pattern (binary) | Meaning |
|---|---|---|
| 1 | 0xxxxxxx |
ASCII (U+0000–U+007F) |
| 2 | 110xxxxx 10xxxxxx |
U+0080–U+07FF |
| 3 | 1110xxxx 10xxxxxx 10xxxxxx |
U+0800–U+FFFF |
| 4 | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
U+10000–U+10FFFF |
Byte Pair Encoding (BPE), Unigram language modeling tokenization:
Other methods:
Method: iteratively merge frequent pairs of tokens
Addition:
Encoder: greedily merge in the order we leaned
Corpora:
Probability Cheat Sheet:
Markov Assumption:P(\text{egg} | \text{I do not like green}) = \frac{\text{count}(\text{I do not like green egg})}{\text{count}(\text{I do not like green})} \simeq P(\text{egg} | \text{like green})
Unigram: P(w_1, ..., w_n) \approx \prod_{i=1}^n P(w_i) (assumes each word is independent)
Bigram: P(w_1, ..., w_n) \approx \prod_{i=1}^n P(w_i | w_{i-1}) (assumes each word depends only on the previous word)
N-gram: P(w_1, ..., w_n) \approx \prod_{i=1}^n P(w_i | w_{i-(n-1)}, ..., w_{i-1}) (assumes each word depends on the previous n-1 words)
Small n-gram fail because language has long-range dependencies
Bigram estimation of sentence probabilities: P(w_1, ..., w_n) \approx P(w_1 | <s>) \prod_{i=2}^n P(w_i | w_{i-1}) P(</s> | w_n) where <s> and </s> are start and end tokens.
Extrinsic evaluation: evaluate on a downstream task (e.g. machine translation, question answering, - time consuming)
Intrinsic evaluation: good model assigns a higher probability to the word that actually occurs.
Perplexity: ppl(w_1, ..., w_n) = 2^{-\frac{1}{n} \sum_{i=1}^n \log_2 P(w_i | w_{i-(n-1)}, ..., w_{i-1})}
lower perplexity = better model, minimizing perplexity is maximizing probability
ppl is average bits required to encode each word under this model (exponent is entropy: average negative log likelihood per word)
Prevent Zero Probabilities
Smoothing: This prevents zero probabilities for unseen bigrams. This introduce bias while correct variance of finite dataset.
Backoff: use bigram or unigram
Interpolation: mix bigram, unigram, trigram stats (better) P(w_i | w_{i-1}, w_{i-2}) = \lambda_1 P(w_i | w_{i-1}, w_{i-2}) + \lambda_2 P(w_i | w_{i-1}) + \lambda_3 P(w_i) where \lambda_1 + \lambda_2 + \lambda_3 = 1.
Extended interpolated Kneser-Ney smoothing: SOTA
Unseen words: choose fix vocab size V, and replace unknown words as <UNK> in training set and predict <UNK>. (or use sub-word solutions like BPE)
Likelihood: P(B | A), how likely we are to see the data if the hypothesis is true.
Prior: P(A), how likely we think the hypothesis is before seeing the data.
Marginal Likelihood: P(B), how likely we are to see the data under all possible hypotheses.
Since we only care about making \arg\max prediction, P(f) is constant across all classes, so we can ignore it:
Based on above, if a word w never appear in one class c, then all document with w will have zero probability by classified as c. Smoothing (at inference): P(w | c) = \frac{count(w, c) + \alpha}{(\sum_{w' \in V} count(w', c)) + \alpha V} where \alpha is a smoothing parameter (usually 1) and V is the vocabulary size.
Log probabilities are for numerical stability.
TP: correctly predicted positive cases TN: correctly predicted negative cases FP: incorrectly predicted positive cases (originally negative) FN: incorrectly predicted negative cases (originally positive)
Accuracy: \frac{TP + TN}{TP + TN + FP + FN} (minimize miss-classification) Precision: \frac{TP}{TP + FP} (how many correct that is predicted as positive) Recall: \frac{TP}{TP + FN} (how many correct that is actually positive) F1 Score: 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}
The majority class (“not woke”) performs well, the minority class (“woke”) performs poorly. Then: Micro F1 will be high (dominated by majority class). Macro F1 will be lower (because minority class F1 pulls down the average).

micro F1 does not penalize minority misclassifications more; it effectively weights by class frequency; macro F1 would only be higher if the minority class performed very well.
Naive Bayes: \hat{c} = \arg\max_{c \in C} P(f, c) (generative model)
Logistic Regression: \hat{c} = \arg\max_{c \in C} P(c | f) (discriminative model)
Logistic Regression:
classification: sigmoid (\hat{y} = \frac{1}{1 + e^{-z}} where z = w^T x + b) (logistic function, to convert to probability)
objective: cross-entropy loss
optimization: gradient descent (move opposite of gradient of loss function)
(Discrete) Cross-Entropy Loss (derived from Shannon Entropy):
where y is true probability distribution (one-hot vector) and \hat{y} is predicted probability distribution of C classes.
Meaning of word defined by its context (distributional hypothesis)
The smaller the context (±1 − 3) , the more syntactic (First-order co-occurrence, syntagmatic) the representation ("wrote" to "book")
The larger the context (±4 − 10), the more semantic (Second-order co-occurrence, paradigmatic) the representation ("wrote" to "said")
Meaning of a word is a function of
Dense Vector: not produced by count
advantage: (1) capture synonymy, (2) generalize better, (3) easier to use
obtain: (1) Single Value Decomposition (SVD) (2) Brown clustering (3) Neural Language Model (Word2Vec, GloVe)
Skip-gram:
Objective: predict context given center word (input: one-hot, output: context probability)
Negative Sample: random noise
Advantage: faster than SVD in training, capture syntatic than CBOW
Continuous Bag of Words (CBOW):
Objective: predict center word given context
Advantage: faster than skip-gram in training, capture semantic than skip-gram
LR cannot model non-trivial interactions between features. (XOR is non-linear separable function) One hidden layer can solve XOR problem.
where K is the number of classes and z_i is the input to the softmax function for class i.
Sigmoid: binary classification. Softmax: multi-class.
FFNN:
z = \sigma(\theta^{(x \to z)} x + b) where \sigma is an activation function (e.g., ReLU, sigmoid)
P(y | z) = \text{softmax}(\theta^{(z \to y)} z + b)
this makes P(y | x) non-linear function of x
Choices of activation function:
Regularization:
L1 Regularization: L_1 = \lambda \sum_{i} |w_i| (encourage sparsity, or zero weights)
L2 Regularization: L_2 = \lambda \sum_{i} w_i^2 (encourage small magnitude)
Dropout: set weight to 0 with probability p in training, normalize by 1/p in inference
Encoder: bidirectional, for classification Encoder-Decoder: machine translation, speech recognition Decoder-Only: unidirectional, for generation (Causal LLM, AR)
Decoding:
Greedy: not used, generic and repetitive, deterministic
Sampling: low-probability add up
Temperature: y = \text{softmax}(z / T), T < 1 makes distribution sharper.
Beam Search: keep top k candidates at each step, output the one with highest overall probability
Nucleus Sampling: choose smallest set of tokens whose cumulative probability exceeds p (e.g. 0.9), then sample from that set.
Top-k Sampling: choose top k tokens with highest probability, then sample from that set.
Finetuning
Sentence log likelihood: LL(w_{1:n}) = \log \prod_{i=1}^n P(w_i | w_{1:i-1}) = \sum_{i=1}^n \log P(w_i | w_{1:i-1})
Perplexity: ppl(w_{1:n}) = \sqrt[n]{\prod_{i=1}^n \frac{1}{P(w_i | w_{1:i-1})}} = P(w_{1:n})^{-\frac{1}{n}} (inverse probability \theta assign to test set, normalized by test set length, sensitive to length/tokenization)
Other concern: hallucination, privacy, bias, copyright, energy consumption, adult content, deduplication, dynamic data drift


t_i^1 = LayerNorm(x)
t_i^2 = MultiHeadAttention(t_i^1, [x_1, ..., x_n])
t_i^3 = t_i^2 + x_i
t_i^4 = LayerNorm(t_i^3)
t_i^5 = FeedForward(t_i^4)
t_i^6 = t_i^5 + t_i^3
Post Training
Instruction Tuning: finetuning w/ in-context learning
Direct Preference Optimization (DPO)
RLHF
Test-Time Compute: in-context (prompt engineering, few-shot, CoT prompt)
Preference Tuning:
Reward Model: given ranking data, train classification head r so that r(x_1) > r(x_2) if x_1 is preferred over x_2. (-\log \sigma(r(x_1) - r(x_2)))
Preference-Based Alignment: adding reward-based objective
Direct Preference Optimization: directly optimize for preference without reward model, by maximizing \log \sigma(r(x_1) - r(x_2)) where r(x) = \log P(x) + \alpha \log \frac{P_\theta(x)}{P(x)} and P is the original LM and P_\theta is the finetuned LM. (if \alpha = 0, then just maximize likelihood of preferred response; if \alpha = \infty, then just maximize KL divergence between finetuned and original LM)
ELMo: context-dependent embedding for each word interpolates representations for that word from each layer of stacked bidirectional LSTM.
BERT (Bidirectional Encoder Representations from Transformers): pretrain with masked language modeling (MLM) and next sentence prediction (NSP) objectives.
classification: use [CLS] token (1st token output whether the next sentence is a continuation of the previous sentence).
sentence-pair classification: use [SEP] token (cross-encoder) to separate sentences.
sequence labeling: use final tokens (or pool from multiple tokens).
reading comprehension (span prediction): predict start and end token of answer span using a portion of tokens.
Parameter-efficient fine-tuning:
Masking (Zhao et al. 2020): Simply learn a binary mask over all parameters!
Di Pruning (Guo et al. 2021): Learn a sparse additive mask over all parameters.
BitFit (Zaken et al. 2022): Fine-tune just the bias vectors (or a subset of them).
LoRA (Hu et al. 2021): Learn small additive weight.
Evaluation:
(Super)GLUE (General Language Understanding Evaluation): multi-task benchmark
SQuAD (Stanford Question Answering Dataset): reading comprehension
wikipedia, fiction is good data, web crawl is bad. task-aligned data is best.
Goal: assign score to each document (can be approximated as cosine similarity between query and document)
Sparse retrieval: represent query and document as sparse vectors (e.g. TF-IDF, or simple word count)
Dense retrieval: represent query and document as dense vectors (e.g. BERT)
TF-IDF (Term Frequency-Inverse Document Frequency): don't let "the" dominate the similarity score
tf(t, d) = \frac{f_{t, d}}{\sum_{t' \in d} f_{t', d}} where f_{t, d} is the frequency of term t in document d (or tf(t,d) = \log(f_{t, d} + 1), with log to prevent domination by very frequent terms)
idf(t, D) = \log \frac{N}{|\{d \in D : t \in d\}|} where N is the total number of documents and |\{d \in D : t \in d\}| is the number of documents containing term t
tfidf(t, d, D) = tf(t, d) \cdot idf(t, D)
Note that a document contain no query term will have 0 score. In practice, we store a mapping from term to documents (instead of document to term), and pre-computed TF.
When we increase rank, precision jumps while recall increases.
Mean Average Precision: AP(q) = \frac{1}{R_r} \sum_{d \in R_r} \text{Precision}_r(d, q) where R_r is the set of relevant documents at or above rank r. MAP = \frac{1}{|Q|} \sum_{q \in Q} AP(q) where Q is the set of queries.
Single Encoder: \text{score}(q, d) = \text{softmax}(\text{Linear}(\text{BERT}(q; [SEP]; d)[CLS]))
Bi-Encoder: \text{score}(q, d) = \text{cosine}(\text{BERT}_q(q)[CLS], \text{BERT}_d(d)[CLS]) (cheaper, less accurate)
ColBERT: \text{score}(q, d) = \sum_{i=1}^{|q|} \max_{j=1}^{|d|} \text{cosine}(\text{BERT}_q(q_i), \text{BERT}_d(d_j)) (precompute each word vector in document)
Faiss: approximate nearest neighbor search in vector space.
RAG (Retrieval-Augmented Generation): retrieve k document (by sparse/dense/LLM function calling agent), feed it to prompt
| Aspect | N-gram LM | Feed-Forward LM | Transformer-based Masked LM |
|---|---|---|---|
| Modeling Assumption | ( (n-1) )-order Markov | Fixed window neural approximation | Global self-attention over sequence |
| Context Length | Fixed, short | Fixed, moderate | Variable, up to model limit |
| Long-Range Dependencies | ❌ None beyond window | ❌ None beyond window | ✅ Modeled via attention |
| Representation Type | Discrete counts | Dense embeddings | Deep contextual embeddings |
| Parameter Sharing | Minimal | Moderate | Extensive (layered attention blocks) |
| Intrinsic Metric (Perplexity / CE) | High | Moderate | Very low (masked CE) |
| Data Efficiency (Small Data) | Strong (with smoothing) | Moderate | Weak without pretraining |
| Scaling with Data | Plateaus early | Improves steadily | Follows scaling laws; improves predictably |
| Transfer Learning | ❌ None | Limited | ✅ Strong fine-tuning capability |
| Zero-shot Generalization | ❌ | ❌ | Limited (better in autoregressive LLMs than MLMs) |
| Robustness to OOV | ❌ Severe | Mitigated via embeddings | Subword tokenization mitigates strongly |
| Training Complexity | Counting | Backpropagation | Large-scale distributed optimization |
| Optimization Stability | Deterministic | Stable | Sensitive to LR schedules & scale |
| Parallelization (Training) | Trivial | Moderate | Highly parallelizable |
| Inference Parallelism | Trivial | Per-token sequential | Fully parallel across tokens (MLM) |
| Inference Latency | Very low | Low | Moderate–High |
| Memory Footprint (Parameters) | Large tables | Moderate | Very large (100M+) |
| Memory Footprint (Runtime) | Small | Moderate | High (activations + attention maps) |
| Time Complexity (Sequence length L) | ( \mathcal{O}(1) ) per token | ( \mathcal{O}(1) ) per token | ( \mathcal{O}(L^2) ) attention |
| Hardware Requirements | CPU sufficient | CPU/GPU | GPU/TPU required |
| Interpretability | High (explicit counts) | Moderate | Low (distributed representations) |
| Calibration | Often poorly calibrated | Better | Often miscalibrated without tuning |
| Domain Adaptation Cost | Low (re-count) | Moderate retraining | Fine-tuning required |
| Catastrophic Forgetting | Not applicable | Moderate | Significant without care |
| Deployment Simplicity | Very simple | Moderate | Complex (tokenizer, large weights) |
| Energy Consumption | Minimal | Moderate | Very high |
| Carbon Footprint (Training) | Negligible | Moderate | Substantial |
| Regulatory / Explainability Risk | Low | Medium | High |
| Best Use Cases | Embedded systems, ASR fallback | Legacy MT, moderate NLP | Search, QA, classification, semantic tasks |
| Failure Mode | Data sparsity | Context truncation | Compute bottlenecks & memory limits |
Table of Content