utterance: "I do uh main- mainly business data processing" (contains disfluencies: the filled pause "uh" and the fragment "main-")
type (V): an element of the vocabulary V
instance (N): an instance of that type in running text
"I'm" is orthographically one word, but grammatically two words
| Corpus | Types \|V\| | Instances N |
|---|---|---|
| Shakespeare | 31,000 | 884,000 |
| Brown Corpus | 38,000 | 1 million |
| Switchboard conversations | 20,000 | 2.4 million |
| COCA | 2 million | 440 million |
| Google N-grams | 13+ million | 1 trillion |
Heaps' Law (also called Herdan's Law): |V| = kN^\beta, where k > 0 and 0 < \beta < 1 (roughly \beta \approx 0.5)
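To make the law concrete, here is a minimal sketch that evaluates |V| = kN^\beta. The function name `heaps_vocab_size` and the value of k are assumptions for illustration only: k is chosen so that the curve passes through the Brown Corpus row of the table, with \beta fixed at 0.5.

```python
def heaps_vocab_size(n_instances, k, beta=0.5):
    """Predicted number of types |V| for a corpus with N instances."""
    return k * n_instances ** beta

# Illustrative assumption: fit k to the Brown Corpus row
# (|V| = 38,000 at N = 1,000,000) with beta = 0.5.
k_brown = 38_000 / 1_000_000 ** 0.5

# With beta = 0.5, quadrupling N only doubles the predicted vocabulary.
predicted = heaps_vocab_size(4_000_000, k_brown)
print(round(predicted))  # 76000
```

The other rows of the table do not lie on this particular curve, which is expected: k and \beta depend on the corpus and genre.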
Function words: of, the, is, and, una, 是
Content words: mango, braise, snowy, feliz, 北京

Morpheme: a minimal meaning-bearing unit in a language (cats: two morphemes cat and –s)
Morphology: the study of morphemes
root: central morpheme of the word (e.g. work, camera)
affix: a morpheme attached to the root that adds additional meaning (e.g. -ed, -s)
Inflectional morphemes:
"-ed": past tense on verbs
"-s/-es": plural on nouns
Derivational morphemes:
"-ful": turns a noun into an adjective (care → careful)
"-ly": turns an adjective into an adverb (careful → carefully)
Clitic: a morpheme that acts syntactically like a word but is attached to another word (e.g. 've in I've)
Morphological typology: a way of classifying the languages of the world that groups languages according to their common morphological structures.

Agglutinative languages: Very clean boundaries between morphemes
Fusional languages: a single affix may conflate multiple morphemes (e.g. English, where "-s" on a verb means both "third person singular" and "present tense"; Russian)
ASCII: the high bit is set to 0, so 7 bits = 128 characters, of which only 95 are printable. The rest are control characters originally meant for teletypes.
Unicode: has about 150,000 characters. Each character is identified by a code point (1.1 million possible, but only about 150,000 assigned), e.g. U+0061. The first 128 code points (U+0000 to U+007F) are ASCII. A code point has no visual form; it is not a glyph (so the same emoji can look different on different operating systems).
UTF-8 encoded lengths:
The first 128 code points (ASCII) map to 1 byte
Most remaining characters in European, Middle Eastern, and African scripts map to 2 bytes
Most Chinese, Japanese, and Korean characters map to 3 bytes
Rarer CJKV characters and emojis/symbols map to 4 bytes
Python: len() on a str returns the number of Unicode code points, not the number of bytes
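A small sketch of the difference between code points and UTF-8 bytes in Python. The four sample characters are arbitrary picks, one for each of the byte lengths listed above; they are written as escapes so the example does not depend on the file's own encoding.

```python
# One sample character per UTF-8 length: a, é, 中, 😀
samples = ["\u0061", "\u00e9", "\u4e2d", "\U0001F600"]

for ch in samples:
    encoded = ch.encode("utf-8")
    # len(ch) counts code points; len(encoded) counts bytes
    print(f"U+{ord(ch):04X}  code points={len(ch)}  utf-8 bytes={len(encoded)}")
```

Each string has length 1 as a `str`, while the encoded forms take 1, 2, 3, and 4 bytes respectively.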
UTF-8: storing every code point directly at a fixed width would waste storage space, so UTF-8 uses a variable-length encoding
| Bytes | Byte pattern (binary) | Code point range |
|---|---|---|
| 1 | 0xxxxxxx | U+0000–U+007F (ASCII) |
| 2 | 110xxxxx 10xxxxxx | U+0080–U+07FF |
| 3 | 1110xxxx 10xxxxxx 10xxxxxx | U+0800–U+FFFF |
| 4 | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | U+10000–U+10FFFF |
Given a byte:
If it starts with 0 → 1-byte char.
If it starts with 110 → expect 1 continuation byte.
If it starts with 1110 → expect 2 continuation bytes.
If it starts with 11110 → expect 3 continuation bytes.
If it starts with 10 unexpectedly → it's in the middle of a multi-byte sequence (not a valid start).
This makes UTF-8 robust against errors: random access works and desynchronization is recoverable, because the nearest character boundary can always be found by moving back at most 3 bytes.
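The byte rules above can be sketched directly with bit shifts. The helper names `sequence_length` and `char_start` are made up for this illustration, and the sketch assumes the input is valid UTF-8 (it does not check for overlong or otherwise malformed sequences).

```python
def sequence_length(byte):
    """Total bytes in the sequence started by `byte`; 0 for a continuation byte."""
    if byte >> 7 == 0b0:
        return 1          # 0xxxxxxx: 1-byte (ASCII) character
    if byte >> 5 == 0b110:
        return 2          # 110xxxxx: 1 continuation byte follows
    if byte >> 4 == 0b1110:
        return 3          # 1110xxxx: 2 continuation bytes follow
    if byte >> 3 == 0b11110:
        return 4          # 11110xxx: 3 continuation bytes follow
    if byte >> 6 == 0b10:
        return 0          # 10xxxxxx: middle of a sequence, not a valid start
    raise ValueError(f"invalid UTF-8 byte: {byte:#04x}")

def char_start(data, i):
    """Index of the first byte of the character that contains position i."""
    while i > 0 and data[i] >> 6 == 0b10:
        i -= 1            # step back over continuation bytes (at most 3)
    return i

data = "a\u4e2d\U0001F600".encode("utf-8")   # 1 + 3 + 4 = 8 bytes
start = char_start(data, 7)                  # land in the middle of the emoji
print(start, sequence_length(data[start]))   # 4 4
```

Jumping to an arbitrary byte offset and calling `char_start` is the "random access" property: no scan from the beginning of the buffer is needed.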
No such thing as a text file without an encoding
Tokenization:
white-space / orthographic words: undefined for many languages, unbounded vocabulary size
Unicode characters: too small
morphemes: hard to define
subword tokens (what tokenizers use in practice): deterministic; eliminates the problem of unknown words
Byte-Pair Encoding (BPE): start from individual characters and iteratively merge the most frequent pair of neighboring tokens into a new, longer token. The number of merges k must be specified. Merges do not cross whitespace (the whitespace is attached to the start of each word). To encode new text, apply the learned merges in the order they were learned.
Most BPE tokens are spent on English, leaving fewer for other languages
Words in other languages are therefore often split into many tokens
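A minimal character-level sketch of BPE training and encoding, under some simplifying assumptions: the text is pretokenized on spaces only, every word (including the first) gets a leading space, and ties between equally frequent pairs are broken by whichever pair was seen first. The function names `merge_pair`, `train_bpe`, and `encode` are illustrative, not from any library.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)

def train_bpe(text, k):
    """Learn up to k merges. Words are kept separate, so no merge crosses whitespace."""
    words = Counter(tuple(" " + w) for w in text.split())
    merges = []
    for _ in range(k):
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:                        # every word is already a single token
            break
        best = max(pairs, key=pairs.get)     # most frequent neighboring pair
        merges.append(best)
        new_words = Counter()
        for symbols, freq in words.items():
            new_words[merge_pair(symbols, best)] += freq
        words = new_words
    return merges

def encode(word, merges):
    """Tokenize one (space-prefixed) word by replaying the merges in learned order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

corpus = ("low low low low low lowest lowest newer newer newer "
          "newer newer newer wider wider wider new new")
merges = train_bpe(corpus, k=8)
print(merges)
print(encode(" lower", merges))
```

Because encoding only replays merges over characters, any string can be tokenized, falling back to single characters in the worst case. Production tokenizers typically start from bytes rather than characters.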
Unigram language modeling tokenization (available in the SentencePiece library)
Corpora
Language: English
Language Variety: African American English
Genre: Wikipedia
Author Demographics: writer's gender
Code Switching: using multiple languages in the same utterance
Regular Expression: re.search(pattern, string) returns the first match of pattern in string, or None if there is no match
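A short sketch of `re.search` with the standard-library `re` module. The pattern and sample string are arbitrary choices for illustration.

```python
import re

# Search for the first run of digits anywhere in the string.
m = re.search(r"\d+", "We're 350 dogs!")
print(m.group(), m.span())   # 350 (6, 9)

# When nothing matches, re.search returns None rather than raising.
print(re.search(r"\d+", "no digits here"))   # None
```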
Pretokenization for BPE: r"'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"
This is quite a complex regular expression, and it also makes use of some advanced Unicode-related features we haven't described yet. These features are part of a popular external Python 3 library called regex (as opposed to the built-in Python library called re).
The Python regex (as opposed to re) library has special \p and \P operators:
\p{L} matches any Unicode letter,
\P{L} matches any non-letter,
\p{N} matches any number,
\P{N} matches any non-number.
>>> import regex as re
>>> pat = re.compile(
... # Contractions: 't and 'm are tokens
... r"'s|'t|'re|'ve|'m|'ll|'d|"
... # Words: sequence of Unicode letters (after optional space)
... r" ?\p{L}+|"
... # Numbers: sequence of digits (after optional space)
... r" ?\p{N}+|"
... # Punctuation: sequence of non-alphanumeric/non-space
... # (after optional space)
... r" ?[^\s\p{L}\p{N}]+|"
... # whitespace
... r"\s+(?!\S)|\s+"
... )
>>> text = "We're 350 dogs! Um, lunch?"
>>> print(pat.findall(text))
['We', "'re", ' 350', ' dogs', '!', ' Um', ',', ' lunch', '?']
SuperBPE: runs a second stage of BPE that allows merges across spaces and punctuation.

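Rule-based tokenization: less common, but used when we need to obtain grammatical words
Punctuation is mostly, but not always, separated from words: it must be kept inside tokens such as "m.p.h.", "Ph.D.", "AT&T", "cap'n", and inside prices, dates, URLs, hashtags, and email addresses
Number formats differ by language: English 555,500.50 = French 555 500,50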
Penn Treebank Tokenization Standard (Tokenization in NLTK):
>>> import nltk
>>> text = 'That U.S.A. poster-print costs $12.40...'
>>> pattern = r'''(?x) # set flag to allow verbose regexps
... (?:[A-Z]\.)+ # abbreviations, e.g. U.S.A.
... | \w+(?:-\w+)* # words with optional internal hyphens
... | \$?\d+(?:\.\d+)?%? # currency, percentages, e.g. $12.40, 82%
... | \.\.\. # ellipsis
... | [][.,;"'?():_`-] # these are separate tokens; includes ], [
... '''
>>> nltk.regexp_tokenize(text, pattern)
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
Sentence segmentation, common algorithm: tokenize first, using rules or ML to classify each period as either (a) part of a word or (b) a sentence boundary.
Sentences can then often be segmented by rules based on this tokenization.
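A rough rule-based sketch of this two-step idea. The token pattern is a deliberately simplified assumption (it keeps capital-letter abbreviations like "U.S.A." as single tokens and does not handle prices or lowercase abbreviations), and `split_sentences` is an illustrative name.

```python
import re

# Step 1: tokenize so that periods inside abbreviations stay part of the word.
TOKEN = re.compile(r"(?:[A-Z]\.)+|\w+(?:-\w+)*|[.?!]|\S")

def split_sentences(text):
    """Step 2: end a sentence at every standalone '.', '?' or '!' token."""
    sentences, current = [], []
    for tok in TOKEN.findall(text):
        current.append(tok)
        if tok in {".", "?", "!"}:
            sentences.append(current)
            current = []
    if current:                  # trailing tokens without final punctuation
        sentences.append(current)
    return sentences

for sent in split_sentences("The U.S.A. is big. Is it? Yes!"):
    print(sent)
```

A known weakness of this rule set: a sentence ending in a single capital letter ("So did I.") is treated as an abbreviation, which is the kind of case where a learned classifier helps.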
Space-based tokenization: segment, sort, and count
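The segment, sort, and count steps can be sketched in a few lines of Python. This version makes its own assumptions: words are maximal runs of ASCII letters, and everything is lowercased before counting. `word_counts` is an illustrative name.

```python
import re
from collections import Counter

def word_counts(text):
    """Segment into lowercase alphabetic words, then count and sort by frequency."""
    words = re.findall(r"[A-Za-z]+", text.lower())   # segment
    return Counter(words).most_common()              # count, sorted high to low

counts = word_counts("The cat and the hat. THE end")
print(counts[0])     # ('the', 3)
print(len(counts))   # 5 types among 7 instances
```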