Lecture 002

utterance: "I do uh main- mainly business data processing" (contains disfluencies: the filler "uh" and the fragment "main-")

type (V): an element of the vocabulary V

instance (N): an instance of that type in running text

"I'm" is orthographically one word, but grammatically two words

Corpus	$\mid V \mid$ (Types)	$N$ (Instances)
Shakespeare	31,000	884,000
Brown Corpus	38,000	1 million
Switchboard conversations	20,000	2.4 million
COCA	2 million	440 million
Google N-grams	13+ million	1 trillion

Heaps' Law = Herdan's Law: $|V| = kN^\beta$ (where $k$ and $\beta$ are positive constants, $0 < \beta < 1$, and $\beta \approx 0.5$)
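A minimal sketch of the formula above, to show how slowly the vocabulary grows relative to the corpus. The function name `heaps_vocab_size` and the value `k = 30` are assumptions chosen for illustration; they are not fitted to any of the corpora in the table.

```python
def heaps_vocab_size(n_instances: int, k: float, beta: float) -> float:
    """Predicted number of types |V| for a corpus of N instances: |V| = k * N**beta."""
    return k * n_instances ** beta


# k = 30 is an illustrative constant, not an estimate from real data.
for n in (10_000, 1_000_000, 100_000_000):
    print(n, round(heaps_vocab_size(n, k=30, beta=0.5)))
```

With $\beta = 0.5$, multiplying $N$ by 100 multiplies the predicted $|V|$ by only 10.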

Function words: of, the, is, and, una, 是

Content words: mango, braise, snowy, feliz, 北京

Morpheme: a minimal meaning-bearing unit in a language (cats: two morphemes cat and –s)

Morphology: the study of morphemes

root: central morpheme of the word (e.g. work, camera)

affix: a morpheme that adds additional meaning to the root (e.g. -ed, -s)

Inflectional morphemes:

"-ed": past tense on verbs
"-s/-es": plural on nouns

Derivational morphemes:

"-ful" (care-ful): forms an adjective
"-ly" (care-ful-ly): forms an adverb

Clitic: a morpheme that acts syntactically like a word but is reduced in form and attached to another word (e.g. 've in I've)

Morphological typology: a way of classifying the languages of the world that groups languages according to their common morphological structures.

Agglutinative languages: Very clean boundaries between morphemes

Fusional languages: a single affix may conflate multiple morphemes (e.g. Russian; also English "-s" on verbs, which marks both "third person singular" and "present tense")

ASCII: the high bit of each byte is set to 0, leaving 7 bits = 128 characters, of which only 95 are printable. The remaining 33 are control characters, originally for teletypes.

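Unicode: defines about 150,000 characters. Each character is identified by a code point, e.g. U+0061 for "a" (about 1.1 million code points are possible, but only about 150,000 are assigned). The first 128 code points (U+0000–U+007F) are the same as ASCII. A code point has no visual form; it is not a glyph, so the same emoji can look different on different operating systems.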

The first 128 code points (ASCII) map to 1 byte
Most remaining characters in European, Middle Eastern, and African scripts map to 2 bytes
Most Chinese, Japanese, and Korean characters map to 3 bytes
Rarer CJKV characters and emojis/symbols map to 4 bytes

Python's len() on a str returns the number of Unicode code points, not the number of bytes or glyphs
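A short check of the byte counts and of `len` described above. The four sample characters are arbitrary picks, one per byte-length class, written as escapes so the code points are unambiguous.

```python
# One sample per UTF-8 length class: "a", "é", "中", and an emoji.
samples = ["a", "\u00e9", "\u4e2d", "\U0001F600"]

for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}  code points={len(ch)}  utf-8 bytes={len(encoded)}")

# A visually single character can still be two code points:
# "e" followed by a combining acute accent (U+0301).
combined = "e\u0301"
print(len(combined), len(combined.encode("utf-8")))
```

Each sample has `len` 1 as a `str`, while the encoded lengths are 1, 2, 3, and 4 bytes.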

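UTF-8: storing every code point directly in a fixed-width form (4 bytes each, as in UTF-32) wastes storage space, so UTF-8 uses a variable-length encoding of 1 to 4 bytes per code point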

Bytes	Byte pattern (binary)	Meaning
1	`0xxxxxxx`	ASCII (U+0000–U+007F)
2	`110xxxxx 10xxxxxx`	U+0080–U+07FF
3	`1110xxxx 10xxxxxx 10xxxxxx`	U+0800–U+FFFF
4	`11110xxx 10xxxxxx 10xxxxxx 10xxxxxx`	U+10000–U+10FFFF
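The byte patterns in the table can be applied by hand. The sketch below is a hypothetical helper, `utf8_encode`, that fills the `x` bits from a code point and compares the result with Python's built-in encoder. It is for illustration only: unlike a real encoder, it does not reject surrogate code points (U+D800–U+DFFF).

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point by filling the x bits of the UTF-8 byte patterns."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("code point out of range")
    if cp <= 0x7F:        # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:       # 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (cp >> 6),
                      0b10000000 | (cp & 0x3F)])
    if cp <= 0xFFFF:      # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (cp >> 12),
                      0b10000000 | ((cp >> 6) & 0x3F),
                      0b10000000 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0b11110000 | (cp >> 18),
                  0b10000000 | ((cp >> 12) & 0x3F),
                  0b10000000 | ((cp >> 6) & 0x3F),
                  0b10000000 | (cp & 0x3F)])


for ch in "a\u00e9\u4e2d\U0001F600":
    print(f"U+{ord(ch):04X}", utf8_encode(ord(ch)).hex(" "),
          utf8_encode(ord(ch)) == ch.encode("utf-8"))
```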

Given a byte:

If it starts with 0 → 1-byte char.
If it starts with 110 → expect 1 continuation byte.
If it starts with 1110 → expect 2 continuation bytes.
If it starts with 11110 → expect 3 continuation bytes.
If it starts with 10 unexpectedly → it's in the middle of a multi-byte sequence (not a valid start).

This makes UTF-8 self-synchronizing and robust against errors: a decoder can start at an arbitrary byte offset, and a desync is recoverable, because the nearest character boundary is always found by moving back at most 3 bytes.
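The lead-byte rules and the resynchronization property can be sketched in a few lines. The helper names `sequence_length` and `char_start` are assumptions made for this example, and the code assumes the input is valid UTF-8.

```python
def sequence_length(lead: int) -> int:
    """Total bytes in the sequence announced by a lead byte; 0 if not a valid lead."""
    if lead >> 7 == 0b0:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    return 0  # 10xxxxxx continuation byte (or an invalid 11111xxx byte)


def char_start(data: bytes, i: int) -> int:
    """Step back from index i to the first byte of the character that contains it."""
    while i > 0 and (data[i] & 0b11000000) == 0b10000000:
        i -= 1
    return i


data = "a\u4e2db".encode("utf-8")   # 1-byte, 3-byte, 1-byte characters
start = char_start(data, 3)         # index 3 is the last continuation byte
print(start, sequence_length(data[start]))
```

Landing on index 3, in the middle of the 3-byte character, takes two steps back to reach its lead byte at index 1.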

No such thing as a text file without an encoding

Tokenization:

white-space / orthographic words: undefined for many languages, and the vocabulary size is unbounded
Unicode characters: too small
morphemes: hard to define
subword tokens (learned from data): tokenization is deterministic and eliminates the problem of unknown words

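Byte-Pair Encoding (BPE): iteratively merge the most frequent pair of neighboring tokens to create longer tokens, starting from individual characters or bytes. The number of merges $k$ must be specified. Merges do not cross whitespace (the whitespace is attached to the start of each string). To encode new text, apply the learned merges in the order in which they were learned.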
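A minimal character-level sketch of BPE training and encoding, under stated assumptions: words are split with the simple pattern `" ?\S+"` so that whitespace attaches to the start of each word, and ties between equally frequent pairs are broken alphabetically (an arbitrary choice). The names `merge_pair`, `train_bpe`, and `encode` are hypothetical; production tokenizers work on bytes and use far more efficient data structures.

```python
import re
from collections import Counter


def merge_pair(symbols: tuple, pair: tuple) -> tuple:
    """Replace every adjacent occurrence of `pair` in `symbols` with the merged token."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(text: str, k: int) -> list:
    """Learn up to k merges; merges never cross whitespace-delimited words."""
    words = Counter(tuple(w) for w in re.findall(r" ?\S+", text))
    merges = []
    for _ in range(k):
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; ties broken alphabetically (arbitrary choice).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged_words = Counter()
        for symbols, freq in words.items():
            merged_words[merge_pair(symbols, best)] += freq
        words = merged_words
    return merges


def encode(text: str, merges: list) -> list:
    """Tokenize new text by applying the merges in the order they were learned."""
    tokens = []
    for w in re.findall(r" ?\S+", text):
        symbols = tuple(w)
        for pair in merges:
            symbols = merge_pair(symbols, pair)
        tokens.extend(symbols)
    return tokens


merges = train_bpe("low low low lower lowest", k=3)
print(merges)
print(encode("lower low", merges))
```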

Most BPE tokens used for English, leaving less for other languages
Words in other languages are often split up

Unigram language modeling tokenization (implemented in SentencePiece)

Corpora

Language: English
Language Variety: African American English
Genre: Wikipedia
Author Demographics: writer's gender
Code Switching: using multiple languages in the same utterance

Regular Expression: re.search(pattern,string)

r"regex": a raw string, in which backslashes are treated as literal characters
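A small illustration of `re.search` and of why patterns are usually written as raw strings. The sample sentence and patterns are made up for the example.

```python
import re

text = "The Brown Corpus has 38,000 types."

# re.search returns a match object for the first match, or None.
match = re.search(r"\d+(?:,\d{3})*", text)
print(match.group())

# In a raw string the backslash stays a literal character.
print(len(r"\n"), len("\n"))

# r"\b" is the regex word boundary; "\b" in a normal string is a backspace character.
print(re.search(r"\bthe\b", "other the") is not None)
print(re.search("\bthe\b", "other the") is not None)
```

The last two lines differ only in the `r` prefix: the raw pattern matches the standalone word "the", while the non-raw pattern looks for literal backspace characters and finds nothing.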

Pretokenization for BPE: r"'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"

This is quite a complex regular expression, and it also makes use of some advanced Unicode-related features we haven't described yet. These features are part of a popular external Python 3 library called regex (as opposed to the built-in Python library called re).

The Python regex (as opposed to re) library has special \p and \P operators:

\p{L} matches any Unicode letter,
\P{L} matches any non-letter,
\p{N} matches any number,
\P{N} matches any non-number.

>>> import regex as re
>>> pat = re.compile(
... # Contractions: 't and 'm are tokens
... r"'s|'t|'re|'ve|'m|'ll|'d|"
... # Words: sequence of Unicode letters (after optional space)
... r" ?\p{L}+|"
... # Numbers: sequence of digits (after optional space)
... r" ?\p{N}+|"
... # Punctuation: sequence of non-alphanumeric/non-space
... # (after optional space)
... r" ?[^\s\p{L}\p{N}]+|"
... # whitespace
... r"\s+(?!\S)|\s+"
... )
>>> text = "We're 350 dogs! Um, lunch?"
>>> print(pat.findall(text))
['We', "'re", ' 350', ' dogs', '!', ' Um', ',', ' lunch', '?']
>>>

SuperBPE: runs a second stage of BPE that allows merges across spaces and punctuation.

Rule-based tokenization: uncommon, but used to obtain grammatical words

Punctuation is mostly, but not always, removed: it is kept inside tokens such as m.p.h., Ph.D., AT&T, and cap’n, and in prices, dates, URLs, hashtags, and email addresses
English 555,500.50 = French 555 500,50

Penn Treebank Tokenization Standard (Tokenization in NLTK):

>>> import nltk
>>> text = 'That U.S.A. poster-print costs $12.40...'
>>> pattern = r'''(?x) # set flag to allow verbose regexps
... (?:[A-Z]\.)+ # abbreviations, e.g. U.S.A.
... | \w+(?:-\w+)* # words with optional internal hyphens
... | \$?\d+(?:\.\d+)?%? # currency, percentages, e.g. $12.40, 82%
... | \.\.\. # ellipsis
... | [][.,;"'?():_`-] # these are separate tokens; includes ], [
... '''
>>> nltk.regexp_tokenize(text, pattern)
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']

Sentence segmentation, common algorithm: tokenize first, using rules or ML to classify each period as either (a) part of the word or (b) a sentence boundary

Sentence segmentation can then often be done by rules based on this tokenization.
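A rough rule-based sketch of this two-step idea: whitespace tokenization first, then a rule that treats a token-final ".", "!", or "?" as a sentence boundary unless the token is a known abbreviation. The `ABBREVIATIONS` set and the function name `split_sentences` are assumptions made for illustration. The sketch misses a boundary whenever a sentence ends in an abbreviation.

```python
# Hypothetical abbreviation list; a real system would use a larger one or a classifier.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ph.d.", "u.s.a.", "m.p.h.", "e.g.", "i.e."}


def split_sentences(text: str) -> list:
    """Split text into sentences using whitespace tokens and an abbreviation rule."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        ends_with_stop = token.endswith((".", "!", "?"))
        # Case (a): the period is part of the word, so keep going.
        # Case (b): the punctuation marks a sentence boundary.
        if ends_with_stop and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


print(split_sentences("Dr. Smith drove at 60 m.p.h. today. He was late!"))
```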

Space-based tokenization: segment, sort, and count
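The segment, sort, and count steps can be sketched with the standard library. The sample sentence is made up, and lowercasing before counting is an assumption of this example. The same counts also give the number of instances $N$ and types $|V|$.

```python
from collections import Counter

text = "the cat sat on the mat and the dog sat too"

# Segment: split on whitespace after lowercasing.
tokens = text.lower().split()

# Count, then sort by descending frequency (alphabetical within ties).
counts = Counter(tokens)
ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

for word, freq in ranked:
    print(freq, word)

print("N =", len(tokens), " |V| =", len(counts))
```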

