Lecture 002

utterance: "I do uh main- mainly business data processing" (contains disfluencies: the filled pause "uh" and the fragment "main-")

type (V): an element of the vocabulary V

instance (N): an instance of that type in running text

"I'm" is orthographically one word, but grammatically two words

Corpus                      Types = |V|    Instances = N
Shakespeare                 31,000         884,000
Brown Corpus                38,000         1 million
Switchboard conversations   20,000         2.4 million
COCA                        2 million      440 million
Google N-grams              13+ million    1 trillion

Heaps' Law = Herdan's Law: |V| = kN^\beta (where k and \beta are positive constants, 0 < \beta < 1; here \beta \approx 0.5)
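A small numeric sketch of the law, assuming \beta = 0.5 as in these notes. The helper names `fit_k` and `heaps_vocab_size` are made up for illustration, and fitting k from a single table row is only a rough calibration, not a proper estimate.

```python
def heaps_vocab_size(n_instances: float, k: float, beta: float = 0.5) -> float:
    """Predicted number of types |V| = k * N**beta."""
    return k * n_instances ** beta


def fit_k(n_types: float, n_instances: float, beta: float = 0.5) -> float:
    """Solve |V| = k * N**beta for k, given one observed (|V|, N) pair."""
    return n_types / n_instances ** beta


# Brown Corpus row from the table: |V| = 38,000 and N = 1 million.
k_brown = fit_k(38_000, 1_000_000)

# Under beta = 0.5, quadrupling N only doubles the predicted vocabulary.
predicted = heaps_vocab_size(4_000_000, k_brown)
print(k_brown, predicted)
```

The point of the sketch is the sublinear growth: the vocabulary keeps growing with N, but more slowly than the corpus itself.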

Function words: of, the, is, and, una, 是

Content words: mango, braise, snowy, feliz, 北京

Tria, Loreto, Servedio, 2018

Morpheme: a minimal meaning-bearing unit in a language (cats: two morphemes cat and –s)

Morphology: the study of morphemes

root: central morpheme of the word (e.g. work, camera)

affix: a morpheme that adds additional meaning to the root (e.g. -ed, -s)

Inflectional morphemes:

Derivational morphemes

Clitic: a morpheme that acts syntactically like a word but is attached to another word (e.g. 've in I've)

Morphological typology: a way of classifying the languages of the world that groups languages according to their common morphological structures.

Morphemes per Word

Agglutinative languages: Very clean boundaries between morphemes

Fusional languages: a single affix may conflate multiple morphemes (e.g. English, where "-s" marks both "third person" and "present tense"; Russian)

ASCII: the high bit is set to 0, so 7 bits = 128 characters, but only 95 are printable. The remaining 33 are control codes, originally for teletypes.

Unicode: about 150,000 characters. Each character is identified by a code point (about 1.1 million possible, U+0000 to U+10FFFF, but only about 150,000 used), e.g. U+0061. The first 128 code points (U+0000–U+007F) = ASCII. A code point has no visuals; it is not a glyph (so the same emoji can look different on different OSes)

Python's len() on a str returns the number of Unicode code points (not bytes, not glyphs)
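A quick check of this, as a sketch. The sample string is an arbitrary choice and is written with escapes so the code points are unambiguous.

```python
# "naïve 北京" written with escapes: ï is U+00EF, 北 is U+5317, 京 is U+4EAC.
s = "na\u00efve \u5317\u4eac"

n_code_points = len(s)               # counts code points
n_bytes = len(s.encode("utf-8"))     # counts bytes after UTF-8 encoding

print(n_code_points, n_bytes)
print([f"U+{ord(ch):04X}" for ch in s])

# One visible character can be several code points:
# "e" followed by a combining acute accent (U+0301).
combining = "e\u0301"
print(len(combining))
```

The two counts differ because the non-ASCII characters take more than one byte each in UTF-8.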

UTF-8: storing every code point directly at a fixed width wastes storage space, so UTF-8 uses a variable-length encoding (1 to 4 bytes per code point)

Bytes   Byte pattern (binary)                  Meaning
1       0xxxxxxx                               ASCII (U+0000–U+007F)
2       110xxxxx 10xxxxxx                      U+0080–U+07FF
3       1110xxxx 10xxxxxx 10xxxxxx             U+0800–U+FFFF
4       11110xxx 10xxxxxx 10xxxxxx 10xxxxxx    U+10000–U+10FFFF
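The table can be turned into code fairly directly. The following is a minimal sketch of an encoder for a single code point. `utf8_encode` is a hypothetical helper written for these notes, and it deliberately ignores details such as rejecting surrogate code points (U+D800 to U+DFFF), which Python's built-in encoder does handle.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point by filling the x bits of the table's byte patterns."""
    if cp < 0x80:                      # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                     # 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (cp >> 6),
                      0b10000000 | (cp & 0x3F)])
    if cp < 0x10000:                   # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (cp >> 12),
                      0b10000000 | ((cp >> 6) & 0x3F),
                      0b10000000 | (cp & 0x3F)])
    if cp <= 0x10FFFF:                 # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0b11110000 | (cp >> 18),
                      0b10000000 | ((cp >> 12) & 0x3F),
                      0b10000000 | ((cp >> 6) & 0x3F),
                      0b10000000 | (cp & 0x3F)])
    raise ValueError("code point out of Unicode range")


# One sample character per row of the table.
for ch in ["a", "\u00e9", "\u5317", "\U0001F600"]:
    encoded = utf8_encode(ord(ch))
    print(f"U+{ord(ch):04X}", len(encoded), " ".join(f"{b:08b}" for b in encoded))
```

In practice `str.encode("utf-8")` does this; the sketch is only meant to show where the bits of the code point end up.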

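Given a byte, its leading bits identify its role: 0 = single-byte (ASCII) character; 110, 1110, 11110 = first byte of a 2-, 3-, or 4-byte character; 10 = continuation byte.

This makes UTF-8 robust against errors: random access works and desync is recoverable, because the nearest character boundary can always be found by moving at most 3 bytes.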
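The resynchronization property can be sketched in a few lines: continuation bytes always start with the bits 10, so from any byte offset we back up until we hit a byte that does not. `char_start` is a hypothetical helper name, and the sketch assumes the input is valid UTF-8.

```python
def char_start(data: bytes, i: int) -> int:
    """Return the index of the first byte of the character that contains data[i]."""
    # Continuation bytes match 10xxxxxx; skip backwards over them.
    while i > 0 and (data[i] & 0b11000000) == 0b10000000:
        i -= 1
    return i


# "a" (1 byte), U+5317 (3 bytes), U+1F600 (4 bytes).
data = "a\u5317\U0001F600".encode("utf-8")

# Jump into the middle of the last character and recover its start.
start = char_start(data, 6)
print(start, data[start:].decode("utf-8") == "\U0001F600")
```

For valid input the loop steps back at most 3 times, since no character is longer than 4 bytes.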

No such thing as a text file without an encoding

Tokenization:

Byte-Pair Encoding (BPE): iteratively merge the most frequent pair of neighboring tokens to create longer tokens. The number of merges k must be specified. Merges do not cross whitespace (the whitespace is attached to the start of each string). To encode new text, apply the learned merges in the order they were learned.
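A toy sketch of the procedure, not a production tokenizer. It assumes character-level starting tokens (real systems typically start from bytes), uses a simple ` ?\S+` pretokenizer to attach one leading space to each word, and breaks frequency ties by first occurrence. The function names are illustrative.

```python
import re
from collections import Counter


def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` in `symbols` with the merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def train_bpe(text, k):
    """Learn up to k merges; pairs are counted inside words only."""
    # Each word keeps its leading space, so merges never cross whitespace.
    word_counts = Counter(re.findall(r" ?\S+", text))
    words = {w: list(w) for w in word_counts}
    merges = []
    for _ in range(k):
        pair_counts = Counter()
        for w, symbols in words.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += word_counts[w]
        if not pair_counts:            # every word is already a single token
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        words = {w: merge_pair(s, best) for w, s in words.items()}
    return merges


def encode(text, merges):
    """Tokenize new text by replaying the merges in the order they were learned."""
    tokens = []
    for word in re.findall(r" ?\S+", text):
        symbols = list(word)
        for pair in merges:
            symbols = merge_pair(symbols, pair)
        tokens.extend(symbols)
    return tokens


corpus = "low low lower newest newest"
merges = train_bpe(corpus, k=10)
print(merges)
print(encode("lowest newer", merges))
```

Recomputing all pair counts on every iteration is quadratic and only acceptable for a toy corpus; real implementations update counts incrementally.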

Unigram language modeling tokenization (SentencePiece)

Corpora

Regular Expression: re.search(pattern,string)
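A minimal usage sketch with the standard-library re module. The pattern and the sample string are arbitrary choices for illustration.

```python
import re

# re.search scans the string and returns the first match, or None if there is none.
m = re.search(r"\d+", "That costs $12.40")
if m:
    print(m.group(), m.span())     # matched text and its (start, end) offsets

# No digits here, so the result is None.
no_match = re.search(r"\d+", "no digits here")
print(no_match)
```

Because the result may be None, it is usual to test the match object before calling methods on it.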

Pretokenization for BPE: r"'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"

This is quite a complex regular expression that also uses some advanced Unicode-related features we haven’t described yet. These features are part of a popular external Python 3 library called regex (as opposed to the built-in Python library called re)

The Python regex (as opposed to re) library has special \p and \P operators:

>>> import regex as re
>>> pat = re.compile(
... # Contractions: 't and 'm are tokens
... r"'s|'t|'re|'ve|'m|'ll|'d|"
... # Words: sequence of Unicode letters (after optional space)
... r" ?\p{L}+|"
... # Numbers: sequence of digits (after optional space)
... r" ?\p{N}+|"
... # Punctuation: sequence of non-alphanumeric/non-space
... # (after optional space)
... r" ?[^\s\p{L}\p{N}]+|"
... # whitespace
... r"\s+(?!\S)|\s+"
... )
>>> text = "We're 350 dogs! Um, lunch?"
>>> print(pat.findall(text))
['We', "'re", ' 350', ' dogs', '!', ' Um', ',', ' lunch', '?']
>>>

SUPERBPE: runs a second stage of BPE, allowing merges across spaces and punctuation.

Rule-based tokenization: uncommon, but used to obtain grammatical words

Penn Treebank Tokenization Standard (Tokenization in NLTK):

>>> import nltk
>>> text = 'That U.S.A. poster-print costs $12.40...'
>>> pattern = r'''(?x) # set flag to allow verbose regexps
... (?:[A-Z]\.)+ # abbreviations, e.g. U.S.A.
... | \w+(?:-\w+)* # words with optional internal hyphens
... | \$?\d+(?:\.\d+)?%? # currency, percentages, e.g. $12.40, 82%
... | \.\.\. # ellipsis
... | [][.,;"'?():_`-] # these are separate tokens; includes ], [
... '''
>>> nltk.regexp_tokenize(text, pattern)
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']

Sentence segmentation, common algorithm: tokenize first, using rules or ML to classify each period as either (a) part of a word or (b) a sentence boundary

Sentence segmentation can then often be done by rules based on this tokenization.
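A rule-based sketch of this two-step idea using only the standard library. The token pattern is a simplified variant of the NLTK-style regex in these notes (with `!` added to the punctuation set and the number rule placed before the word rule), and `split_sentences` is a hypothetical helper. It will mis-handle cases such as an abbreviation that also ends a sentence.

```python
import re

TOKEN = re.compile(r"""
    (?:[A-Z]\.)+            # abbreviations such as U.S.A. keep their periods
  | \$?\d+(?:\.\d+)?%?      # currency and percentages keep their decimal point
  | \w+(?:-\w+)*            # words with optional internal hyphens
  | \.\.\.                  # ellipsis
  | [.,;"'?():_`!-]         # remaining punctuation as separate tokens
""", re.VERBOSE)


def split_sentences(text):
    """Tokenize, then close a sentence at every stand-alone '.', '?' or '!' token."""
    sentences, current = [], []
    for tok in TOKEN.findall(text):
        current.append(tok)
        if tok in {".", "?", "!"}:
            sentences.append(current)
            current = []
    if current:                     # trailing tokens without final punctuation
        sentences.append(current)
    return sentences


example = "The U.S.A. is large. It costs $12.40! Really?"
for sentence in split_sentences(example):
    print(sentence)
```

The periods inside U.S.A. and $12.40 never become stand-alone tokens, so only the remaining period tokens are treated as sentence boundaries.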

Space-based tokenization: segment, sort, and count
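The segment, sort, and count recipe can be sketched in Python as follows. The sample sentence is made up, and splitting on whitespace alone leaves punctuation and case untouched, which is only reasonable for a toy input like this one.

```python
from collections import Counter

text = "the cat sat on the mat the end"

# Segment: split on runs of whitespace.
tokens = text.split()

# Count: map each type to its number of instances.
counts = Counter(tokens)

# Sort: most frequent first, ties broken alphabetically.
ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

print(len(tokens), "instances,", len(counts), "types")
for word, n in ranked:
    print(n, word)
```

This also makes the type and instance distinction concrete: N is `len(tokens)` and |V| is `len(counts)`.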

Edit Distance: Here
