Why:
Instruction Alignment: makes responses easy to parse and reduces inference time
Task Alignment: consider all options when generating a response, and explain the reasoning.
NNSearch: (1) Inverted File Index (2) Hierarchical Navigable Small Worlds (3) RAGatouille // TODO
RAG with abbreviation injection
Textual Entailment Recognition: given two text fragments, determine whether the meaning of one text is entailed by (can be inferred from) the other text. (no neutral case in our problem)
Failure Modes: (1) no retrieval (2) the scoring system currently cannot select options like "None of the above", due to the evaluation.
MT Problems: (1) Lexical divergences: no one-to-one mapping in word meaning (2) Structural divergences: Syntax, word order; Syntax-semantics relationship
Solution: linking words (If a word in the target frequently co-occurs with a word in the source, these will be, over several iterations, aligned with relatively greater frequency)
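A minimal sketch of the iterative linking idea above. The notes do not name a method; this uses IBM Model 1 style EM as one common way to realize it. The function name `train_alignment`, the toy sentence pairs, and the omission of a NULL source word are all assumptions made for illustration.

```python
from collections import defaultdict

def train_alignment(pairs, iterations=10):
    """pairs: list of (source_tokens, target_tokens).
    Returns t[(target_word, source_word)], an estimate of P(target | source).
    Simplified IBM Model 1 style EM (no NULL word, hypothetical helper)."""
    # Start with equal weight for every co-occurring word pair.
    t = {}
    for src, tgt in pairs:
        for tw in tgt:
            for sw in src:
                t[(tw, sw)] = 1.0
    for _ in range(iterations):
        count = defaultdict(float)  # expected link counts
        total = defaultdict(float)  # expected counts per source word
        # E-step: split each target word's mass over the source words.
        for src, tgt in pairs:
            for tw in tgt:
                norm = sum(t[(tw, sw)] for sw in src)
                for sw in src:
                    frac = t[(tw, sw)] / norm
                    count[(tw, sw)] += frac
                    total[sw] += frac
        # M-step: renormalize so probabilities sum to 1 per source word.
        t = {(tw, sw): c / total[sw] for (tw, sw), c in count.items()}
    return t

pairs = [
    ("das Haus".split(), "the house".split()),
    ("das Buch".split(), "the book".split()),
    ("ein Buch".split(), "a book".split()),
]
t = train_alignment(pairs)
```

On this toy data, "the" co-occurs with "das" in two sentence pairs while "house" and "book" each co-occur with it only once, so over the iterations the das/the link gains probability at the expense of the others.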
BLEU scores are based on token n-gram overlap
Very sensitive to tokenization
Unnecessarily complicated
Doesn’t correlate with human judgments as well as simpler metrics
A better alternative: chrF (character F-score)
chrF: A good machine translation will tend to contain characters and words that occur in a human translation of the same sentence. Correlates with human judgments quite well while being robust to tokenization differences.
Precision-chrP: percentage of character 1-grams, 2-grams, ..., k-grams in the hypothesis that occur in the reference, averaged.
Recall-chrR: percentage of character 1-grams, 2-grams,..., k-grams in the reference that occur in the hypothesis, averaged.
k=6, chrF_beta = (1 + beta^2) * (chrP * chrR) / (beta^2 * chrP + chrR)
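A small sketch of the chrP / chrR / chrF_beta definitions above, for a single hypothesis and reference. Several details are assumptions not stated in the notes: `beta=2.0` as the default, clipped n-gram counts via `Counter` intersection, whitespace kept as ordinary characters, and n-gram orders longer than either string being skipped when averaging. The function names are hypothetical.

```python
from collections import Counter

def char_ngrams(text, n):
    """Multiset of character n-grams in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf(hypothesis, reference, k=6, beta=2.0):
    """Sentence-level chrF sketch: average char n-gram precision and
    recall over n = 1..k, then combine them with the F_beta formula."""
    precisions, recalls = [], []
    for n in range(1, k + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # assumption: skip orders longer than a string
        overlap = sum((hyp & ref).values())  # clipped matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    chr_p = sum(precisions) / len(precisions)
    chr_r = sum(recalls) / len(recalls)
    denom = beta ** 2 * chr_p + chr_r
    if denom == 0:
        return 0.0
    return (1 + beta ** 2) * chr_p * chr_r / denom

score = chrf("the cat sat on a mat", "the cat sat on the mat")
```

With beta > 1 the score weights recall more heavily than precision; beta = 1 gives the plain harmonic mean of chrP and chrR.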

RNNs:
The n-gram LM: Context size is the n − 1 prior words we condition on.
The feedforward LM: Context is the window size.
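The RNN LM: No fixed context size; h_{t-1} represents the entire history
Weight tying: share the input embedding matrix with the final weight matrix that produces the logits
Teacher forcing: during training, feed the ground-truth previous token as the next input instead of the model's own prediction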


LSTM: motivated by vanishing gradients when backpropagating too far through time. Its gates manage the context by:
removing info no longer needed from the context,
adding info likely to be needed for later decision making
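To make the removing/adding behavior concrete, here is a scalar sketch of one LSTM step using the standard gate equations (forget gate removes, input gate adds). Real LSTMs use weight matrices and vectors; the scalar weights, the dictionary layout `w`, and the name `lstm_step` are illustrative assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    """One scalar LSTM step. w maps hypothetical names such as 'f_x',
    'f_h', 'f_b' to the input weight, recurrent weight, and bias."""
    def pre(gate):
        return w[gate + "_x"] * x + w[gate + "_h"] * h_prev + w[gate + "_b"]
    f = sigmoid(pre("f"))    # forget gate: how much old context to keep
    i = sigmoid(pre("i"))    # input gate: how much new info to add
    g = math.tanh(pre("g"))  # candidate new information
    o = sigmoid(pre("o"))    # output gate
    c = f * c_prev + i * g   # remove unneeded info, then add new info
    h = o * math.tanh(c)     # hidden state exposed to the next step
    return h, c

def make_weights(f_b=0.0, i_b=0.0):
    """All-zero weights except the forget and input gate biases."""
    w = {g + s: 0.0 for g in "figo" for s in ("_x", "_h", "_b")}
    w["f_b"], w["i_b"] = f_b, i_b
    return w

# Forget gate open, input gate closed: the cell state is carried over.
_, c_keep = lstm_step(1.0, 0.0, 0.7, make_weights(f_b=20.0, i_b=-20.0))
# Forget gate closed: the old context is erased.
_, c_drop = lstm_step(1.0, 0.0, 0.7, make_weights(f_b=-20.0, i_b=-20.0))
```

Because the cell state update is additive (`f * c_prev + i * g`), a forget gate near 1 lets information pass across many steps largely unchanged, which is what eases the vanishing gradient issue.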

Closed class words: function words - short, frequent words with grammatical function
determiners: a, an, the
pronouns: she, he, I
prepositions: on, under, over, near, by, ...
Open class words: content words: Nouns, Verbs, Adjectives, Adverbs

Supervised ML for Part of Speech Tagging:
Why: TTS, MT, parsing, sentiment analysis, and other language-analytic computational tasks
Hidden Markov Models
Conditional Random Fields (CRF)/ Maximum Entropy Markov Models (MEMM)
Neural sequence models (RNNs or Transformers)
Large Language Models (like BERT), finetuned
Named Entity Recognition (NER): PER (person), LOC (location), ORG (organization), GPE (geo-political entity)
BIO tagging: begin, inside, outside; needs 2n+1 tags for n entity types (a B- and an I- tag per type, plus a single O tag)
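A short sketch of the BIO scheme, showing where the 2n+1 count comes from and how entity spans map to per-token tags. The helper names and the `(start, end_exclusive, type)` span format are assumptions chosen for illustration, and the example sentence is made up.

```python
def bio_tagset(entity_types):
    """One O tag plus a B- and I- tag per entity type: 2n + 1 tags."""
    tags = ["O"]
    for etype in entity_types:
        tags += [f"B-{etype}", f"I-{etype}"]
    return tags

def spans_to_bio(tokens, spans):
    """spans: list of (start, end_exclusive, entity_type) over token indices."""
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        tags[start] = f"B-{etype}"      # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"      # remaining tokens of the entity
    return tags

tagset = bio_tagset(["PER", "LOC", "ORG", "GPE"])
tokens = ["Jane", "Villanueva", "of", "United", "Airlines"]
tags = spans_to_bio(tokens, [(0, 2, "PER"), (3, 5, "ORG")])
```

Framing NER this way turns span detection into the same per-token labeling problem as part-of-speech tagging, so the same sequence models apply.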
Algorithm:
Hidden Markov Models
Conditional Random Fields (CRF)/ Maximum Entropy Markov Models (MEMM)
Neural sequence models (RNNs or Transformers)
Large Language Models (like BERT), finetuned