NLP

Read The Illustrated Transformer. You can also watch the video, but it is quite long (and therefore not recommended).

Common NLP Tasks:

Token classification: classify each word into classes (remember: token != word)
Masked language modeling: fill in the blank
- Metrics: perplexity (exponentiated cross-entropy) or cross-entropy; more sophisticated metrics such as BLEU or ROUGE also apply, since a blank can be filled with multiple correct answers
- To deal with variable text length, either: zero-pad to the longest sequence, chunk long text, or concatenate everything together with a separator token that indicates the context transition (best)
- good for fine-tuning an encoder before adapting it to other tasks
- BERT's random masking: handled by the collator
  - 80% of the time: Replace the word with the [MASK] token, e.g., my dog is hairy -> my dog is [MASK]
  - 10% of the time: Replace the word with a random word, e.g., my dog is hairy -> my dog is apple. This forces the model to generate a proper contextual embedding for all tokens in the sequence, not only the [MASK] ones. This is consistent with the goal of fine-tuning.
  - 10% of the time: Keep the word unchanged, e.g., my dog is hairy -> my dog is hairy. The purpose of this is to bias the representation towards the actual observed word. Otherwise, the model will only look at the context to define the word, and never look at the word itself.
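The 80/10/10 rule above can be sketched in plain Python. This is only an illustration, not the collator's actual code: the function name `bert_mask`, the toy vocabulary, and the 15% selection rate (`mask_prob`, the value reported in the BERT paper but not stated in these notes) are assumptions. In Hugging Face Transformers the real work is typically done by `DataCollatorForLanguageModeling`.

```python
import random

MASK = "[MASK]"

def bert_mask(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (corrupted_tokens, labels).

    labels[i] is the original token where a prediction is required,
    and None where the position is not part of the loss.
    """
    rng = rng or random.Random()
    corrupted, labels = [], []
    for tok in tokens:
        # Most positions are left alone and produce no label.
        if rng.random() >= mask_prob:
            corrupted.append(tok)
            labels.append(None)
            continue
        labels.append(tok)
        roll = rng.random()
        if roll < 0.8:
            corrupted.append(MASK)               # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))  # 10%: replace with a random word
        else:
            corrupted.append(tok)                # 10%: keep the word unchanged
    return corrupted, labels
```

Calling `bert_mask("my dog is hairy".split(), ["apple", "cat"], rng=random.Random(0))` returns the possibly corrupted sentence together with the labels the model has to recover.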
Translation: BLEU score (geometric mean of the 1- to 4-gram clipped precisions, where each n-gram is a contiguous subsequence matched against the reference, multiplied by a brevity penalty for short outputs)
- BLEU Score: widely used and simple, but
  - doesn't consider meaning
  - doesn't consider sentence structure
  - may not work well for non-English languages
  - hard to compare across different tokenizers
- SacreBLEU: uses built-in tokenizers for consistent comparison
- Tokenizer: must use a different tokenizer (and padding rule) if the language is different
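To make the BLEU definition concrete, here is a simplified sentence-level sketch with a single reference. It assumes pre-tokenized input, uses no smoothing, and adds the standard brevity penalty; it is not a replacement for SacreBLEU, and the function names are made up for illustration.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams found in the reference,
    where each n-gram is credited at most as often as it occurs there."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total

def bleu(candidate, reference, max_n=4):
    """Geometric mean of 1..max_n clipped precisions times a brevity penalty."""
    precisions = [clipped_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # without smoothing, one empty n-gram order zeroes the score
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * geo_mean
```

Clipping is what stops a degenerate output such as "the the the the" from scoring a perfect unigram precision against "the cat is on the mat": only two of its four "the" tokens are credited.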
Summarization: ROUGE score
- ROUGE1-F1: F1 (harmonic mean of precision and recall) of matching unigrams
- ROUGE1, ROUGE2: 1-gram and 2-gram versions, recall only
- ROUGE-L: longest common subsequence (might not be contiguous) within a sentence; captures sentence structure better. Averaged over sentences.
- ROUGE-LSum: longest common subsequence over the whole summary
- as_target_tokenizer: used when the labels need a different tokenizer than the inputs
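ROUGE-L is built on the longest common subsequence (LCS), which can be computed with a small dynamic program. The sketch below scores one candidate sentence against one reference sentence; real ROUGE implementations add stemming, sentence splitting and aggregation, and the function names here are made up for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common (not necessarily contiguous) subsequence."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(candidate, reference):
    """LCS-based precision, recall and F1 for two token lists."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}
```

For "the cat sat on the mat" against "the cat was on the mat" the LCS is "the cat on the mat" (5 tokens), even though the match is interrupted in the middle.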
Causal language modeling (autoregressive modeling): Perplexity and CrossEntropy
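Perplexity is the exponential of the mean cross-entropy. A minimal sketch, assuming we already have the probability the model assigned to each correct token (the function names are made up for illustration):

```python
import math

def cross_entropy(token_probs):
    """Mean negative log-likelihood (in nats) of the correct tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def perplexity(token_probs):
    """Exponentiated cross-entropy."""
    return math.exp(cross_entropy(token_probs))
```

A model that assigns probability 0.25 to every correct token has a perplexity of about 4: it is as uncertain as a uniform choice among four options.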
Question answering: given a context and a question, mark the answer span in the text. Metric: F1 score. stride is the number of overlapping tokens between 2 consecutive truncated sections (a truncated section might contain only part of the answer, or none of it; in either case we label it as no answer).
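The stride idea can be sketched as a sliding window over the context tokens. The helper names and the half-open `(start, end)` span convention are assumptions for illustration; in practice the tokenizer performs the windowing.

```python
def sliding_windows(tokens, max_len, stride):
    """Split tokens into windows of at most max_len tokens,
    where consecutive windows share `stride` tokens."""
    if not 0 <= stride < max_len:
        raise ValueError("stride must satisfy 0 <= stride < max_len")
    step = max_len - stride
    windows, start = [], 0
    while True:
        windows.append((start, tokens[start:start + max_len]))
        if start + max_len >= len(tokens):
            break
        start += step
    return windows

def label_window(window_start, window_len, answer_start, answer_end):
    """Return the answer span relative to the window, or None ("no answer")
    when the window does not contain the whole answer."""
    window_end = window_start + window_len
    if answer_start >= window_start and answer_end <= window_end:
        return (answer_start - window_start, answer_end - window_start)
    return None
```

With 10 tokens, `max_len=4` and `stride=2`, the windows start at 0, 2, 4 and 6. An answer covering tokens 3 and 4 is cut in half by the first window but falls entirely inside the second one, which is why the overlap is useful.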

Transformers: traditionally, transformer means the transformer module introduced in the original "Attention is all you need" paper. However, since this architecture can be found in many models, we usually refer to those models collectively as "transformers".

Sequence to Sequence: a task, such as language translation or a chatbot. Sequence to sequence is generally solved using an encoder and a decoder. The encoder encodes every word in parallel into a sequence of embeddings, where each embedding carries context information. The decoder takes in the whole sequence of embeddings generated by the encoder, as well as its own sequence of answer tokens so far, to produce the next answer token.

RNN Attention: instead of sending the encoded sequence to the decoder once, we send it again for every next-token prediction.

Encoder-only Architecture: ALBERT, BERT, DistilBERT, ELECTRA, RoBERTa. Trained to predict masked words with no causal mask (each token can attend to both sides).

Decoder-only Architecture: CTRL, GPT, GPT-2, Transformer XL. Compared to the encoder, the words to the right are masked. Can be used for sequence generation. If a generated sequence exceeds the model's maximum context size, the model cannot remember the first words it generated.

Encoder-decoder: BART, T5, Marian, mBART.

Loss: loss propagates through time. For an RNN decoder, if the model predicts a wrong token, the next decoder input has a 50% chance of being the ground-truth token instead of the predicted token (teacher forcing).
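The 50% rule can be sketched as a coin flip at every decoding step. Here `step_fn` is a hypothetical stand-in for one decoder step that maps the current input token to a predicted token; the function names and the `teacher_forcing_ratio` parameter are assumptions for illustration.

```python
import random

def next_decoder_input(predicted, ground_truth, teacher_forcing_ratio=0.5, rng=random):
    """Pick the ground-truth token with probability teacher_forcing_ratio,
    otherwise feed the model's own prediction back in."""
    if rng.random() < teacher_forcing_ratio:
        return ground_truth
    return predicted

def decode_with_teacher_forcing(step_fn, targets, start_token,
                                teacher_forcing_ratio=0.5, rng=random):
    """Run the decoder over the target length and record what it was fed."""
    inputs_used, predictions = [], []
    current = start_token
    for target in targets:
        inputs_used.append(current)
        predicted = step_fn(current)  # hypothetical single decoder step
        predictions.append(predicted)
        current = next_decoder_input(predicted, target, teacher_forcing_ratio, rng)
    return inputs_used, predictions
```

A ratio of 1.0 always feeds the ground truth (pure teacher forcing), 0.0 always feeds the model's own output, and 0.5 gives the mix described above.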

Tokenizers: translate words to numbers, a step before feeding into the encoder. (Tokenizers are usually not trained with gradient descent but with deterministic, heuristic-based rules that optimize for the best word splitting given the data distribution.)

types: different tokenization methods
- word based: split on spaces. bad for unknown words (nobody actually uses it)
- character based: split into characters. bad for Chinese and lengthens the input (nobody actually uses it)
- subword: decompose tokenization into token and ##ization
  - WordPiece: used by BERT and DistilBERT
  - Unigram: used by XLNet and ALBERT
  - Byte-Pair Encoding: used by GPT-2, RoBERTa
loading: we load two parts, the algorithm (code that splits text into subwords and adds special tokens) and the vocabulary (data such as dictionary mappings that translate subword strings to numbers).
sequence batch: since we want to train in batches, the model should support batched inference. If the lengths of sequences in a batch don't match, we zero-pad them and assign no attention to the padded part (and also truncate or chunk if the input is longer than what the model expects). Batches might have different shapes (might be slower on TPU, but faster on GPU).
sentence pair: since many datasets are about properties of a pair of sentences, we often add a field to the training data to indicate which sentence is the first and which is the second (in order).
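The padding described under "sequence batch" can be sketched as follows. The function name, the `pad_id=0` default and the 1/0 mask convention are assumptions that mirror common practice; a real tokenizer does this for you.

```python
def pad_batch(sequences, pad_id=0, max_len=None):
    """Pad (and optionally truncate) token-id lists to a common length.

    Returns (input_ids, attention_mask); the mask is 1 for real tokens
    and 0 for padding, so the model can ignore the padded part.
    """
    longest = max(len(seq) for seq in sequences)
    target = longest if max_len is None else min(longest, max_len)
    input_ids, attention_mask = [], []
    for seq in sequences:
        seq = list(seq[:target])       # truncate sequences that are too long
        n_pad = target - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask
```

Padding to the longest sequence in each batch, rather than to a global maximum, is what makes batch shapes vary from one batch to the next.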

Model:

head: An additional component, usually made up of one or a few layers, to convert the transformer predictions to a task-specific output

Accelerate: HuggingFace's distributed training library, integrated with PyTorch

In order to better understand the role of [CLS] let's recall that BERT model has been trained on 2 main tasks:

Masked language modeling: some random words are masked with [MASK] token, the model learns to predict those words during training. For that task we need the [MASK] token.

Next sentence prediction: given 2 sentences, the model learns to predict if the 2nd sentence is the real sentence, which follows the 1st sentence. For this task, we need another token, output of which will tell us how likely the current sentence is the next sentence of the 1st sentence. And here comes the [CLS]. You can think about the output of [CLS] as a probability.

Now you may ask the question: can we instead of using [CLS]'s output just outputting a number (as probability)? Yes, we can do that if the task of predicting next sentence is a separate task. However, BERT has been trained on both tasks simultaneously. Organizing inputs and outputs in such a format (with both [MASK] and [CLS]) will help BERT to learn both tasks at the same time and boost its performance.

When it comes to classification task (e.g. sentiment classification), as mentioned in other answers, the output of [CLS] can be helpful because it contains BERT's understanding at the sentence-level.

Answer by hoang tran on stackoverflow.

Here're my understandings:

(1)[CLS] appears at the very beginning of each sentence, it has a fixed embedding and a fix positional embedding, thus this token contains no information itself. (2)However, the output of [CLS] is inferred by all other words in this sentence, so [CLS] contains all information in other words.

This makes [CLS] a good representation for sentence-level classification.

Answer by BigMoyan on stackoverflow.

Model	BPE	WordPiece	Unigram
Training	Starts from a small vocabulary and learns rules to merge tokens	Starts from a small vocabulary and learns rules to merge tokens	Starts from a large vocabulary and learns rules to remove tokens
Training step	Merges the tokens corresponding to the most common pair	Merges the tokens corresponding to the pair with the best score based on the frequency of the pair, privileging pairs where each individual token is less frequent	Removes all the tokens in the vocabulary that will minimize the loss computed on the whole corpus
Learns	Merge rules and a vocabulary	Just a vocabulary	A vocabulary with a score for each token
Encoding	Splits a word into characters and applies the merges learned during training	Finds the longest subword starting from the beginning that is in the vocabulary, then does the same for the rest of the word	Finds the most likely split into tokens, using the scores learned during training
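The WordPiece encoding row in the table (greedy longest match from the left) can be sketched as below. The toy vocabulary, the `##` continuation prefix and the `[UNK]` fallback follow BERT's conventions; the function itself is only an illustration, not the library implementation.

```python
def wordpiece_encode(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into subwords."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # mark pieces that continue a word
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces
```

With a vocabulary containing `token` and `##ization`, the word `tokenization` splits into exactly those two pieces, matching the subword example earlier in these notes.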

To predict the translation of "eats", a traditional RNN has a long data flow (in red). The intuition is to shortcut the data flow by allowing the decoder to select and read relevant hidden states generated by the encoder directly (in green).

Traditional RNN: process one token at a time

encoder
- input: current word vector, previous word embedding
- output: current word embedding
decoder
- input: last word embedding, or generated previous word embedding
- output: generated word vector, generated next word embedding

In a traditional RNN, the entire sequence corresponds to one backprop. However, in transformers, every token in the sequence corresponds to one backprop (assuming batch_size = 1). This also mitigates the vanishing gradient issue.

This video is very good in terms of explaining Transformers.

Transformers: feed the x (input) sequence to the encoder, feed the masked y (ground truth) sequence to the decoder

Positional Encoding: add data to establish ordering of sequence, since we no longer use RNN and process tokens in parallel. The position encoding is
Masked Input: different than CV, we actually send ground truth to decoder, but mask out the things that we want the model to predict. This is so that we can train transformer in parallel (without using what was predicted).
Attention: there are 3 inputs and 1 output. We take the dot product of the Key and Query vectors to select the matching key (softmax exponentiates the scores to make the selection pseudo-differentiable); after the dot product and softmax, you get a distribution with a peak at the selected area, which is then used to weight the Values. (for the first token, we copy one token to all 3 inputs)
- Value: embeddings generated by encoder
- Key: "key" generated by encoder
- Query: "query" generated by decoder when looking at "what we generated so far"
Multi-headed: Multiple attention heads in a single layer in a transformer is analogous to multiple kernels in a single layer in a CNN: they have the same architecture, and operate on the same feature-space, but since they are separate 'copies' with different sets of weights, they are hence 'free' to learn different functions.
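A single attention head can be sketched in plain Python. Two things here are not in the notes and should be read as assumptions: the scaling by sqrt(d_k), taken from the original paper, and the optional `causal` flag, which hides tokens to the right as in a decoder. A real implementation would use batched matrix operations in PyTorch instead of lists.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values, causal=False):
    """Scaled dot-product attention for one head.

    queries, keys and values are lists of vectors (lists of floats).
    Returns (outputs, weights), one row per query.
    """
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        # The Query/Key dot product measures how well each key matches the query.
        scores = [sum(qc * kc for qc, kc in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        if causal:
            # Positions to the right of the current token get zero weight.
            scores = [s if j <= i else float("-inf")
                      for j, s in enumerate(scores)]
        weights = softmax(scores)
        # The output is the weighted average of the Value vectors.
        out = [sum(w * v[c] for w, v in zip(weights, values))
               for c in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights
```

Multi-headed attention runs several such heads side by side, each with its own learned projections of the inputs, and concatenates their outputs.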

Here is how you would implement and train a transformer from scratch with PyTorch.

BERT

BERT is an encoder-only transformer architecture, meaning it does [MASK] prediction with no causal masking.

Goal: By predicting [MASK], we can ensure that the embedding before [MASK] has the meaning of the idx([MASK])-th input word.
[CLS] token: According to the linked source, although the original paper suggests that the [CLS] embedding should be used for extracting sentence meaning, the original author noted "I'm not sure what these vectors are, since BERT does not generate meaningful sentence vectors. It seems that this is is doing average pooling over the word tokens to get a sentence vector, but we never suggested that this will generate meaningful sentence representations". So [CLS] should only be used when fine-tuning on sentence-level tasks.
