# NLP

• Token classification: classify each token into classes (remember token != word)

• Masked language modeling: fill in the blank

• Perplexity (exponentiated cross-entropy) or CrossEntropy, or more sophisticated metrics such as BLEU or ROUGE, since a blank can be filled with multiple correct answers
• To deal with variable text length, either: zero-pad to the longest sequence, chunk long text, or concatenate everything together with a separator token to indicate context transitions (best)
• good for fine-tuning the encoder before adapting it to other tasks
• BERT's random masking: handled by the data collator (see the sketch at the end of this sub-list)
• 80% of the time: Replace the word with the [MASK] token, e.g., my dog is hairy -> my dog is [MASK]
• 10% of the time: Replace the word with a random word, e.g., my dog is hairy -> my dog is apple. This forces the model to generate proper contextual embeddings for all tokens in the sequence, not only the [MASK] ones, which is consistent with the goal of fine-tuning.
• 10% of the time: Keep the word unchanged, e.g., my dog is hairy -> my dog is hairy. The purpose of this is to bias the representation towards the actual observed word. Otherwise, the model would only look at the context to define the word, and never look at the word itself.
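A minimal PyTorch sketch of the 80/10/10 masking rule above (in practice Hugging Face's `DataCollatorForLanguageModeling` does this for you; the function name and probabilities here are just illustrative):

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    """Sketch of BERT-style masking: 80% [MASK], 10% random token, 10% unchanged."""
    labels = input_ids.clone()

    # Select ~15% of positions to predict; all other positions get label -100 (ignored by the loss).
    masked_indices = torch.bernoulli(torch.full(input_ids.shape, mlm_prob)).bool()
    labels[~masked_indices] = -100

    # 80% of the selected positions -> [MASK]
    replace_mask = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & masked_indices
    input_ids[replace_mask] = mask_token_id

    # Half of the remaining 20% (i.e. 10% overall) -> a random token from the vocabulary
    random_mask = torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool() & masked_indices & ~replace_mask
    input_ids[random_mask] = torch.randint(vocab_size, input_ids.shape)[random_mask]

    # The rest (10% overall) stay unchanged.
    return input_ids, labels
```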
• Translation: BLEU score (geometric average of the clipped 1- to 4-gram precisions, where each n-gram precision counts matching contiguous subsequences)

• BLEU Score: widely used and simple, but
• doesn't consider meaning
• doesn't consider sentence structure
• may not work for non-English
• hard to compare across different tokenizers
• SacreBLEU: has built-in internal tokenizers for consistent comparison (see the sketch at the end of this sub-list)
• Tokenizer: must use a different tokenizer (and padding rule) if the language is different
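A quick sketch of scoring translations with SacreBLEU (the example strings are made up):

```python
import sacrebleu  # pip install sacrebleu

hypotheses = ["the cat sat on the mat"]            # system outputs, one string per sentence
references = [["the cat is sitting on the mat"]]   # one reference stream (a list of lists)

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)  # corpus BLEU on a 0-100 scale, computed with SacreBLEU's internal tokenizer
```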
• Summarization: ROUGE score

• ROUGE-1 F1: the F1 score combining ROUGE-1 precision and recall of unigram matches
• ROUGE-1, ROUGE-2: 1-gram and 2-gram versions, recall only
• ROUGE-L: longest common subsequence (not necessarily contiguous) within a sentence; captures sentence structure better. Averaged over sentences.
• ROUGE-LSum: longest common subsequence over the whole summary (see the sketch at the end of this sub-list)
• as_target_tokenizer: used when the labels need a different tokenizer than the inputs
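A sketch of computing ROUGE with the `rouge_score` package (the example strings are made up):

```python
from rouge_score import rouge_scorer  # pip install rouge-score

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(
    target="the cat sat on the mat",               # reference summary
    prediction="the cat was sitting on the mat",   # generated summary
)
print(scores["rougeL"].fmeasure)  # each entry exposes precision, recall, and fmeasure
```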
• Causal language modeling (autoregressive modeling): Perplexity and CrossEntropy

• Question answering: given a context and a question, mark the answer span in the text. F1 score. Stride is the number of overlapping tokens between two truncated sections (a truncated section might contain only part of the answer; in that case we mark it as no answer).
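A sketch of the truncation-with-stride behaviour using a Hugging Face tokenizer (the checkpoint and texts are just examples):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

question = "Where do most penguins live?"
context = "Penguins are flightless birds. Most penguin species live in the Southern Hemisphere. " * 10

encoded = tokenizer(
    question,
    context,
    max_length=64,
    truncation="only_second",        # only the context may be truncated, never the question
    stride=16,                       # 16 overlapping tokens between consecutive chunks
    return_overflowing_tokens=True,  # keep every chunk, not just the first one
    return_offsets_mapping=True,     # character offsets, used to map the answer span back to the text
)
print(len(encoded["input_ids"]))     # number of chunks produced from this single example
```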

Transformers: traditionally, "transformer" means the transformer module introduced in the original "Attention is all you need" paper. However, since this architecture can be found in many models, we usually summarize those models by calling them "transformers".

Sequence to Sequence: a task, which can be something like language translation or a chat bot. Sequence to sequence is generally solved using an encoder and a decoder. The encoder encodes every word into a sequence of embeddings in parallel, where each word has context information associated with it. The decoder takes in the whole sequence of embeddings generated by the encoder, as well as its own sequence of answer tokens, to produce the next answer token.

• RNN Attention: instead of sending the encoded sequence to the decoder once, we send it for each next-token prediction.

Encoder-only Architecture: ALBERT, BERT, DistilBERT, ELECTRA, RoBERTa. Can be used to predict masked words or for sequence classification.

Decoder-only Architecture: CTRL, GPT, GPT-2, Transformer XL. Unlike the encoder, tokens to the right are masked (causal masking). Can be used for sequence generation. If a generated sequence exceeds the model's maximum context size, the model cannot remember the first words it generated.

Encoder-decoder: BART, T5, Marian, mBART.

Loss: loss propagates through time. For an RNN, if the model predicts the wrong token, the next decoder input has a 50% chance of being the ground-truth token instead of the predicted one (a form of teacher forcing).

Tokenizers: translate words to numbers, a step before feeding into the encoder. (Tokenizers are usually not trained using gradient descent, but rather with deterministic rules based on heuristics. They optimize for the best word splitting based on the data distribution.)

• types: different tokenization methods

• word based: split on spaces. Bad for unknown words (nobody actually uses it)
• character based: split into characters. Bad for Chinese and produces very long inputs (nobody actually uses it)
• subword: decompose "tokenization" into "token" and "##ization"
• WordPiece: used by BERT and DistilBERT
• Unigram: used by XLNet and ALBERT
• Byte-Pair Encoding: used by GPT-2, RoBERTa
• loading: we load two parts, the algorithm (the code that splits text into subwords and adds special tokens) and the vocabulary (data such as dictionary mappings that translate subword strings to numbers).
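A small sketch of loading a tokenizer and inspecting its subword splits and vocabulary lookups (the exact splits depend on the checkpoint's vocabulary):

```python
from transformers import AutoTokenizer

# Loads both parts: the splitting algorithm (WordPiece for BERT) and the vocabulary.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

tokens = tokenizer.tokenize("tokenization")
print(tokens)                                    # e.g. ['token', '##ization']
print(tokenizer.convert_tokens_to_ids(tokens))   # the vocabulary maps subword strings to numbers
```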

• sequence batch: since we want to train in batches, the model should support batched inference. If the lengths of the sequences in a batch don't match, we zero-pad them and give the padded part no attention (via the attention mask). (Also truncate if the input is longer than what the model expects.) Batches might have different shapes (which might be slower on TPU, but is faster on GPU).
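A sketch of batching with padding and the resulting attention mask (checkpoint chosen arbitrarily):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

batch = tokenizer(
    ["a short sentence", "a much longer sentence that needs quite a few more tokens"],
    padding=True,        # zero-pad every sequence to the longest one in the batch
    truncation=True,     # truncate anything longer than the model's maximum length
    return_tensors="pt",
)
print(batch["input_ids"].shape)
print(batch["attention_mask"])   # 0 over the padded positions -> no attention there
```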

• sentence pair: since many datasets are about properties of a pair of sentences, we often add a field to the training data (token type IDs) to indicate which tokens belong to the first sentence and which to the second.
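A sketch of encoding a sentence pair; `token_type_ids` marks which sentence each token belongs to (checkpoint chosen arbitrarily):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

enc = tokenizer("The cat sat on the mat.", "It was asleep.")
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))  # [CLS] ... [SEP] ... [SEP]
print(enc["token_type_ids"])   # 0 for tokens of the first sentence, 1 for the second
```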

Model:

• head: An additional component, usually made up of one or a few layers, that converts the transformer's hidden states into a task-specific output
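A hypothetical classification head as a minimal sketch (the hidden size and the use of the [CLS] position follow BERT-style models):

```python
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Maps the transformer's hidden states to task-specific class logits."""
    def __init__(self, hidden_size=768, num_labels=2, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, hidden_states):             # hidden_states: (batch, seq_len, hidden_size)
        cls_embedding = hidden_states[:, 0]        # hidden state at the [CLS] position
        return self.classifier(self.dropout(cls_embedding))
```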

Accelerate: HuggingFace's distributed training integrated with PyTorch

In order to better understand the role of [CLS], let's recall that the BERT model has been trained on 2 main tasks:

Masked language modeling: some random words are masked with the [MASK] token, and the model learns to predict those words during training. For that task we need the [MASK] token.

Next sentence prediction: given 2 sentences, the model learns to predict whether the 2nd sentence really follows the 1st sentence. For this task, we need another token, whose output tells us how likely it is that the current sentence is the next sentence after the 1st one. And here comes [CLS]. You can think about the output of [CLS] as a probability.

Now you may ask the question: can we, instead of using [CLS]'s output, just output a number (as a probability)? Yes, we could do that if the task of predicting the next sentence were a separate task. However, BERT has been trained on both tasks simultaneously. Organizing inputs and outputs in such a format (with both [MASK] and [CLS]) helps BERT learn both tasks at the same time and boosts its performance.

When it comes to classification tasks (e.g. sentiment classification), as mentioned in other answers, the output of [CLS] can be helpful because it contains BERT's understanding at the sentence level.

Answer by hoang tran on Stack Overflow.

Here're my understandings:

(1) [CLS] appears at the very beginning of each sentence; it has a fixed embedding and a fixed positional embedding, so this token contains no information by itself. (2) However, the output of [CLS] is computed from all other words in the sentence, so [CLS] contains all the information in the other words.

This makes [CLS] a good representation for sentence-level classification.
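A minimal sketch of pulling out the [CLS] hidden state as a sentence-level representation (checkpoint chosen arbitrarily):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("This movie was great!", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

cls_embedding = outputs.last_hidden_state[:, 0]  # hidden state at the [CLS] position
print(cls_embedding.shape)                        # (1, hidden_size): a sentence-level vector
```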

| Model | BPE | WordPiece | Unigram |
| --- | --- | --- | --- |
| Training | Starts from a small vocabulary and learns rules to merge tokens | Starts from a small vocabulary and learns rules to merge tokens | Starts from a large vocabulary and learns rules to remove tokens |
| Training step | Merges the tokens corresponding to the most common pair | Merges the tokens corresponding to the pair with the best score based on the frequency of the pair, privileging pairs where each individual token is less frequent | Removes all the tokens in the vocabulary that will minimize the loss computed on the whole corpus |
| Learns | Merge rules and a vocabulary | Just a vocabulary | A vocabulary with a score for each token |
| Encoding | Splits a word into characters and applies the merges learned during training | Finds the longest subword starting from the beginning that is in the vocabulary, then does the same for the rest of the word | Finds the most likely split into tokens, using the scores learned during training |

Traditional RNN: process one token at a time (see the sketch after this list)

• encoder

• input: current word vector, previous word embedding
• output: current word embedding
• decoder

• input: the last encoder word embedding, or the previously generated word embedding
• output: generated word vector, generated next word embedding
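A toy sketch of this loop with `nn.GRUCell` (all sizes and the `<bos>` id are made up, and a real decoder would also attend over the encoder outputs):

```python
import torch
import torch.nn as nn

emb_dim, hidden_dim, vocab_size = 32, 64, 1000
embed = nn.Embedding(vocab_size, emb_dim)
encoder_cell = nn.GRUCell(emb_dim, hidden_dim)
decoder_cell = nn.GRUCell(emb_dim, hidden_dim)
output_proj = nn.Linear(hidden_dim, vocab_size)

src = torch.randint(vocab_size, (1, 5))   # one source sentence of 5 token ids

# Encoder: consume one token at a time, carrying the hidden state forward.
h = torch.zeros(1, hidden_dim)
for t in range(src.size(1)):
    h = encoder_cell(embed(src[:, t]), h)

# Decoder: start from the encoder's final hidden state and feed back its own predictions.
prev_token = torch.tensor([0])             # assume id 0 is a <bos> token
for _ in range(3):
    h = decoder_cell(embed(prev_token), h)
    logits = output_proj(h)
    prev_token = logits.argmax(dim=-1)     # greedily pick the next token
```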

In a traditional RNN, the entire sequence corresponds to one backprop. However, in transformers, every token in the sequence corresponds to one backprop (with the assumption batch_size = 1). This also mitigates the vanishing gradient problem.

This video is very good in terms of explaining Transformers.

Transformers: feed x (the input) to the encoder, feed the masked y (ground truth) labels to the decoder

• Positional Encoding: add data to establish the ordering of the sequence, since we no longer use an RNN and instead process tokens in parallel. The positional encoding in the original paper is sinusoidal: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)), added to the token embeddings.
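A sketch of the sinusoidal positional encoding from the original paper:

```python
import math
import torch

def sinusoidal_positional_encoding(seq_len, d_model):
    """Fixed sinusoidal encoding that is added to the token embeddings."""
    position = torch.arange(seq_len).unsqueeze(1)                                     # (seq_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions
    return pe

print(sinusoidal_positional_encoding(seq_len=10, d_model=16).shape)   # (10, 16)
```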

• Masked Input: different from CV, we actually send the ground truth to the decoder, but mask out the tokens that we want the model to predict. This is so that we can train the transformer in parallel (without using what was predicted); see the sketch below.
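A sketch of the causal (look-ahead) mask that hides future positions from the decoder:

```python
import torch

seq_len = 5
# Position i may only attend to positions <= i; True marks positions to hide.
causal_mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()

# Applied by setting masked attention scores to -inf before the softmax.
scores = torch.randn(seq_len, seq_len)
scores = scores.masked_fill(causal_mask, float("-inf"))
print(causal_mask)
```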

• Attention: there are 3 inputs and 1 output. We dot-product the Key and Query vectors to select the matching keys (and apply softmax, which exponentiates the scores, to make a soft, differentiable selection); after the dot product and softmax, you get a distribution with a peak at the selected positions, which is used to take a weighted sum of the Values. (For the first token, we copy one token to all 3 inputs.) See the sketch after this list.

• Value: embeddings generated by encoder
• Key: "key" generated by encoder
• Query: "query" generated by decoder when looking at "what we generated so far"
• Multi-headed: Multiple attention heads in a single layer in a transformer is analogous to multiple kernels in a single layer in a CNN: they have the same architecture, and operate on the same feature-space, but since they are separate 'copies' with different sets of weights, they are hence 'free' to learn different functions.
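A minimal sketch of single-head scaled dot-product attention, matching the Query/Key/Value description above:

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(query, key, value):
    """softmax(Q K^T / sqrt(d_k)) V for a single head."""
    d_k = query.size(-1)
    scores = query @ key.transpose(-2, -1) / math.sqrt(d_k)  # similarity of each query to each key
    weights = F.softmax(scores, dim=-1)                       # soft, differentiable "selection"
    return weights @ value                                    # weighted sum of the values

q = torch.randn(1, 4, 64)   # (batch, query positions, d_k)
k = torch.randn(1, 6, 64)   # (batch, key positions, d_k)
v = torch.randn(1, 6, 64)
print(scaled_dot_product_attention(q, k, v).shape)   # (1, 4, 64)
```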

Here is how you would implement and train a transformer from scratch with PyTorch.
