Read Illustrated Transformer. You can watch the Video, but it is quite long (therefore not recommended).
Common NLP Tasks:
Token classification: classify each word into classes (remember token != word)
Masked language modeling: fill in the blank. A data `collator` randomly replaces tokens with the `[MASK]` token, e.g., my dog is hairy -> my dog is `[MASK]`, and the model learns to predict only the masked ones. This is consistent with the goal of finetuning. (See the collator sketch after this list.)
Translation: BLEU score (geometric average of the clipped 1- to 4-gram precisions, where each n-gram precision counts matching contiguous subsequences)
Summarization: ROUGE score
`as_target_tokenizer`: used when the label needs a different tokenizer than the input
Causal language modeling (autoregressive modeling): Perplexity and CrossEntropy
Question answering: given a context and a question, mark the answer span in the text. Evaluated with F1 score. `stride` is the number of overlapping tokens between 2 truncated sections (a truncated section might contain only a partial answer; in that case we mark it as no answer).
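As a concrete illustration of the masked-LM collator mentioned above, here is a minimal sketch using HuggingFace's `DataCollatorForLanguageModeling`; the checkpoint name and masking probability are just example choices, not something prescribed by these notes.

```python
# Minimal sketch: random [MASK] replacement via a data collator.
# Assumes the `transformers` library and the bert-base-uncased checkpoint.
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,               # masked language modeling
    mlm_probability=0.15,   # fraction of tokens to mask (example value)
)

# Tokenize a sentence and let the collator randomly replace tokens with [MASK].
features = [tokenizer("my dog is hairy")]
batch = collator(features)

print(tokenizer.decode(batch["input_ids"][0]))  # e.g. "[CLS] my dog is [MASK] [SEP]"
print(batch["labels"][0])                       # -100 everywhere except masked positions
```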
Transformers: traditionally, "transformer" means the transformer module introduced in the original "Attention Is All You Need" paper. However, since this architecture can be found in many models, we usually summarize those models by calling them "transformers".
Sequence to Sequence: a task, can be something like language translation or a chatbot. Sequence to sequence is generally solved using an encoder and a decoder. The encoder encodes every word into a sequence of embeddings in parallel, where each word has context information associated with it. The decoder takes in the whole sequence of embeddings generated by the encoder as well as its own sequence of answer tokens to produce the next answer token.
Encoder-only Architecture: ALBERT, BERT, DistilBERT, ELECTRA, RoBERTa. Trained on predicting masked words without a causation assumption (bidirectional attention).
Decoder-only Architecture: CTRL, GPT, GPT-2, Transformer XL. Compared to the encoder, words to the right are masked (causal masking). Can be used for sequence generation. If a generated sequence exceeds the model's maximum context size, the model cannot remember the first words it generated.
Encoder-decoder: BART, T5, Marian, mBART.
Loss: loss propagates through time. For an RNN, if the model predicts the wrong token, the next decoder input will have a 50% chance of being the ground truth instead of the predicted token.
Tokenizers: translate words to numbers, a step before feeding into the encoder. (Tokenizers are usually not trained using gradient descent, but rather with deterministic rules based on heuristics. They optimize for the best word splitting based on the data distribution.)
types: different tokenization methods, e.g., subword tokenization splits `tokenization` into `token` and `##ization`
loading: we load two parts, the algorithm (stores the code that splits words into subwords and adds special tokens) and the vocabulary (stores data like dictionary mappings that translate subword strings to numbers).
sequence batch: since we want to train in batches, the model should support batched inference. If the lengths of the sequences in a batch don't match, we zero-pad them and mark the padded part with no attention via the attention mask (and also truncate/chunk the input if it is longer than what the model expects). Different batches might have different shapes (this might be slower on TPU, but faster on GPU). See the sketch after this list.
sentence pair: since many datasets are about properties of a pair of sentences, we often add a field to the training data to indicate which sentence is the first and which is the second (in an ordered manner).
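A minimal sketch tying the points above together (batched padding with an attention mask, and sentence pairs with a segment field); the checkpoint name is only an example.

```python
# Minimal sketch: padding, attention masks, and sentence pairs with a tokenizer.
# Assumes the `transformers` library and the bert-base-uncased checkpoint.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Batched encoding: shorter sequences are zero-padded, and the attention mask
# marks the padded positions with 0 so they receive no attention.
batch = tokenizer(
    ["a short sentence", "a noticeably longer sentence that needs more tokens"],
    padding=True,
    truncation=True,       # chunk/truncate inputs longer than the model's limit
    return_tensors="pt",
)
print(batch["input_ids"].shape, batch["attention_mask"])

# Sentence pair: token_type_ids marks which tokens belong to the 1st vs 2nd sentence.
pair = tokenizer("first sentence", "second sentence")
print(pair["token_type_ids"])
```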
Model:
Accelerate: HuggingFace's distributed training integrated with PyTorch
In order to better understand the role of [CLS] let's recall that BERT model has been trained on 2 main tasks:
Masked language modeling: some random words are masked with [MASK] token, the model learns to predict those words during training. For that task we need the [MASK] token.
Next sentence prediction: given 2 sentences, the model learns to predict if the 2nd sentence is the real sentence, which follows the 1st sentence. For this task, we need another token, output of which will tell us how likely the current sentence is the next sentence of the 1st sentence. And here comes the [CLS]. You can think about the output of [CLS] as a probability.
Now you may ask the question: can we instead of using [CLS]'s output just outputting a number (as probability)? Yes, we can do that if the task of predicting next sentence is a separate task. However, BERT has been trained on both tasks simultaneously. Organizing inputs and outputs in such a format (with both [MASK] and [CLS]) will help BERT to learn both tasks at the same time and boost its performance.
When it comes to classification task (e.g. sentiment classification), as mentioned in other answers, the output of [CLS] can be helpful because it contains BERT's understanding at the sentence-level.
Answer by hoang tran on stackoverflow.
Here're my understandings:
(1)[CLS] appears at the very beginning of each sentence, it has a fixed embedding and a fix positional embedding, thus this token contains no information itself. (2)However, the output of [CLS] is inferred by all other words in this sentence, so [CLS] contains all information in other words.
This makes [CLS] a good representation for sentence-level classification.
Answer by BigMoyan on stackoverflow.
| Model | BPE | WordPiece | Unigram |
|---|---|---|---|
| Training | Starts from a small vocabulary and learns rules to merge tokens | Starts from a small vocabulary and learns rules to merge tokens | Starts from a large vocabulary and learns rules to remove tokens |
| Training step | Merges the tokens corresponding to the most common pair | Merges the tokens corresponding to the pair with the best score based on the frequency of the pair, privileging pairs where each individual token is less frequent | Removes all the tokens in the vocabulary that will minimize the loss computed on the whole corpus |
| Learns | Merge rules and a vocabulary | Just a vocabulary | A vocabulary with a score for each token |
| Encoding | Splits a word into characters and applies the merges learned during training | Finds the longest subword starting from the beginning that is in the vocabulary, then does the same for the rest of the word | Finds the most likely split into tokens, using the scores learned during training |
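To make the "merge the most common pair" training step concrete, here is a toy BPE sketch in plain Python; it is an illustration under a made-up 5-word corpus, not the actual `tokenizers` library implementation.

```python
# Toy BPE training sketch: repeatedly merge the most frequent adjacent pair.
from collections import Counter

corpus = ["low", "lower", "lowest", "newer", "wider"]
words = [list(w) for w in corpus]   # start from characters (small initial vocabulary)
merges = []

for _ in range(5):                  # learn 5 merge rules
    pairs = Counter()
    for w in words:
        for a, b in zip(w, w[1:]):
            pairs[(a, b)] += 1
    if not pairs:
        break
    best = pairs.most_common(1)[0][0]          # most common adjacent pair
    merges.append(best)
    merged = []
    for w in words:
        out, i = [], 0
        while i < len(w):
            if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                out.append(w[i] + w[i + 1])    # apply the merge
                i += 2
            else:
                out.append(w[i])
                i += 1
        merged.append(out)
    words = merged

print(merges)   # learned merge rules, e.g. [('l', 'o'), ('lo', 'w'), ...]
print(words)    # corpus re-segmented with the learned merges
```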
Traditional RNN: processes one token at a time, in both the `encoder` and the `decoder`.
In a traditional RNN, the entire sequence corresponds to one backprop. However, in transformers, every token in the sequence corresponds to one backprop (with the assumption `batch_size = 1`). This also mitigates the vanishing gradient issue.
This video is very good in terms of explaining Transformers.
Transformers: feed x (the input) to the encoder, feed the masked y (the ground-truth labels) to the decoder.
Positional Encoding: add data to establish the ordering of the sequence, since we no longer use an RNN and instead process tokens in parallel. The positional encoding in the original paper is sinusoidal: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)).
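A small PyTorch sketch of that sinusoidal encoding (shapes chosen arbitrarily for illustration):

```python
# Sinusoidal positional encoding from "Attention Is All You Need":
# PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
# PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # (max_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)            # even dimensions
    angle = pos / torch.pow(10000.0, i / d_model)                   # (max_len, d_model/2)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angle)
    pe[:, 1::2] = torch.cos(angle)
    return pe  # added element-wise to the token embeddings

pe = sinusoidal_positional_encoding(max_len=128, d_model=512)
print(pe.shape)  # torch.Size([128, 512])
```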
Masked Input: different from CV, we actually send the ground truth to the decoder, but mask out the tokens that we want the model to predict. This is so that we can train the transformer in parallel (without using what was predicted).
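A small sketch of the causal (look-ahead) mask that hides the future ground-truth tokens from the decoder during parallel training:

```python
# Causal mask: position t may attend only to positions <= t, so the decoder
# can be trained on the full ground-truth sequence in parallel without
# "seeing" the tokens it is supposed to predict.
# 0 = allowed, -inf = blocked (added to the attention scores before softmax).
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    return torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)

print(causal_mask(4))
# tensor([[0., -inf, -inf, -inf],
#         [0., 0., -inf, -inf],
#         [0., 0., 0., -inf],
#         [0., 0., 0., 0.]])
```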
Attention: there are 3 inputs (Query, Key, Value) and 1 output. We dot product the `Key` and `Query` vectors to select the matching key (and `softmax` exponentiates to make a pseudo-differentiable selection); after the dot product and `softmax`, you get a distribution with a peak at the selected area. (For the first token, we copy one token to all 3 inputs.)
Multi-headed: Multiple attention heads in a single layer in a transformer is analogous to multiple kernels in a single layer in a CNN: they have the same architecture, and operate on the same feature-space, but since they are separate 'copies' with different sets of weights, they are hence 'free' to learn different functions.
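And a sketch of the multi-head version: each head is a separate "copy" with its own projection weights, like multiple kernels in a CNN layer. It reuses the `scaled_dot_product_attention` sketch above; sizes are illustrative.

```python
# Multi-head attention sketch: h independent heads, each with its own Q/K/V
# projections, run in parallel and concatenated (like h kernels in a CNN layer).
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads, self.d_head = num_heads, d_model // num_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x, mask=None):
        b, t, _ = x.shape
        def split(y):  # (b, t, d_model) -> (b, heads, t, d_head)
            return y.view(b, t, self.num_heads, self.d_head).transpose(1, 2)
        q, k, v = split(self.q_proj(x)), split(self.k_proj(x)), split(self.v_proj(x))
        out, _ = scaled_dot_product_attention(q, k, v, mask)  # from the sketch above
        out = out.transpose(1, 2).reshape(b, t, -1)           # concatenate the heads
        return self.out_proj(out)

mha = MultiHeadAttention(d_model=64, num_heads=8)
print(mha(torch.randn(1, 5, 64)).shape)  # torch.Size([1, 5, 64])
```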
Here is how you would implement and train a transformer from scratch with PyTorch.
BERT is an encoder-only transformer architecture, meaning it does `[MASK]` prediction with no causation assumption (bidirectional attention, no causal masking).
Goal: By predicting `[MASK]`, we can ensure that the embedding at the `[MASK]` position (right before the prediction head) carries the meaning of the `idx([MASK])`-th input word.
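A quick way to see this `[MASK]` prediction in action, sketched with the transformers fill-mask pipeline; the checkpoint name is just an example.

```python
# Sketch: BERT's [MASK] prediction via the fill-mask pipeline.
# Assumes the `transformers` library and the bert-base-uncased checkpoint.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
for pred in unmasker("my dog is [MASK].", top_k=3):
    print(pred["token_str"], round(pred["score"], 3))
```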
`[CLS]` token: According to Here, although the original paper suggests that the `[CLS]` embedding should be used for extracting sentence meanings, the original author noted "I'm not sure what these vectors are, since BERT does not generate meaningful sentence vectors. It seems that this is doing average pooling over the word tokens to get a sentence vector, but we never suggested that this will generate meaningful sentence representations". So `[CLS]` should only be used for fine-tuning sentence-level tasks.
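For reference, pulling the raw `[CLS]` hidden state (the vector a classification head is typically fine-tuned on) looks like this sketch; the checkpoint name is only an example.

```python
# Sketch: extract the [CLS] hidden state used for sentence-level fine-tuning.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("this movie was great", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

cls_embedding = outputs.last_hidden_state[:, 0]   # position 0 is [CLS]
print(cls_embedding.shape)                        # torch.Size([1, 768])
```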