LM

This article is meant to be a summary of A Closer Look at Large Language Models Emergent Abilities. However, the original article is so good that I ended up copy-pasting a lot of its text. You should read the original article if it is available.

LM stands for Large Models, including large diffusion models and large language models.

A Closer Look at Large Language Models Emergent Abilities

So large language models are surprisingly good:

Some ability demonstrations:

Questions:

The Emergent Abilities That Exist in Large Models but Not in Small Models

Wei et al. 2022. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. The x-axis is model scale; GSM8K is a primary-school-level math word problem dataset.

Employing chain of thought prompting enables language models to solve arithmetic reasoning problems for which standard prompting has a mostly flat scaling curve.

Observe:

  1. Performance does not increase much when the model size is small.
  2. Performance increases dramatically when the model size is big.

In Wei et al. 2022, Emergent Abilities of Large Language Models, the authors show emergent abilities that are not expected in small language models:

The abilities above are not that interesting to NLP researchers, since they can be solved by smaller specialized models or even classical algorithms.

Three Typical Examples of Emergent Abilities

Three types of abilities are interesting:

Complex Reasoning

GSM8K: a grade-school math word problem dataset released by OpenAI (Oct 2021). In that work, the authors fine-tuned the first version of GPT-3 with a verifier on the full training set and achieved only about 35% accuracy (which is bad) on tasks that look like the following example:

Question:
Claire makes a 3 egg omelet every morning for breakfast. How many dozens of eggs will she eat in 4 weeks?

Claire makes a 3 egg omelet every morning.
There are 7 days in a week.
So in one week she will eat 3 * 7 = 21 eggs.
In 4 weeks she will eat 4 * 21 = 84 eggs.
There are 12 eggs in a dozen.
So 84 / 12 = 7.

The answer is 7.

Importantly, the authors observed that performance increases only linearly as model size increases exponentially. Extrapolating this trend, reaching 80% accuracy would therefore require a roughly 17500B model, which would take years to train.
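
To spell out that extrapolation: if accuracy grows linearly in the logarithm of model size, we can invert the fit to estimate the size needed for a target accuracy. The constants a and b below are whatever values were fitted to the observed points, which are not reproduced here; this is only a sketch of the reasoning.

```latex
% Log-linear scaling: accuracy grows linearly in log(model size N),
% with a and b fitted to the observed (size, accuracy) points.
\mathrm{acc}(N) \approx a + b \log_{10} N
% Inverting for the size needed to reach a target accuracy, e.g. 80%:
N_{80\%} \approx 10^{(0.80 - a)/b}
```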

Later, we will see that the technology improved exponentially fast, far outpacing this extrapolation.

SOTA Methods

We observe:

  - A model larger than 100B is required for CoT to perform better than traditional answer-only prompting.
  - CoT outperforms all previous fine-tuning approaches while needing only 8 examples instead of the full training set.

Chain of thought (CoT): write the target text of each prompt exemplar so that it shows a chain of reasoning instead of just the final answer.
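
As a minimal sketch of the difference (not the exact prompts from the paper), the snippet below builds an answer-only prompt and a chain-of-thought prompt from the same exemplar; the resulting string would then be sent to whatever LLM you have access to.

```python
# Minimal sketch: answer-only prompting vs. chain-of-thought prompting.
# The exemplar reuses the omelet question from above; `new_question` is
# whatever problem you actually want the model to solve.

exemplar_q = ("Claire makes a 3 egg omelet every morning for breakfast. "
              "How many dozens of eggs will she eat in 4 weeks?")

exemplar_cot = ("Claire eats 3 eggs per day and there are 7 days in a week, "
                "so she eats 3 * 7 = 21 eggs per week. In 4 weeks she eats "
                "4 * 21 = 84 eggs. There are 12 eggs in a dozen, so 84 / 12 = 7. "
                "The answer is 7.")

def answer_only_prompt(new_question: str) -> str:
    # The exemplar shows only the final answer.
    return (f"Question: {exemplar_q}\nAnswer: 7\n\n"
            f"Question: {new_question}\nAnswer:")

def cot_prompt(new_question: str) -> str:
    # The exemplar spells out the reasoning steps before the final answer.
    return (f"Question: {exemplar_q}\nAnswer: {exemplar_cot}\n\n"
            f"Question: {new_question}\nAnswer:")

if __name__ == "__main__":
    q = ("A baker uses 2 eggs per cake. "
         "How many dozens of eggs does she need for 30 cakes?")
    print(cot_prompt(q))
```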

Reasoning with Knowledge

In this setting, prompting large models does not necessarily outperform fine-tuning small models.

Storing knowledge is a fundamental problem for NLP. Usually, structured knowledge is hard to construct (because one needs to design the schema) but easy to reason with (because there is structure), while unstructured knowledge is easy to construct (one just stores the text) but hard to reason with (there is no structure ready to use). Language models, given this context, provide a way to extract knowledge easily from unstructured text and, at the same time, to reason over that knowledge effectively without the need for a predefined schema.
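
A rough sketch of that idea, where complete() is a hypothetical stand-in for an LLM call rather than any real library function: the same model first extracts facts from free text (construction) and then answers a question over those facts (reasoning), with no schema designed in advance.

```python
# Sketch only: `complete` is a placeholder for an LLM call; wire it to
# whichever model API you actually use.
def complete(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM API here")

passage = ("Marie Curie was born in Warsaw. She moved to Paris to study "
           "physics and later won two Nobel Prizes.")

# Construction: pull facts out of unstructured text, no predefined schema.
def extract_facts(text: str) -> str:
    return complete(f"Text: {text}\nList the facts stated in the text, one per line:")

# Reasoning: answer a question by combining the extracted facts.
def answer(question: str) -> str:
    facts = extract_facts(passage)
    return complete(f"Facts:\n{facts}\n\nQuestion: {question}\nAnswer:")
```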

Construction and Reasoning

Out-of-distribution Robustness

Prompting GPT-3 (the big model) does not outperform the fine-tuned RoBERTa (a smaller model) in the in-distribution setting. But it outperforms RoBERTa in three out-of-distribution settings (domain shift, noisy input, and adversarial perturbation), meaning that it is more robust.

Complex prompts are better than simple prompts, even in the out-of-distribution settings.

Emergent Abilities Transcend the Scaling Law

Left: performance increases linearly as model size increases exponentially. Right: phase change, where performance suddenly increases.

In 2020: "Kaplan et al. 2020. Scaling Laws for Neural Language Models" and the original GPT-3 paper "Brown et al. 2020. Language Models are Few-Shot Learners" discussed the log-linear curve on the left.

In 2021: "Cobbe et al. 2021. Training Verifiers to Solve Math Word Problems" suggests that the log-linear scaling law also applies to fine-tuning.

Parameter-Efficient Adaptation: fine-tuning or transfer learning that freezes most of the parameters and updates only a small subset, often on a smaller dataset.
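
A minimal PyTorch sketch of the freezing idea (not any specific method such as adapters or LoRA): keep the pretrained backbone frozen and update only a small task head, so the number of trained parameters stays tiny.

```python
# Minimal sketch of parameter-efficient adaptation: freeze the pretrained
# backbone and train only a small task-specific head.
import torch
import torch.nn as nn

backbone = nn.Sequential(          # stand-in for a pretrained model
    nn.Linear(768, 768), nn.ReLU(),
    nn.Linear(768, 768), nn.ReLU(),
)
head = nn.Linear(768, 2)           # small trainable classification head

for p in backbone.parameters():    # freeze every backbone parameter
    p.requires_grad = False

optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# One toy training step on random data, just to show the update path.
x = torch.randn(8, 768)
y = torch.randint(0, 2, (8,))
with torch.no_grad():              # backbone is frozen, no gradients needed
    features = backbone(x)
loss = loss_fn(head(features), y)
loss.backward()
optimizer.step()
optimizer.zero_grad()
```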

In-context Learning: adapting a general model to a specific task or application domain by placing a few examples in the prompt, without updating any parameters.
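
In contrast to fine-tuning, a minimal sketch of in-context learning: the task is specified entirely inside the prompt and no parameter is ever updated. The example task and labels here are made up for illustration.

```python
# In-context learning sketch: the "training data" lives inside the prompt;
# the model's weights are never touched. The examples are made up.
few_shot_examples = [
    ("The movie was fantastic.", "positive"),
    ("I will never buy this again.", "negative"),
]

def build_icl_prompt(new_input: str) -> str:
    lines = [f"Review: {text}\nSentiment: {label}\n" for text, label in few_shot_examples]
    lines.append(f"Review: {new_input}\nSentiment:")
    return "\n".join(lines)

print(build_icl_prompt("The plot was dull and far too long."))
```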

So during these years the community could not afford to fine-tune large models, and the debate was between fine-tuning small models and prompting large ones: "if fine-tuning is better, we should put more effort into parameter-efficient tuning; if prompting is better, we should put more effort into training large language models."

Jan 2022: CoT comes out, which brings us the phase change.

What Does the Paradigm Shift Mean?

Paradigm Shift: we no longer need supervised learning and inefficient fine-tuning; prompting a large model is good enough.

Now, even without a fair comparison, most researchers believe that fine-tuning large models gives results similar to (if not worse than) prompting large models.

Hypothesis: Fine-tuning will improve in-distribution performance, but hurt out-of-distribution robustness. Prompting will be better in the distribution shift setting, but worse in the in-distribution setting.

However, Yang et al. 2022 showed that for larger models it is different: although fine-tuning BART-base decreases OOD performance, fine-tuning BART-large improves OOD performance!

There is no hard evidence on which is better: fine-tuning or prompting.

How Large Should the Model Be?

Answer:

For all models smaller than 62B, direct prompting outperforms CoT.

Is Scale the Only Factor?

Scale is a necessary but not sufficient factor.

There are large models (OPT, BLOOM) that perform worse with CoT than with traditional prompting. Most models of >540B size can be improved with CoT, but not all.

There are only 2 publicly accessible models that show strong emergence: text-davinci-002 and code-davinci-002 (Codex), both of which are GPT-3 variants.

The source of emergence is unclear, but it might be:

Prompt Engineering

Memory in LLM

Methods: LangChain

Memory Reflection: a way to construct memory

Memory Retrieval: a way to select stored memory
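
A rough sketch of retrieval (not LangChain's actual interface): store memories, then rank them against the current query and return the top matches. The bag-of-words embedding below is a toy stand-in for a real embedding model.

```python
# Sketch of memory retrieval: rank stored memories by similarity to the query.
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real system would use an embedding model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class MemoryStore:
    def __init__(self):
        self.memories: list[str] = []

    def add(self, text: str) -> None:
        # Store a new memory.
        self.memories.append(text)

    def retrieve(self, query: str, k: int = 3) -> list[str]:
        # Select the k stored memories most relevant to the query.
        q = embed(query)
        ranked = sorted(self.memories, key=lambda m: cosine(q, embed(m)), reverse=True)
        return ranked[:k]

store = MemoryStore()
store.add("The user prefers concise answers.")
store.add("The user is working on a PyTorch project.")
store.add("The user's deadline is next Friday.")
print(store.retrieve("What is the user working on?", k=1))
```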
