This article is meant to be a summary of A Closer Look at Large Language Models' Emergent Abilities. However, the original article is too good and I ended up copy-pasting a lot of text. You should read the original article if available.
LM stands for Large Models, including large diffusion models and large language models.
So large language models are surprisingly good at:
complex reasoning
reasoning with knowledge
out-of-distribution robustness
Some demonstrations of these abilities:
Questions:
The reason why large models perform well is unclear; we ask why.
We also ask whether the same is possible for small models.
We ask whether "getting large" generalizes to models outside the NLP field.
Observe:
Wei et. al. 2022, Emergent Abilities of Large Language Models, shows emergent abilities that are not present in small language models:
LLM can do string concatenation
LLM can do 3-digit addition
The abilities above are not interesting to NLP researchers, since they can be solved by smaller specialized models or even classical algorithms.
Three types of abilities are interesting
Complex reasoning, where large models significantly outperform previous smaller models without the need for full-dataset training.
Reasoning with knowledge, where large models may not outperform previous smaller models, but do not require the additional source of knowledge (which can be expensive to obtain or hard to extract from unstructured data).
Out-of-distribution Robustness, where most previous fine-tuned models struggle. Here large models may still not outperform previous methods in the in-distribution setting, but they seem to be much better in the out-of-distribution setting.
GSM8K: a math word problem dataset proposed by OpenAI (Oct 2021). They fine-tuned the first version of GPT-3 with a verifier on the full training set and achieved an accuracy of about 35% (which is bad) on tasks that look like the following example:
Question:
Claire makes a 3 egg omelet every morning for breakfast. How many dozens of eggs will she eat in 4 weeks?
Claire makes a 3 egg omelet every morning.
There are 7 days in a week.
So in one week she will eat 3 * 7 = 21 eggs.
In 4 weeks she will eat 4 * 21 = 84 eggs.
There are 12 eggs in a dozen.
So 84 / 12 = 7.
The answer is 7.
Importantly, the authors observed that performance increases linearly as the model size increases exponentially. To reach 80% accuracy, we would therefore need a 17500B model, which would take years to train.
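As a back-of-the-envelope check: the two figures above (about 35% at 175B, 80% at 17500B) imply a log-linear fit of roughly 22.5 accuracy points per 10x increase in parameters. The slope below is derived from those two numbers only and is an illustrative assumption, not a reported value.

```python
# Back-of-the-envelope check of the log-linear extrapolation described above.
# Assumption: accuracy grows linearly in log10(parameter count), i.e. linear
# gains for exponential increases in model size.
import math

acc_at_175b = 35.0      # approximate accuracy of the fine-tuned 175B model
target_acc = 80.0       # desired accuracy
size_175b = 175e9       # parameters
size_17500b = 17500e9   # the extrapolated model size mentioned above

orders = math.log10(size_17500b / size_175b)   # 2 orders of magnitude
slope = (target_acc - acc_at_175b) / orders    # 22.5 points per 10x params
print(f"implied slope: {slope:.1f} accuracy points per 10x parameters")

predicted = acc_at_175b + slope * math.log10(size_17500b / size_175b)
print(f"predicted accuracy at 17500B: {predicted:.0f}%")  # -> 80%
```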
Later: we see the technology improve exponentially:
PaLM (540B, Jan 2022): 56.6% accuracy with only 8 chain-of-thought prompt examples
PaLM (540B, Mar 2022): 74.4% accuracy by majority voting
AI2 (175B, Nov 2022): 82.9% accuracy by complex chains of thought
We observe:
a model larger than 100B is required for CoT to perform better than traditional answer-only prompting
CoT outperforms all previous fine-tuning results while needing only 8 examples instead of the full training set
Chain of thought (CoT): make the target text show a chain of reasoning instead of just the final answer (see the sketch below):
it lets the model utilize more compute, since there is longer text to produce for the same question
it guides the reasoning process by behavioral scaffolding (laddering)
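To make the contrast concrete, here is a minimal sketch of an answer-only demonstration versus a chain-of-thought demonstration, reusing the GSM8K example above. The prompt wording and formatting are illustrative assumptions, not the exact prompts used in the cited papers.

```python
# Minimal sketch: answer-only vs. chain-of-thought few-shot demonstrations.
question = ("Claire makes a 3 egg omelet every morning for breakfast. "
            "How many dozens of eggs will she eat in 4 weeks?")

# Answer-only demonstration: the target text is just the final answer.
answer_only_demo = f"Q: {question}\nA: The answer is 7.\n"

# Chain-of-thought demonstration: the target text spells out the reasoning.
cot_demo = (
    f"Q: {question}\n"
    "A: Claire eats 3 eggs every morning. There are 7 days in a week, "
    "so she eats 3 * 7 = 21 eggs per week. In 4 weeks she eats 4 * 21 = 84 eggs. "
    "There are 12 eggs in a dozen, so 84 / 12 = 7. The answer is 7.\n"
)

# Few-shot CoT prompting concatenates ~8 such demonstrations, then appends
# the new question for the model to answer.
prompt = cot_demo + "Q: <new math word problem>\nA:"
print(prompt)
```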
In this setting, prompting large models does not necessarily outperform fine-tuning small models
small models: external corpora to retrieve from or use augmented data with multi-task learning
large models: rely solely on internal knowledge, no tuning
Storing knowledge for NLP is a fundamental problem. Usually, structured knowledge is hard to construct (because one needs to design the schema) but easy to reason with (because there are structures); unstructured knowledge is easy to construct (one just stores it) but hard to reason with (there are no structures ready to use). Language models, given this context, provide a way to simultaneously extract knowledge easily from unstructured text and reason upon that knowledge effectively, without the need for a predefined schema.
Prompting GPT-3 (the big model) does not outperform fine-tuned RoBERTa (the smaller model) in the in-distribution setting. But it outperforms RoBERTa in three out-of-distribution settings (domain shift, noise, and adversarial perturbation), meaning that it is more robust.
In 2020: "Kaplan et. al. 2020. Scaling Laws for Neural Language Models" and the original GPT-3 paper "Brown et. al. 2020. Language Models are Few-Shot Learners" discussed the log-linear scaling curve.
In 2021: "Cobbe et. al. 2021. Training Verifiers to Solve Math Word Problems" suggests that the log-linear scaling law also applies to fine-tuning.
Parameter-Efficient Adaptation: fine-tuning or transfer learning while freezing most parameters, updating only a small subset (possibly on a smaller dataset).
In-context Learning: adapting a general model to a specific task by putting examples in the prompt, without updating any parameters (see the sketch below).
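A minimal PyTorch sketch of the parameter-efficient idea: freeze the pre-trained backbone and update only a small task head. The model, dimensions, and layer names are placeholders for illustration, not any particular checkpoint; in-context learning, by contrast, would update no parameters at all.

```python
# Minimal sketch of parameter-efficient adaptation: train only a small head.
import torch
import torch.nn as nn

# Stand-in for a pre-trained backbone (illustrative, not a real checkpoint).
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True),
    num_layers=2,
)
task_head = nn.Linear(256, 2)  # small, newly initialized classifier

# Freeze every backbone parameter; only the head receives gradients.
for p in backbone.parameters():
    p.requires_grad = False

optimizer = torch.optim.AdamW(task_head.parameters(), lr=1e-3)

x = torch.randn(8, 16, 256)          # (batch, seq_len, hidden) dummy input
labels = torch.randint(0, 2, (8,))
features = backbone(x).mean(dim=1)   # pool over the sequence dimension
loss = nn.functional.cross_entropy(task_head(features), labels)
loss.backward()
optimizer.step()
```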
So during these years, the community could not afford to fine-tune large models, so they either fine-tuned or prompted small models: "if fine-tuning is better, we should put more effort into parameter-efficient tuning; if prompting is better, we should put more effort into training large language models."
Jan 2022: CoT comes out, which brings the phase change.
Now, without a fair comparison, most researchers believe that fine-tuning large models should give similar results to (if not worse than) prompting large models.
Hypothesis: Fine-tuning will improve in-distribution performance, but hurt out-of-distribution robustness. Prompting will be better in the distribution shift setting, but worse in the in-distribution setting.
However, Yang et. al. 2022 showed that for large models it is different: although BART-base fine-tuning decreases OOD performance, BART-large fine-tuning improves OOD performance!
There is no hard evidence which is better: fine-tuning or prompting.
Answer:
For chain-of-thought to be better than standard answer-only prompting, one needs the model to be at least 62B
For chain-of-thought to be better than fine-tuning small models (say T5-11B), one needs the model to be larger than 175B where the number 175B comes from GPT-3.
For all models smaller than 62B, direct prompting outperforms CoT.
Scale is a necessary but not sufficient factor.
There are large models (OPT, BLOOM) that do worse with CoT than with traditional prompting. Most models at the >540B scale can be improved with CoT, but not all.
There are only 2 publicly accessible models that show strong emergence: text-davinci-002 and code-davinci-002 (Codex), both GPT-3 variants. All other GPT-3 models (the original GPT-3, text-davinci-001, smaller GPT-3 models) cannot outperform either direct prompting or fine-tuned T5-11B when using CoT.
Strangely: code-davinci-002 (tuned on code) is consistently better than text-davinci-002 (tuned on language) on language tasks! ("Suzgun. et. al. 2022. Challenging Big-Bench tasks and whether chain-of-thought can solve them", "Fu et. al. 2022. Complexity-Based Prompting for Multi-Step Reasoning", "Madaan et. al. 2022. Language Models of Code are Few-Shot Commonsense Learners")
PaLM models, including PaLM, U-PaLM, Flan-PaLM, and Minerva, are not publicly available.
Source of emergence is unclear, but might be:
instruction tuning: GPT-3 text-davinci-002 is instruction-tuned with reinforcement learning ("Ouyang et. al. 2022. Training language models to follow instructions with human feedback"). Before that, text-davinci-001 could not do CoT well. It seems that PaLM was not initially instruction-tuned ("Chowdhery et. al. 2022. PaLM: Scaling Language Modeling with Pathways"), but Google later did instruction-tune it ("Chung. et. al. 2022. Scaling Instruction-Finetuned Language Models"), and the performance increased.
tuning on code: code-davinci-002 (Codex), tuned on code, is consistently better than text-davinci-002. PaLM is also tuned on code. Code superficially has little to do with language, and we don't know why it helps, but it seems that code is very helpful.
tuning on CoT: by the time text-davinci-002 was released, Google had already released PaLM for 3 months, so OpenAI should have had information about chain-of-thought. There are also works showing that directly tuning on CoT data ("Chung et. al. 2022. Scaling Instruction-Finetuned Language Models", "Huang et. al. 2022. Large Language Models Can Self-Improve") can enable a model's CoT ability.
Methods: LangChain
Conversation Buffer Memory: pass the exact previous messages into the next prompt
Conversation Summary Memory: summarize the previous conversation with another LLM call and pass the summary in
Conversation Entity Memory: ask the LLM to summarize each entity and store the summaries in a dictionary mapping from entity to summary; the LLM can then choose to retrieve such memory
Conversation Knowledge Graph: a generalization of Conversation Entity Memory (see the sketch below)
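A minimal sketch of how these memory classes are wired into a conversation chain, assuming the classic LangChain API (class names and import paths may differ in newer releases):

```python
# Minimal sketch using LangChain's conversation memory (classic API assumed).
from langchain.chains import ConversationChain
from langchain.llms import OpenAI
from langchain.memory import ConversationBufferMemory, ConversationSummaryMemory

llm = OpenAI(temperature=0)

# Buffer memory: the exact previous messages are re-inserted into the next prompt.
buffer_chain = ConversationChain(llm=llm, memory=ConversationBufferMemory())

# Summary memory: another LLM call compresses the history into a running summary.
summary_chain = ConversationChain(llm=llm, memory=ConversationSummaryMemory(llm=llm))

buffer_chain.predict(input="Hi, my name is Claire and I make a 3 egg omelet every morning.")
buffer_chain.predict(input="How many eggs do I use in a week?")
```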
Memory Reflection: a way to construct memory
reflect on message importance, and store messages together with an importance score
ask the LLM what questions one can ask about the current observation, then let the LLM answer those questions
Memory Retrieval: a way to select stored memory
time-weighted: more recent memories are more important
importance-weighted: look up the importance score stored by memory reflection
relevance-weighted: retrieve by embedding similarity, typically from a vector database (see the sketch below)
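A minimal sketch combining the three signals into one retrieval score. The weights, decay rate, and helper names are illustrative assumptions, not values from any specific implementation:

```python
# Minimal sketch of weighted memory retrieval: recency + importance + relevance.
import math
import time

def recency_score(stored_at: float, now: float, decay: float = 0.995) -> float:
    """Exponentially decay with the number of hours since the memory was stored."""
    hours = (now - stored_at) / 3600.0
    return decay ** hours

def cosine(a, b):
    """Relevance signal; a vector database would normally do this lookup."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(memories, query_embedding, top_k=3,
             w_recency=1.0, w_importance=1.0, w_relevance=1.0):
    """memories: dicts with keys 'text', 'embedding', 'importance', 'stored_at'."""
    now = time.time()
    scored = []
    for m in memories:
        score = (w_recency * recency_score(m["stored_at"], now)
                 + w_importance * m["importance"]        # from memory reflection
                 + w_relevance * cosine(query_embedding, m["embedding"]))
        scored.append((score, m["text"]))
    scored.sort(reverse=True)
    return [text for _, text in scored[:top_k]]
```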