


Procedural: (large variants on prompt number expected)

These 3 have different goals.

reinforcement fine-tuning enables:
reasoning model
function calling / tool use
multi-agent parallel collaboration


Other Memory Related Stuff:
user level skill / MCP (Model Context Protocol)
retrieval augmented generation (RAG)
Examples of Open Questions in Post Training:
Pretraining data contamination (with tool calling) in Qwen2.5 makes even random rewards in RL increase performance
Clipping in RL suppresses low-probability tokens, resulting in less exploration
Grouped-query attention saves a large amount of inference memory (3x reduction)
The delta learning hypothesis: DPO preference (helpfulness / instruction following / truthfulness / honesty) is saturating in terms of binary "chosen/rejected", but the distance isn't.
RL skipping DPO tends to be worse
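On the grouped-query attention item above: the saving comes from the KV cache, whose size scales with the number of key/value heads rather than the number of query heads. A minimal sketch of the arithmetic, assuming a hypothetical configuration (32 layers, 24 query heads, 8 KV heads, head dimension 128, fp16). None of these numbers come from a specific model; they are chosen so that the ratio matches the 3x figure in the notes.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """KV cache size: one key and one value vector per token, per KV head, per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_value


# Hypothetical configuration, for illustration only.
NUM_LAYERS = 32
NUM_QUERY_HEADS = 24   # multi-head attention keeps one KV head per query head
NUM_KV_HEADS = 8       # grouped-query attention shares each KV head across 3 query heads
HEAD_DIM = 128
SEQ_LEN = 8192

mha_bytes = kv_cache_bytes(NUM_LAYERS, NUM_QUERY_HEADS, HEAD_DIM, SEQ_LEN)
gqa_bytes = kv_cache_bytes(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, SEQ_LEN)
reduction = mha_bytes / gqa_bytes

print(f"MHA: {mha_bytes / 2**30:.2f} GiB, GQA: {gqa_bytes / 2**30:.2f} GiB, "
      f"reduction: {reduction:.1f}x")
```

The reduction factor is simply the ratio of query heads to KV heads, so the actual number depends on the model's head grouping.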
Other small trick bundle (https://youtu.be/uaZ3yRdYg8A?t=1614):
Zero gradient signal filtering
Active sampling (consistent batch size after zero gradient filtering, Yu et al. 2025)
Token-level Loss (normalize loss by total number of tokens across the batch, rather than per-sample, Yu et al. 2025)
Remove KL Loss (allows less-restricted policy updates)
Clip Higher (upper clip range slightly wider than the lower clip range, Yu et al. 2025)
Truncated importance sampling (Yao et al. 2025)
No standard deviation normalization (when calculating advantage, Liu et al. 2025)
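Several of the tricks in this bundle are small enough to sketch in plain Python. The block below is an illustrative sketch only, not the implementation from the cited papers: the function names are hypothetical, the clip values 0.2 and 0.28 are illustrative, and rewards and losses are plain floats rather than tensors.

```python
from statistics import mean


def group_advantages(rewards):
    """Advantage = reward minus the group mean, with no division by the standard deviation."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


def drop_zero_gradient_groups(groups):
    """Zero gradient filtering: identical rewards in a group give all-zero advantages."""
    return [g for g in groups if len(set(g)) > 1]


def active_sample(draw_group, batch_size, max_draws=100):
    """Active sampling: keep drawing groups until the batch is full of informative ones."""
    batch = []
    for _ in range(max_draws):
        group = draw_group()
        if len(set(group)) > 1:
            batch.append(group)
        if len(batch) == batch_size:
            break
    return batch


def clipped_objective(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """PPO-style clipped surrogate with a wider upper range ("clip higher")."""
    clipped_ratio = max(1.0 - eps_low, min(ratio, 1.0 + eps_high))
    return min(ratio * advantage, clipped_ratio * advantage)


def sample_level_loss(per_token_losses):
    """Average within each sample first, then across samples."""
    return mean([mean(sample) for sample in per_token_losses])


def token_level_loss(per_token_losses):
    """Sum over every token in the batch, then divide by the total token count."""
    total_tokens = sum(len(sample) for sample in per_token_losses)
    return sum(sum(sample) for sample in per_token_losses) / total_tokens


# Three prompt groups of four rollouts each; only the middle one carries signal.
groups = [[1.0, 1.0, 1.0, 1.0], [0.0, 1.0, 0.0, 1.0], [0.0, 0.0, 0.0, 0.0]]
kept = drop_zero_gradient_groups(groups)
advantages = group_advantages(kept[0])

# Refill the batch to a fixed size by drawing more groups (here from a fixed pool).
pool = iter(groups * 2)
batch = active_sample(lambda: next(pool), batch_size=2)

# A short sample and a long sample: the two normalizations weight them differently.
losses = [[1.0, 1.0], [4.0, 4.0, 4.0, 4.0, 4.0, 4.0]]
print(sample_level_loss(losses), token_level_loss(losses))
```

With token-level normalization the long sample contributes in proportion to its length, whereas sample-level normalization gives both samples equal weight regardless of length.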

Automated Benchmarks: MMLU, OpenLLM Leaderboard (hard to evaluate complex task, data contamination)
Human Judge: vibe check, Chatbot Arena, Data Annotation (costly, biased, not scalable)
Judge LLM: LLM-as-judge, reward model, classifier (hidden bias)
Building a strong (non-leaking, secret) evaluation is key to successful RL fine-tuning. Fact: LLM companies keep their benchmarks private because public performance benchmarks are easily hacked by dataset companies without providing any actual performance increase.

Planning: Long horizon, autonomous model with planning capability.
Multimodal:
Next Token Prediction: pattern matching on text space
World Model: pattern matching on image space
Next Edit Prediction: pattern matching on action space (action in robots, cursor, command, requests, ...)
From a UI/UX perspective, next edit prediction is based only on ambient conditions (cursor position, current state of the program) instead of explicit conditions (e.g. text or images that the user explicitly provides).
The goal is:
fast: provide suggestion before user can provide explicit condition
less accurate: trade accuracy for speed
transparent: "Good design is obvious, great design is transparent"
contain non-edit: "don't edit" is also a valid suggestion
Next Edit Prediction can be general:
code: the next code you are going to type
blender: the next Blender command your UI is going to run
krita: the next brush stroke you are going to make
premiere: the next video trim you are going to do
We don't consider robots and browsers for now because they have different challenges (e.g. they are more goal-oriented and require longer planning)
Is NEP future proof? Assumptions:
NEP always has faster inference time than explicit condition prediction
providing explicit condition to "narrow down the distribution" is more expensive to user than "just sample a candidate from the distribution" (i.e. either "distribution is simple at the beginning" or "explicit condition can't narrow down the distribution much")
It has a different niche than explicit conditioned prediction. (Small model for specialized task)
All learning-based methods can be reduced to:
data
compute
algorithm: "create non-existing data" or "make compute faster"
Since we should "finish" before "making it perfect", we would focus on data
Data comes from 3 sources:
actual data (webscraping, human annotation, synthetic data)
algorithm bias (model architecture / representation, regularization, data processing)
model weights (transfer learning, pretraining)
Most importantly, they are equally important in practice. However, most of the literature focuses only on "model architecture / representation".
"The model behaved overly cautiously—reluctant to touch unfinished code, hesitant to suggest changes to the line a user was typing, and often chose to do nothing. In practice, it performed worse than a vanilla LLM." (https://github.blog/ai-and-ml/github-copilot/evolving-github-copilots-next-edit-suggestions-through-custom-model-training/)
Pull request data wasn’t enough for three reasons:
No Timing: Lacks temporal ordering, so the model can’t learn when changes happen
No "Don't Edit": Contains almost no negative samples (cases where the correct action is “don’t edit”)
Only Perfect Code: Misses abandoned edits, in-progress rewrites, and other common editing behavior
So they used data from code editing sessions (secret)
So they didn't change the model architecture; they changed the data and did supervised fine-tuning from the base model.
Limitations:
still can't teach model what is a bad edit
cannot utilize "unlabeled code samples" (which I guess refers to diffs)
I guess they used some synthetic data derived from "diff" data
grader design:
We use a large reasoning model with specific grading criteria.
We routinely analyze model outputs to update the grading criteria, constantly searching for new qualities that indicate unhelpful edits. (prompting?)
The grader should not only consider the correctness of the edit suggestion, but also strive to make the code diff displayed in the UI easier to read (e.g. if there are two possible diffs, prefer the more semantic one)
I guess they used unlabeled data to train the grader. This eliminates distribution drift from instruction tuning.
Inference technique:
cached tokens
Used LLM-based graders to filter out ambiguous or low-signal suggestions
Limitations:
Solution:
multi-model approach: train a separate "location model" to predict where to jump next (or no jump)
original NES model then generates the edit suggestion
repeat
carefully balance jump vs no-jump in the dataset distribution (they didn't get this right initially)
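The multi-model loop above can be sketched as follows. This is a minimal illustration under stated assumptions: `locate_next` and `propose_edit` are hypothetical stand-ins for the location model and the NES model (here simple string rules), and the real system's interfaces are not described in the notes.

```python
def run_nes_loop(lines, cursor, locate_next, propose_edit, max_steps=10):
    """Alternate location model and edit model until the location model says "no jump"."""
    applied = []
    for _ in range(max_steps):
        target = locate_next(lines, cursor)      # line index, or None for "no jump"
        if target is None:
            break
        new_text = propose_edit(lines, target)   # replacement text, or None for "don't edit"
        if new_text is not None:
            lines[target] = new_text
            applied.append(target)
        cursor = target
    return applied


# Hypothetical stand-ins for the two models: finish a rename of old_name to new_name.
def locate_next(lines, cursor):
    for index, line in enumerate(lines):
        if "old_name" in line:
            return index
    return None


def propose_edit(lines, target):
    return lines[target].replace("old_name", "new_name")


document = ["old_name = 1", "print(old_name)", "x = 2", "return old_name"]
edited_lines = run_nes_loop(document, cursor=0,
                            locate_next=locate_next, propose_edit=propose_edit)
print(edited_lines, document)
```

Keeping the two models separate means the "no jump" decision and the "don't edit" decision are made independently, which is why the jump vs no-jump balance in the training data matters.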
Limitations:
Solution:
It is unclear why they didn't have certain instruction.
Maybe they found experimentally that it would break pre-training knowledge, since it would break code patterns and mess up the context (or it was just to leave room for a bigger context).
I guess they used RL to treat "TAB completion steps" as "reasoning steps", so that it can use "diff" as an "environment reward signal" to learn "what is a good edit" without explicit labeling.
Data Filter
Multiple Edit Chunks: A commit must contain at least two edit chunks. This ensures there is a history of at least one edit to serve as the resource for a subsequent edit.
Bounded Chunk Length: Each edit chunk must not exceed five lines. Overly long, complex changes fall outside this scope and are considered to be different types of interaction.
Limited Edit Scope: The total distance between the first edit chunk and the last line of the last edit chunk within a commit must not exceed 80 lines. This prevents the context from becoming excessively large and unfocused.
Additive Edits Only: To simplify the initial task formulation, we select only commits that consist exclusively of additive edits, excluding those with deletions.
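The four filter rules above can be expressed as a single predicate over a commit's edit chunks. A minimal sketch, assuming a hypothetical `EditChunk` record (the field names are illustrative) and assuming "distance" means last line of the last chunk minus first line of the first chunk:

```python
from dataclasses import dataclass


@dataclass
class EditChunk:
    start_line: int      # first line touched by the chunk (1-indexed)
    added_lines: int     # number of lines added
    deleted_lines: int   # number of lines deleted

    @property
    def end_line(self):
        # Last line of the chunk after the edit is applied.
        return self.start_line + max(self.added_lines, 1) - 1


def keep_commit(chunks, max_chunk_lines=5, max_span_lines=80):
    """Return True if the commit passes all four data filters."""
    if len(chunks) < 2:                                        # Multiple Edit Chunks
        return False
    if any(c.added_lines > max_chunk_lines for c in chunks):   # Bounded Chunk Length
        return False
    if any(c.deleted_lines > 0 for c in chunks):               # Additive Edits Only
        return False
    first_line = min(c.start_line for c in chunks)
    last_line = max(c.end_line for c in chunks)
    return last_line - first_line <= max_span_lines            # Limited Edit Scope


good = [EditChunk(10, 2, 0), EditChunk(40, 3, 0)]
too_far = [EditChunk(10, 2, 0), EditChunk(200, 1, 0)]
with_deletion = [EditChunk(10, 2, 1), EditChunk(40, 3, 0)]
too_long = [EditChunk(10, 6, 0), EditChunk(40, 3, 0)]

print(keep_commit(good), keep_commit(too_far),
      keep_commit(with_deletion), keep_commit(too_long))
```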
Data Labeling
Trigger:
Input:
prefix: all code before cursor
suffix: some code after cursor
metadata: language, repo name, file name, indentation
Output:
Post Processing:
It doesn't use full completion output, instead:
It chunks the output at AST boundaries (e.g. {}), double newlines (empty lines), and a hard token limit (e.g. 100 tokens)
Very complicated post processing.
Document line: "foo(x, |y, z)" cursor at | (position.character = 7)
restOfLine = "y, z)"
LLM completion: "y=10, z=20)"
Greedy match:
'y' → indexOf('y', 0) = 0 suffixLength=1, lastIndex=0
',' → indexOf(',', 1) = 4 suffixLength=2, lastIndex=4
' ' → indexOf(' ', 5) = 5 suffixLength=3, lastIndex=5
'z' → indexOf('z', 6) = 6 suffixLength=4, lastIndex=6
')' → indexOf(')', 7) = 10 suffixLength=5, lastIndex=10
-> return 5 (= all of restOfLine)
Effect: VS Code replaces "y, z)" entirely, showing:
foo(x, y=10, z=20)
Abstractly, it separates "edit" output from "complete" output by splitting at AST boundaries.
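The greedy match in the worked example can be sketched as a small function. This is a reconstruction from the trace in these notes, not the actual VS Code source: the function name is hypothetical, and stopping at the first character that cannot be found is an assumption.

```python
def matched_suffix_length(rest_of_line, completion):
    """Count how many leading characters of rest_of_line appear, in order, in completion."""
    last_index = -1
    suffix_length = 0
    for char in rest_of_line:
        # Search strictly after the previous match so the order is preserved.
        index = completion.find(char, last_index + 1)
        if index == -1:
            break  # assumption: stop at the first unmatched character
        last_index = index
        suffix_length += 1
    return suffix_length


line = "foo(x, y, z)"
cursor = 7                        # position.character in the example
prefix, rest_of_line = line[:cursor], line[cursor:]
completion = "y=10, z=20)"

overwrite = matched_suffix_length(rest_of_line, completion)
# Replace the matched part of the rest of the line with the completion.
result = prefix + completion + rest_of_line[overwrite:]
print(overwrite, result)
```

When the match covers all of the rest of the line, the completion overwrites it entirely instead of being inserted in front of it, which avoids duplicated trailing text such as a second closing parenthesis.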
Trigger:
Input:
the 5 most recently edited documents
cross-tab edit history
Output:
Speculative Request: When a NES is shown, the system immediately pre-computes the next NES assuming the user will accept. This makes chained suggestions feel instant.
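The speculative request idea can be sketched as a cache keyed by the document state that would result from accepting the current suggestion. A minimal sketch under stated assumptions: `SpeculativeSuggester` and `predict` are hypothetical names, a suggestion is represented as the full document text after the edit, and the precompute runs synchronously here, whereas a real editor would presumably run it in the background.

```python
class SpeculativeSuggester:
    def __init__(self, predict):
        self.predict = predict   # hypothetical model call: document -> edited document or None
        self.cache = {}          # post-accept document -> precomputed next suggestion
        self.cache_hits = 0

    def suggest(self, document):
        if document in self.cache:
            # The user accepted the previous suggestion, so the answer is already computed.
            self.cache_hits += 1
            suggestion = self.cache.pop(document)
        else:
            suggestion = self.predict(document)
        if suggestion is not None:
            # Speculate that the user will accept, and precompute the following suggestion.
            self.cache[suggestion] = self.predict(suggestion)
        return suggestion


# Hypothetical stand-in model: rename one occurrence of "foo" to "bar" per suggestion.
def predict(document):
    return document.replace("foo", "bar", 1) if "foo" in document else None


suggester = SpeculativeSuggester(predict)
first = suggester.suggest("foo foo")    # computed on demand
second = suggester.suggest(first)       # served from the speculative cache
print(first, second, suggester.cache_hits)
```

If the user rejects the suggestion or types something else, the cached entry is simply never looked up, so the cost of a wrong guess is one wasted model call.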
While RL for reasoning might not add intelligence, it enables us to "train on non-data". Model development has shifted from "data mining" to "Verification Asymmetry mining". The cost of post-training has surpassed pretraining in 2026.