NEP (Next Edit Prediction)

How did we get here?

Pretraining

Data Mixture: train smaller models on candidate mixes to find out which mix works well

Post Training Pipeline

Procedure (prompt counts are expected to vary widely):

supervised/instruction fine-tuning (SFT/IFT): 1M prompts
preference fine-tuning (PreFT = RLHF/DPO): 1M (in-distribution) prompts (partial overlap with SFT)
reinforcement fine-tuning: 10K~100K prompts (verifiable: exploits the P vs NP style gap between checking an answer and producing it)

Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? (https://arxiv.org/pdf/2504.13837) No?

These 3 have different goals.

Typical Data Budget (https://www.youtube.com/watch?v=1pmyTnGOevU). Learning a new language (non-fine-tuning) requires about 1B tokens.

reinforcement fine-tuning enables:

reasoning model
function calling / tool use
multi-agent parallel collaboration

PPO consistently outperforms DPO, at the cost of implementation complexity, memory usage, and throughput.

Other Memory Related Stuff:

user level skill / MCP (Model Context Protocol)
retrieval augmented generation (RAG)

Examples of Open Questions in Post Training:

Qwen2.5 pretraining data contamination with tool calling makes even random rewards in RL increase performance
clipping in RL suppresses low-probability tokens, resulting in less exploration
grouped-query attention saves a lot of inference memory (3x reduction)
The delta learning hypothesis: DPO preference data (helpfulness / instruction following / truthfulness / honesty) is saturating in terms of the binary "chosen/rejected" label, but the distance between the two isn't.
RL that skips DPO tends to be worse

A bundle of other small tricks (https://youtu.be/uaZ3yRdYg8A?t=1614):

Zero gradient signal filtering
Active sampling (keep the batch size consistent after zero gradient filtering, Yu et al., 2025)
Token-level Loss (normalize the loss by the total number of tokens across the batch, rather than per sample, Yu et al., 2025)
Remove KL Loss (allows less-restricted policy updates)
Clip Higher (upper clip bound slightly larger than the lower one, Yu et al., 2025)
Truncated importance sampling (Yao et al., 2025)
No standard deviation normalization (when calculating the advantage, Liu et al., 2025)
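
A few of the tricks above are easier to see in code. This is a minimal pure-Python sketch, not any paper's reference implementation: the function names are made up for illustration, and the clip values 0.2 / 0.28 are only example numbers. It covers zero gradient filtering, advantages without standard deviation normalization, clip higher, and token-level loss.

```python
from typing import List


def filter_zero_gradient(groups: List[List[float]]) -> List[List[float]]:
    """Drop prompt groups whose rollouts all got the same reward.

    With a mean-centered advantage, such groups give advantage 0 everywhere,
    so they contribute no gradient signal.
    """
    return [g for g in groups if max(g) != min(g)]


def group_advantages(rewards: List[float]) -> List[float]:
    """Mean-centered advantages for one group, WITHOUT dividing by the std."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def clipped_objective(ratio: float, advantage: float,
                      eps_low: float = 0.2, eps_high: float = 0.28) -> float:
    """PPO-style clipped surrogate with an asymmetric ("clip higher") range."""
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)


def token_level_loss(per_token_objectives: List[List[float]]) -> float:
    """Average over all tokens in the batch instead of averaging per sample first."""
    total = sum(sum(sample) for sample in per_token_objectives)
    count = sum(len(sample) for sample in per_token_objectives)
    return -total / count
```

With token-level loss, a long sample weighs more than a short one: for objectives [[1, 1, 1], [0]] the loss is -0.75, whereas averaging per sample first would give -0.5.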

Evaluation

Automated Benchmarks: MMLU, OpenLLM Leaderboard (hard to evaluate complex task, data contamination)

Human Judge: vibe check, Chatbot Arena, Data Annotation (costly, biased, not scalable)

Judge LLM: LLM-as-judge, reward model, classifier (hidden bias)

Building a strong (non-leaking, secret) evaluation is key to successful RL fine-tuning. Fact: LLM companies keep their benchmarks private because public performance benchmarks are very hackable by dataset companies, without providing an actual performance increase.

Future of LLM

Planning: Long horizon, autonomous model with planning capability.

Multimodal:

Next Token Prediction: pattern matching on text space
World Model: pattern matching on image space
Next Edit Prediction: pattern matching on action space (action in robots, cursor, command, requests, ...)

Next Edit Prediction

Why Next Edit Prediction?

From a UI/UX perspective, next edit prediction is based only on ambient conditions (cursor position, current state of the program) instead of explicit conditions (e.g. text or images that the user explicitly provides).

The goals are:

fast: provide a suggestion before the user can provide an explicit condition
less accurate: trade accuracy for speed
transparent: "Good design is obvious, great design is transparent"
contains non-edits: "don't edit" is also a valid suggestion

Next Edit Prediction can be general:

code: next code you gonna type
blender: next blender command your UI gonna run
krita: next brush stroke you gonna make
premiere: next video trim you gonna do

We don't consider robots and browsers for now because they pose different challenges (e.g. they are more goal-oriented and require longer planning).

Is NEP future proof? Assumptions:

NEP always has faster inference time than explicit condition prediction
providing explicit condition to "narrow down the distribution" is more expensive to user than "just sample a candidate from the distribution" (i.e. either "distribution is simple at the beginning" or "explicit condition can't narrow down the distribution much")

It has a different niche than explicit conditioned prediction. (Small model for specialized task)

What is the fundamental issue?

All learning-based methods can be reduced to:

data
compute
algorithm: "create non-existing data" or "make compute faster"

Since we should "finish" before "making it perfect", we focus on data.

Data comes from 3 sources:

actual data (webscraping, human annotation, synthetic data)
algorithm bias (model architecture / representation, regularization, data processing)
model weights (transfer learning, pretraining)

Most importantly, they are equally important in practice. However, most of the literature focuses only on "model architecture / representation".

"The model behaved overly cautiously—reluctant to touch unfinished code, hesitant to suggest changes to the line a user was typing, and often chose to do nothing. In practice, it performed worse than a vanilla LLM." (https://github.blog/ai-and-ml/github-copilot/evolving-github-copilots-next-edit-suggestions-through-custom-model-training/)

Pull request data wasn’t enough because it:

No Timing: Lacks temporal ordering, so the model can’t learn when changes happen
No "Don't Edit": Contains almost no negative samples (cases where the correct action is “don’t edit”)
Only Perfect Code: Misses abandoned edits, in-progress rewrites, and other common editing behavior

So they used data from code editing sessions (not public).

So they didn't change the model architecture; they changed the data and did supervised fine-tuning from a base model.

Limitations:

still can't teach the model what a bad edit is
cannot utilize "unlabeled code samples" (which I guess refers to diffs)

I guess they used some synthetic data derived from "diff" data

Grader design:

We use a large reasoning model with specific grading criteria.
We routinely analyze model outputs to update the grading criteria, constantly searching for new qualities that indicate unhelpful edits. (prompting?)
The grader should not only consider the correctness of the edit suggestion, but also strive to make the code diff displayed in the UI easier to read (e.g. if there are two possible diffs, prefer the more semantic one)

I guess they used unlabeled data to train the grader. This eliminates distribution drift from instruction tuning.

Inference technique:

cached tokens
Used LLM-based graders to filter out ambiguous or low-signal suggestions

Limitations:

no cross-file edits

Solution:

multi-model approach: train a separate "location model" to predict where to jump next (or no jump)
original NES model then generates the edit suggestion
repeat
carefully balance jump vs no-jump in the dataset distribution (they didn't get this right initially)
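
The loop above can be sketched as below. This is only my reading of the multi-model approach: `predict_jump` and `predict_edit` are hypothetical stand-ins for the location model and the NES model, and stopping when the edit model returns "don't edit" is an assumption.

```python
from typing import Callable, List, Optional, Tuple

Location = Tuple[str, int]  # (file path, line number), a made-up representation


def suggest_chain(
    cursor: Location,
    predict_jump: Callable[[Location], Optional[Location]],  # location model; None = no jump
    predict_edit: Callable[[Location], Optional[str]],       # NES model; None = "don't edit"
    max_steps: int = 3,
) -> List[Tuple[Location, str]]:
    """Alternate between 'where to go next' and 'what to edit there'."""
    suggestions: List[Tuple[Location, str]] = []
    for _ in range(max_steps):
        jump = predict_jump(cursor)
        target = cursor if jump is None else jump  # no jump: edit in place
        edit = predict_edit(target)
        if edit is None:  # assumed stopping rule: nothing left to suggest
            break
        suggestions.append((target, edit))
        cursor = target
    return suggestions
```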

Limitations:

no use of "diff" data

Solution:

add RL: "Instead of relying solely on supervised labels, we added a grading signal based on how closely the model's predicted jump location matched the eventual cursor movement."
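
The post does not give the actual grading formula. As a purely hypothetical illustration of a "how close was the predicted jump" signal, one could score a prediction by line distance within the same file; the `scale` parameter and the handling of "no jump" are my assumptions.

```python
from typing import Optional, Tuple

Location = Tuple[str, int]  # (file path, line number)


def jump_reward(predicted: Optional[Location],
                actual: Optional[Location],
                scale: float = 10.0) -> float:
    """Hypothetical reward in [0, 1]: 1 for an exact match, decaying with line distance."""
    if predicted is None and actual is None:
        return 1.0  # correctly predicted "no jump"
    if predicted is None or actual is None:
        return 0.0  # jumped when the user stayed, or the reverse
    if predicted[0] != actual[0]:
        return 0.0  # wrong file
    distance = abs(predicted[1] - actual[1])
    return 1.0 / (1.0 + distance / scale)
```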

It is unclear why they didn't include certain instructions. Maybe they found in experiments that instructions hurt pre-training knowledge, since they would break code patterns and mess up the context (or they just wanted room for a bigger context).

I guess they used RL to treat "TAB completion steps" as "reasoning steps" so that it can use "diff" as "environment reward signal" to learn "what is a good edit" without explicit labeling.

Actual Data

Data Filter

Multiple Edit Chunks: A commit must contain at least two edit chunks. This ensures there is a history of at least one edit to serve as the resource for a subsequent edit.
Bounded Chunk Length: Each edit chunk must not exceed five lines. Overly long, complex changes fall outside this scope and are considered to be different types of interaction.
Limited Edit Scope: The total distance between the first edit chunk and the last line of the last edit chunk within a commit must not exceed 80 lines. This prevents the context from becoming excessively large and unfocused.
Additive Edits Only: To simplify the initial task formulation, we select only commits that consist exclusively of additive edits, excluding those with deletions.
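
The four filter rules above can be written as one predicate over a commit. A minimal sketch, assuming a commit is already parsed into chunks: the `EditChunk` type and its fields are hypothetical, and measuring the span from the first chunk's first line to the last chunk's last line is my reading of the rule.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EditChunk:
    start_line: int     # first line of the chunk in the new file (1-indexed)
    added_lines: int    # number of lines the chunk adds
    deleted_lines: int  # number of lines the chunk deletes

    @property
    def end_line(self) -> int:
        return self.start_line + max(self.added_lines, 1) - 1


def keep_commit(chunks: List[EditChunk],
                max_chunk_lines: int = 5,
                max_span: int = 80) -> bool:
    """Return True if the commit passes all four filters."""
    if len(chunks) < 2:                                      # Multiple Edit Chunks
        return False
    if any(c.deleted_lines > 0 for c in chunks):             # Additive Edits Only
        return False
    if any(c.added_lines > max_chunk_lines for c in chunks):  # Bounded Chunk Length
        return False
    ordered = sorted(chunks, key=lambda c: c.start_line)
    span = ordered[-1].end_line - ordered[0].start_line      # Limited Edit Scope
    return span <= max_span
```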

Data Labeling

Edit Continuation: classify if the final edit is a logical continuation of the preceding ones

GitHub Copilot

Inline Completion

Trigger:

75 ms after typing or idle

Input:

prefix: all code before cursor
suffix: some code after cursor
metadata: language, repo name, file name, indentation

Output:

code in the middle, which sometimes overlaps with the suffix (but is separated by a double newline or an AST boundary)

Post Processing:

It doesn't use the full completion output; instead:
It chunks the output at AST boundaries (e.g. {}), double newlines (empty lines), and a hard token limit (e.g. 100 tokens)
Very complicated post processing.
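
A heavily simplified sketch of two of the three cut points (empty line and hard token limit). The AST boundary is left out, and counting whitespace-separated words as "tokens" is an assumption made only to keep the example self-contained; the real post-processing is far more involved.

```python
import re


def truncate_completion(text: str, max_tokens: int = 100) -> str:
    """Cut a raw completion at the first empty line, then at a rough token limit."""
    # 1. Stop at the first double newline (empty line).
    head = text.split("\n\n", 1)[0]
    # 2. Enforce the hard limit, approximating tokens by non-whitespace runs.
    words = list(re.finditer(r"\S+", head))
    if len(words) > max_tokens:
        head = head[: words[max_tokens - 1].end()]
    return head
```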

Document line: "foo(x, |y, z)"    cursor at | (position.character = 7)
restOfLine = "y, z)"

LLM completion: "y=10, z=20)"

Greedy match:
  'y' → indexOf('y', 0) = 0    suffixLength=1, lastIndex=0
  ',' → indexOf(',', 1) = 4    suffixLength=2, lastIndex=4
  ' ' → indexOf(' ', 5) = 5    suffixLength=3, lastIndex=5
  'z' → indexOf('z', 6) = 6    suffixLength=4, lastIndex=6
  ')' → indexOf(')', 7) = 10   suffixLength=5, lastIndex=10
-> return 5 (= all of restOfLine)

Effect: VS Code replaces "y, z)" entirely, showing:
  foo(x, y=10, z=20)

Abstractly, it separates "edit" output from "complete" output by splitting at AST boundaries.
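
The greedy match traced above can be reproduced with a few lines of Python. This is a sketch of the idea only: the function name is made up, and stopping at the first character that cannot be found is my assumption about the mismatch case.

```python
def greedy_suffix_overlap(rest_of_line: str, completion: str) -> int:
    """Count how many leading chars of rest_of_line appear, in order, in completion."""
    matched = 0
    search_from = 0
    for ch in rest_of_line:
        idx = completion.find(ch, search_from)
        if idx == -1:
            break  # assumed behavior: stop at the first unmatched character
        matched += 1
        search_from = idx + 1
    return matched


def apply_completion(line: str, cursor: int, completion: str) -> str:
    """Insert the completion at the cursor and drop the part of the line it re-types."""
    rest = line[cursor:]
    overlap = greedy_suffix_overlap(rest, completion)
    return line[:cursor] + completion + rest[overlap:]
```

For the example above, `apply_completion("foo(x, y, z)", 7, "y=10, z=20)")` replaces all of "y, z)" and yields "foo(x, y=10, z=20)".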

Next Edit Prediction

Trigger:

edit -> cursor move -> idle

Input:

the 5 most recently edited documents
- all code
- recent edit diff
  - StringEdit[] = List[[start, end)]
  - LineEdit[] = List[[start, end)]
  - Only edits from the last 10 minutes are kept
  - Composed edit must not exceed 100 lines of changes
  - Individual edits can't exceed 5000 inserted/deleted characters
  - Maximum ~5 line replacements in the composed result
- cursor position
cross-tab edit history
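
The edit-history limits above (10 minutes, 5000 characters per edit, 100 lines in total) can be sketched as a filter. The `RecentEdit` type is hypothetical, and so is the policy for what happens when a limit is hit: here oversized edits are skipped and older edits are dropped once the line budget is used up.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RecentEdit:
    timestamp: float    # seconds
    chars_changed: int  # inserted + deleted characters
    lines_changed: int  # lines touched by this edit


def select_history(edits: List[RecentEdit], now: float,
                   max_age_s: float = 600.0,
                   max_chars: int = 5000,
                   max_total_lines: int = 100) -> List[RecentEdit]:
    """Keep the newest edits that fit the age, size, and total-line limits."""
    kept: List[RecentEdit] = []
    total_lines = 0
    for e in sorted(edits, key=lambda e: e.timestamp, reverse=True):  # newest first
        if now - e.timestamp > max_age_s:
            break      # everything after this is older still
        if e.chars_changed > max_chars:
            continue   # assumed: skip a single oversized edit
        if total_lines + e.lines_changed > max_total_lines:
            break      # assumed: the line budget is spent on the newest edits
        kept.append(e)
        total_lines += e.lines_changed
    return list(reversed(kept))  # back to chronological order
```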

Output:

Dict[targetDocument : edit]

Speculative Request: When a NES is shown, the system immediately pre-computes the next NES assuming the user will accept. This makes chained suggestions feel instant.
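
A toy, synchronous sketch of the speculative request idea: after producing a suggestion, immediately compute the suggestion for the document as it would look after acceptance, and cache it. The class and the `predict` callable are hypothetical; a real system would run the speculative call in the background.

```python
from typing import Callable, Dict, Optional


class SpeculativeNES:
    def __init__(self, predict: Callable[[str], Optional[str]]):
        # predict maps document text to the suggested new document text (None = no edit)
        self._predict = predict
        self._cache: Dict[str, Optional[str]] = {}
        self.model_calls = 0

    def _run(self, doc: str) -> Optional[str]:
        self.model_calls += 1
        return self._predict(doc)

    def suggest(self, doc: str) -> Optional[str]:
        if doc in self._cache:
            suggestion = self._cache.pop(doc)  # precomputed: feels instant
        else:
            suggestion = self._run(doc)
        if suggestion is not None:
            # Speculate that the user accepts, and precompute the follow-up.
            self._cache[suggestion] = self._run(suggestion)
        return suggestion
```

If the user rejects the suggestion, the speculative result is simply never read, so the cost is one wasted model call.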

Future of NEP is RL

While RL for reasoning might not add intelligence, it enables us to "train on non-data". Model development has shifted from "data mining" to "verification asymmetry mining". The cost of post-training has surpassed that of pretraining in 2026.
