NEP (Next Edit Prediction)

Next Edit Prediction

Where have we got here?

Pretraining

Data Mixture: train smaller model with mix, find out what mix is good
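The proxy-model idea can be sketched as a small search loop. This is only a toy sketch: `proxy_loss` is a hypothetical stand-in for "train a small model on this mixture and return its validation loss", and the source names and the synthetic loss surface are assumptions made for illustration.

```python
import itertools

SOURCES = ["web", "code", "books"]  # hypothetical data sources


def candidate_mixtures(step=0.25):
    """Enumerate mixture weights on a grid; each candidate sums to 1."""
    levels = [i * step for i in range(int(round(1 / step)) + 1)]
    for weights in itertools.product(levels, repeat=len(SOURCES)):
        if abs(sum(weights) - 1.0) < 1e-9:
            yield dict(zip(SOURCES, weights))


def proxy_loss(mixture):
    """Hypothetical stand-in for training a small proxy model on `mixture`.

    Here it is a synthetic bowl-shaped surface whose minimum sits at
    web=0.5, code=0.25, books=0.25 (an assumption, not a real result).
    """
    target = {"web": 0.5, "code": 0.25, "books": 0.25}
    return sum((mixture[s] - target[s]) ** 2 for s in SOURCES)


def best_mixture(step=0.25):
    """Pick the candidate mixture with the lowest proxy loss."""
    return min(candidate_mixtures(step), key=proxy_loss)
```

In a real setting the expensive part is `proxy_loss`; the search loop itself stays this simple, and the chosen mixture is then reused for the larger run.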

Post Training Pipeline

Procedure (prompt counts vary widely in practice):

  1. supervised/instruction fine-tuning (SFT/IFT): 1M prompts
  2. preference fine-tuning (PreFT = RLHF/DPO): 1M (in-distribution) prompts (partial overlap with SFT)
  3. reinforcement fine-tuning: 10~100K (verifiable: closing P=NP gap) prompts

Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? (https://arxiv.org/pdf/2504.13837) No?

These 3 have different goals.

Typical Data Budget (https://www.youtube.com/watch?v=1pmyTnGOevU). Learning a new language (non-fine-tuning) requires about 1B tokens.

reinforcement fine-tuning enables:

Choosing DPO, PPO, GRPO in RLHF

PPO consistently outperforms DPO, at the cost of higher implementation complexity, higher memory usage, and lower throughput.
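To make the difference between the methods concrete, here is a minimal sketch of the per-pair DPO loss and the group-relative advantage used in GRPO. It works on plain floats (sequence log-probabilities and scalar rewards) rather than tensors; the function names and the default `beta` are assumptions for illustration, not taken from any particular library.

```python
import math
import statistics


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)), written in a numerically stable form
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


def grpo_advantages(rewards, eps=1e-8):
    """Standardize rewards within the group of samples drawn for one prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

DPO only needs log-probabilities from the policy and a frozen reference model, which is part of why it is cheaper to run. GRPO replaces a learned value baseline with the group mean, so it needs several sampled responses per prompt instead.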

Other Memory Related Stuff:

Examples of Open Questions in Post Training:

Other small trick bundle (https://youtu.be/uaZ3yRdYg8A?t=1614):

Loss function after tricks

Evaluation

Automated Benchmarks: MMLU, OpenLLM Leaderboard (hard to evaluate complex task, data contamination)

Human Judge: vibe check, Chatbot Arena, Data Annotation (costly, biased, not scalable)

Judge LLM: LLM-as-judge, reward model, classifier (hidden bias)

Building a strong (non-leaking, secret) evaluation is key to successful RL fine-tuning. Fact: LLM companies keep their benchmarks private because public benchmark scores are easy for dataset companies to game without delivering any real performance gain.
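The data contamination problem mentioned for automated benchmarks can be probed with a simple n-gram overlap check. The sketch below is one common heuristic, not a method attributed to any lab; the whitespace tokenization and the default `n=8` are assumptions.

```python
def ngrams(text, n=8):
    """Set of lowercase word n-grams in `text` (empty if the text is too short)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(benchmark_items, training_docs, n=8):
    """Fraction of benchmark items sharing at least one n-gram with training data."""
    train_grams = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    if not benchmark_items:
        return 0.0
    flagged = sum(1 for item in benchmark_items if ngrams(item, n) & train_grams)
    return flagged / len(benchmark_items)
```

A high rate suggests the benchmark has leaked into the training set. A low rate does not prove the benchmark is clean, since paraphrased items pass this check unnoticed.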

Total Research Compute

Future of LLM

Planning: Long horizon, autonomous model with planning capability.

Multimodal:

Next Edit Prediction

Why Next Edit Prediction?

From a UI/UX perspective, next edit prediction is conditioned only on ambient context (cursor position, current state of the program) rather than on explicit input (e.g. text or an image that the user explicitly provides).

The goal is:

Next Edit Prediction can be general:

We don't consider robots and browsers for now because they pose different challenges (e.g. they are more goal-oriented and require longer planning).

Is NEP future proof? Assumptions:

It has a different niche than explicit conditioned prediction. (Small model for specialized task)

What is the fundamental issue?

All learning-based methods can be reduced to:

Since we should "finish" before "making it perfect", we focus on data first.

Data comes in 3 sources:

Most importantly, they are equally important in practice. However, most of the literature focuses only on "model architecture / representation".

"The model behaved overly cautiously—reluctant to touch unfinished code, hesitant to suggest changes to the line a user was typing, and often chose to do nothing. In practice, it performed worse than a vanilla LLM." (https://github.blog/ai-and-ml/github-copilot/evolving-github-copilots-next-edit-suggestions-through-custom-model-training/)

Pull request data wasn’t enough because it:

So they used data from code editing sessions (secret)

So they did not change the model architecture; they changed the data and ran supervised fine-tuning from the base model.

Limitations:

I guess they used some synthetic data derived from "diff" data.

grader design:

I guess they used unlabeled data to train the grader. This eliminates distribution drift from instruction tuning.

Inference technique:

Limitations:

Solution:

Limitations:

Solution:

It is unclear why they left out certain instructions. Maybe their experiments showed that instructions hurt pre-training knowledge, since they break code patterns and clutter the context (or maybe it was just to leave room for a bigger context).

I guess they used RL to treat "TAB completion steps" as "reasoning steps" so that it can use "diff" as "environment reward signal" to learn "what is a good edit" without explicit labeling.
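The guess above can be illustrated with a toy reward function that scores a predicted edit against what the user actually wrote next. This is purely a sketch of the idea, not the method of any product: the name `edit_reward`, the use of character-level similarity from `difflib`, and the "improvement over doing nothing" baseline are all assumptions.

```python
import difflib


def edit_reward(before, predicted_after, actual_after):
    """Reward in [-1, 1]: how much closer the predicted edit brings the text
    to what the user actually wrote, compared with making no edit at all."""
    def similarity(a, b):
        return difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()

    return similarity(predicted_after, actual_after) - similarity(before, actual_after)
```

Under this scheme no human label is needed: the later state of the file plays the role of the environment's feedback, a helpful edit scores positive, a no-op scores zero, and a harmful edit scores negative.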

Actual Data

Data Filter

Data Labeling

Github Copilot

Inline Completion

Trigger:

Input:

Output:

Post Processing:

Document line: "foo(x, |y, z)"    cursor at | (position.character = 7)
restOfLine = "y, z)"

LLM completion: "y=10, z=20)"

Greedy match:
  'y' → indexOf('y', 0) = 0    suffixLength=1, lastIndex=0
  ',' → indexOf(',', 1) = 4    suffixLength=2, lastIndex=4
  ' ' → indexOf(' ', 5) = 5    suffixLength=3, lastIndex=5
  'z' → indexOf('z', 6) = 6    suffixLength=4, lastIndex=6
  ')' → indexOf(')', 7) = 10   suffixLength=5, lastIndex=10
-> return 5 (= all of restOfLine)

Effect: VS Code replaces "y, z)" entirely, showing:
  foo(x, y=10, z=20)
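The greedy match traced above can be written as a short function. This is a sketch reconstructed from the trace, not the actual Copilot source: the names `suffix_overlap_length` and `apply_completion` are hypothetical, and stopping at the first unmatched character is an assumption.

```python
def suffix_overlap_length(rest_of_line, completion):
    """Greedy in-order match of rest_of_line's characters inside completion.

    Returns how many leading characters of rest_of_line the completion
    already covers, i.e. how much of the existing suffix it may replace.
    """
    last_index = -1
    matched = 0
    for ch in rest_of_line:
        idx = completion.find(ch, last_index + 1)
        if idx == -1:
            break  # assumption: stop at the first character that cannot be matched
        last_index = idx
        matched += 1
    return matched


def apply_completion(line, cursor, completion):
    """Insert completion at the cursor, replacing the covered part of the suffix."""
    rest_of_line = line[cursor:]
    covered = suffix_overlap_length(rest_of_line, completion)
    return line[:cursor] + completion + rest_of_line[covered:]
```

For the example above, `apply_completion("foo(x, y, z)", 7, "y=10, z=20)")` yields `foo(x, y=10, z=20)`, because all five characters of the suffix are matched in order.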

Abstractly, it separates "edit" output from "complete" output by splitting at AST boundaries.

Next Edit Prediction

Trigger:

Input:

Output:

Speculative Request: When a NES is shown, the system immediately pre-computes the next NES assuming the user will accept. This makes chained suggestions feel instant.
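The speculative request can be sketched as a one-step-ahead cache. This is a simplified, synchronous illustration: `predict` is a hypothetical model call that maps the document text to the suggested document text (or `None`), and a real system would presumably issue the speculative call in the background instead of blocking.

```python
class SpeculativeSuggester:
    """Caches the follow-up suggestion, assuming the current one gets accepted."""

    def __init__(self, predict):
        self.predict = predict  # hypothetical: doc text -> suggested doc text, or None
        self.cache = {}
        self.model_calls = 0

    def _call_model(self, doc):
        self.model_calls += 1
        return self.predict(doc)

    def suggest(self, doc):
        """Return (suggestion, served_from_cache) for the current document."""
        if doc in self.cache:
            # The user accepted the previous suggestion: answer instantly.
            suggestion = self.cache.pop(doc)
            hit = True
        else:
            suggestion = self._call_model(doc)
            hit = False
        # Pre-compute the next suggestion for the post-acceptance document.
        if suggestion is not None and suggestion not in self.cache:
            self.cache[suggestion] = self._call_model(suggestion)
        return suggestion, hit
```

If the user rejects the suggestion or types something else, the document no longer matches a cache key and the next call falls back to a normal request, so the only cost of a wrong guess is one wasted model call.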

Future of NEP is RL

While RL for reasoning might not add intelligence, it enables us to "train on non-data". Model development has shifted from "data mining" to "verification asymmetry mining". The cost of post-training has surpassed that of pretraining in 2026.
