May 20, 2026
token prediction techniques
next-token, multi-token, and text-order prediction in language models
Language models learn by predicting tokens. The exact prediction target changes what kind of signal the model gets during training.
NTP: next-token prediction
Next-token prediction is the standard setup.
The model sees previous tokens and predicts only the next token.
┌────────────────┐     ┌───────┐     ┌────────────┐
│                │     │       │     │            │
│ Context tokens ├────►│ Model ├────►│ Next token │
│                │     │       │     │            │
└────────────────┘     └───────┘     └────────────┘
This is used everywhere because it is simple, scalable, and matches how generation works.
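As a rough illustration of the setup above, here is a minimal sketch of how next-token training examples and their loss can be formed. The helper names `ntp_pairs` and `ntp_loss` are made up for this post, and the "model output" is just a hand-written probability table, not a real network.

```python
import math


def ntp_pairs(tokens):
    """Pair every prefix of the sequence with the single token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def ntp_loss(predicted_probs, target):
    """Cross-entropy for one step: negative log-probability of the true next token."""
    return -math.log(predicted_probs[target])


tokens = ["the", "cat", "sat", "down"]
pairs = ntp_pairs(tokens)
for context, target in pairs:
    print(context, "->", target)

# Hypothetical model output for the context ["the"]: a distribution over next tokens.
probs = {"cat": 0.5, "dog": 0.5}
print(ntp_loss(probs, "cat"))
```

Each position contributes exactly one target, which is why the same loop shape works for both training and step-by-step generation.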
MTP: multi-token prediction
Multi-token prediction asks the model to predict more than one future token at once.
Instead of predicting only token t + 1, it may predict tokens t + 1 through t + k, for example the next 5 tokens (the diagram below shows 3).
┌────────────────┐     ┌───────┐     ┌──────────┐
│                │     │       │     │          │
│ Context tokens ├────►│ Model ├────►│ Token +1 │
│                │     │       │     │          │
└────────────────┘     └───┬───┘     └──────────┘
                           │
                           │
                           │
                           │
                           │
                           │         ┌──────────┐
                           │         │          │
                           ├────────►│ Token +2 │
                           │         │          │
                           │         └──────────┘
                           │
                           │
                           │
                           │
                           │
                           │         ┌──────────┐
                           │         │          │
                           └────────►│ Token +3 │
                                     │          │
                                     └──────────┘
Meta has explored this in research, and DeepSeek-V3 trains with a related multi-token objective.
The point is to give the model a richer training signal about the near future.
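To make the "more than one future token" idea concrete, here is a small sketch of how the targets could be built, with one target per output head. The names `mtp_targets` and `mtp_loss` are hypothetical. Dropping positions that lack a full set of k future tokens and summing the per-head losses are simplifying assumptions for this post, not a description of any specific model's recipe.

```python
import math


def mtp_targets(tokens, k):
    """Pair each context with the next k tokens (one target per prediction head).

    Assumption: positions with fewer than k future tokens are skipped.
    """
    examples = []
    for i in range(1, len(tokens) - k + 1):
        examples.append((tokens[:i], tokens[i:i + k]))
    return examples


def mtp_loss(head_probs, targets):
    """Sum of per-head cross-entropies; head j scores the token j + 1 steps ahead."""
    return sum(-math.log(probs[tok]) for probs, tok in zip(head_probs, targets))


tokens = ["the", "cat", "sat", "on", "the", "mat"]
examples = mtp_targets(tokens, k=3)
for context, future in examples:
    print(context, "->", future)
```

With k = 1 this reduces to the next-token setup, which is one way to see MTP as an extension of NTP instead of a replacement for it.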
TOP: text-order prediction
Text-order prediction is like a mix of NTP and MTP.
The model still trains on the usual next-token objective, but it also learns to order tokens by how soon they will appear.
A token that appears 1 step ahead is coming very soon. A token that appears 2 steps ahead is coming soon, but slightly later. Tokens that do not appear inside the fixed future window are treated as not upcoming at all.
┌─────────┐     ┌───────┐     ┌─────────────────────┐
│         │     │       │     │                     │
│ Context ├────►│ Model ├────►│     Next token      │
│         │     │       │     │                     │
└─────────┘     └───┬───┘     └─────────────────────┘
                    │
                    │
                    │
                    │
                    │
                    │         ┌─────────────────────┐
                    │         │                     │
                    └────────►│ Future order window │
                              │                     │
                              └─────────────────────┘
The fixed window keeps the task bounded while still teaching the model about near-future structure.
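One way to picture the order target is as a score per vocabulary token, where sooner means higher. The sketch below is only an illustration: the function name `top_targets`, the scoring rule `window - distance + 1`, and the use of 0 for tokens outside the window are all assumptions made for this post. A real setup would also need a ranking loss on top of these scores, which is left out here.

```python
def top_targets(tokens, position, window, vocab):
    """Score each vocabulary token by how soon it appears after the context.

    `position` is the index of the next token, i.e. the context is tokens[:position].
    Distance 1 (the very next token) gets the highest score, equal to `window`.
    Tokens that do not occur inside the window keep a score of 0 (assumed convention).
    """
    scores = {tok: 0 for tok in vocab}
    future = tokens[position:position + window]
    # Walk from the far end of the window to the near end,
    # so that the nearest occurrence of a repeated token wins.
    for distance in range(len(future), 0, -1):
        scores[future[distance - 1]] = window - distance + 1
    return scores


tokens = ["the", "cat", "sat", "on", "the", "mat"]
vocab = sorted(set(tokens))

scores = top_targets(tokens, position=1, window=3, vocab=vocab)
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores)
print(ranking[:3])
```

For the context `["the"]` with a window of 3, "cat" scores highest, then "sat", then "on", while "the" and "mat" fall outside the window and stay at 0.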