May 20, 2026

token prediction techniques

next-token, multi-token, and token-order prediction in language models

Language models learn by predicting tokens. The exact prediction target changes what kind of signal the model gets during training.

NTP: next-token prediction

Next-token prediction is the standard setup.

The model sees previous tokens and predicts only the next token.

┌────────────────┐     ┌───────┐     ┌────────────┐
│                │     │       │     │            │
│ Context tokens ├────►│ Model ├────►│ Next token │
│                │     │       │     │            │
└────────────────┘     └───────┘     └────────────┘

This setup is used almost everywhere because it is simple, scalable, and matches how generation works: one token at a time, left to right.
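
To make the training target concrete, here is a minimal sketch of how one sequence turns into next-token examples. The function name ntp_pairs and the toy word-level tokens are assumptions for illustration only. Real training pipelines work on integer token IDs and compute every position in one batched pass.

```python
def ntp_pairs(tokens):
    """Return one (context, next_token) pair per position in the sequence."""
    # Position i uses everything before it as context and token i as the target.
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


tokens = ["the", "cat", "sat", "down"]
for context, target in ntp_pairs(tokens):
    print(context, "->", target)
```

Every position contributes exactly one target, so a sequence of n tokens yields n - 1 training examples.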

MTP: multi-token prediction

Multi-token prediction asks the model to predict more than one future token at once.

Instead of predicting only token t + 1, it may predict, say, the next 3 tokens (t + 1 through t + 3), as in the diagram below.

┌────────────────┐     ┌───────┐     ┌──────────┐
│                │     │       │     │          │
│ Context tokens ├────►│ Model ├────►│ Token +1 │
│                │     │       │     │          │
└────────────────┘     └───┬───┘     └──────────┘
                           │                     
                           │                     
                           │                     
                           │                     
                           │                     
                           │         ┌──────────┐
                           │         │          │
                           ├────────►│ Token +2 │
                           │         │          │
                           │         └──────────┘
                           │                     
                           │                     
                           │                     
                           │                     
                           │                     
                           │         ┌──────────┐
                           │         │          │
                           └────────►│ Token +3 │
                                     │          │
                                     └──────────┘

Meta researchers studied this in a 2024 paper on multi-token prediction, and DeepSeek-V3 trains with a related multi-token objective.

The point is to give the model a richer training signal about the near future.
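
As a rough illustration of that richer signal, the sketch below builds multi-token targets for a toy sequence. The function name mtp_targets and the horizon parameter k are hypothetical. The sketch covers only the targets. How a model produces k outputs (separate heads, extra modules, or something else) varies between implementations and is not shown.

```python
def mtp_targets(tokens, k):
    """Pair each context with the k tokens that follow it.

    Positions with fewer than k future tokens are skipped in this sketch.
    """
    return [
        (tokens[: i + 1], tokens[i + 1 : i + 1 + k])
        for i in range(len(tokens) - k)
    ]


tokens = ["the", "cat", "sat", "on", "the", "mat"]
for context, future in mtp_targets(tokens, k=3):
    print(context, "->", future)
```

With k = 1 this reduces to the next-token setup, except that each target is wrapped in a one-element list.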

TOP: token order prediction

Token order prediction sits between NTP and MTP.

The model still trains on the usual next-token objective, but it also learns to rank tokens by how soon they will appear.

A token that appears 1 step ahead is ranked as coming very soon. A token that appears 2 steps ahead is ranked just below it, and so on. Tokens that do not appear inside the fixed future window are treated as not appearing at all.

┌─────────┐     ┌───────┐     ┌─────────────────────┐
│         │     │       │     │                     │
│ Context ├────►│ Model ├────►│      Next token     │
│         │     │       │     │                     │
└─────────┘     └───┬───┘     └─────────────────────┘
                    │                                
                    │                                
                    │                                
                    │                                
                    │                                
                    │         ┌─────────────────────┐
                    │         │                     │
                    └────────►│ Future order window │
                              │                     │
                              └─────────────────────┘

The fixed window keeps the task bounded while still teaching the model about near-future structure.
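
One way to picture the order target is as a score per vocabulary token. The sketch below is an assumption-laden illustration, not a published recipe: the function name top_targets, the scoring rule (window - distance + 1), and the score of 0 for tokens outside the window are all choices made here for clarity. An actual method may use different scores and would pair them with a ranking loss, which is not shown.

```python
def top_targets(tokens, position, vocab, window):
    """Score each vocab token by how soon it next appears after `position`.

    Assumed scoring: distance 1 scores `window`, distance `window` scores 1,
    and tokens that do not appear within the window score 0.
    """
    scores = {tok: 0 for tok in vocab}
    future = tokens[position + 1 : position + 1 + window]
    for distance, tok in enumerate(future, start=1):
        # Keep only the nearest occurrence of a repeated token.
        if scores[tok] == 0:
            scores[tok] = window - distance + 1
    return scores


tokens = ["the", "cat", "sat", "on", "the", "mat"]
vocab = sorted(set(tokens))
print(top_targets(tokens, position=0, vocab=vocab, window=3))
```

For position 0 and a window of 3, "cat" gets the highest score, then "sat", then "on", while "mat" and the second "the" fall outside the window and score 0.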