TL;DR
- Long context is not only about longer sequences. Compress text into vision to cut token costs.
- DeepSeek-OCR demonstrates Context Optical Compression: at compression ratios under 10×, the model decodes text with ~97% accuracy; at around 20×, accuracy drops to ~60%.
- DeepEncoder uses windowed attention, a 16× convolutional compressor, then global attention to keep visual tokens small while retaining detail.
- This enables a “forgetting curve”: recent messages stay as text; older ones become lower resolution images. Cost scales gently over time.
- Practical takeaway: treat documents as 2D maps rather than only 1D streams when pushing for longer, cheaper context.
Human memory is a luxurious compromise. We vividly remember what happened minutes ago, yet allow the details from weeks past to fade away. This “gradual forgetting” is not a flaw; it is a vital mechanism that keeps our minds functional.
We often push large language models (LLMs) in the opposite direction, expecting “perfect memory” over infinite context. That collides with physical limits: as sequence length grows, computation rises quadratically.
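For a sense of scale, here is a back-of-the-envelope sketch of that quadratic growth. The hidden size and sequence lengths are illustrative assumptions, not figures from the paper.

```python
# Back-of-the-envelope: self-attention FLOPs grow quadratically with sequence length.
# The hidden size (d_model = 4096) and lengths are illustrative assumptions only.

def attention_flops(seq_len: int, d_model: int = 4096) -> int:
    # QK^T and the attention-weighted sum over V each cost ~seq_len^2 * d_model multiply-adds.
    return 2 * (seq_len ** 2) * d_model

for n in (4_096, 32_768, 131_072):
    print(f"{n:>7} tokens -> {attention_flops(n):.2e} attention FLOPs per layer")

# Going from 4k to 128k tokens (32x more text) costs roughly 1,000x more attention compute.
```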
Maybe the question is wrong. What if the bottleneck is not memory size but the assumption that models must read to remember? DeepSeek-AI proposes Context Optical Compression (COC), challenging our bias toward text-only thinking.
Escaping the Chains of Sequence: When “Reading” Becomes “Compression”
We treat language as a one-dimensional stream of tokens. Humans, however, often perceive documents as two-dimensional layouts.
- DeepSeek-AI’s question: can vision serve as an efficient medium to compress text context?
- Example: an A4 page might be ~1,000 text tokens, yet ~100 visual tokens can capture its information at similar fidelity.
- DeepSeek-OCR demonstrates this. At compression ratios under 10×, decoding accuracy is ~97%; at ~20×, it falls to ~60%.
- Conclusion: compact language models can learn to decode highly compressed visual representations.
- Implication: long context may be easier in 2D visual space than in 1D text space.
OCR is a first beachhead for this approach. It quantifies a compression–decompression mapping between vision and text and shows the path is practical.
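A minimal sketch of the token-budget arithmetic behind those numbers. The first visual-token count (~100 per page) comes from the example above; the second (50) is an assumed point chosen only to produce the stated ~20× ratio.

```python
# Token-budget arithmetic for one A4 page, using the article's rough figures.
# The accuracy strings echo the reported ~97% (<10x) and ~60% (~20x) results.

TEXT_TOKENS_PER_PAGE = 1_000  # rough figure for an A4 page of text

def compression_ratio(visual_tokens: int, text_tokens: int = TEXT_TOKENS_PER_PAGE) -> float:
    return text_tokens / visual_tokens

for visual_tokens, reported_accuracy in [(100, "~97%"), (50, "~60%")]:
    ratio = compression_ratio(visual_tokens)
    print(f"{visual_tokens} visual tokens -> {ratio:.0f}x compression, reported accuracy {reported_accuracy}")
```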
The Alchemy of Efficiency: How DeepEncoder “Sees More, Eats Less”
This idea needs a different engine. Traditional VLM encoders struggle with high-resolution documents: dual-tower and tiling designs are hard to deploy and inflate the visual token count, while other designs produce large activations that overload GPU memory.
The heart of DeepSeek-OCR is DeepEncoder, a clever architectural innovation that balances high resolution with low computational cost through a two-step serial design:
| Stage | What it does |
|---|---|
| Perception (window attention) | High-resolution input is handled with local windows that keep activation memory low. |
| Compression (convolutional compressor) | Thousands of patch tokens (for example, 4,096 from 1024×1024) are reduced to 256 via a 16× conv compressor. |
| Cognition (global attention) | The 256 tokens feed a global attention module (based on CLIP-large) for understanding, now computationally manageable. |
Separating perception from cognition is the key. Local attention processes pixels; the compressor keeps only the essence; global attention reasons over a compact set.
This architecture allows the model to handle high-resolution input efficiently: low activation, minimal visual tokens, and strong performance.
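To make the three stages concrete, here is a shape-level sketch in PyTorch. Only the token arithmetic (1024×1024 input → 4,096 patch tokens → 256 compressed tokens) follows the design described above; the layers themselves are generic stand-ins, not the released DeepEncoder, which pairs a windowed-attention vision backbone with CLIP-large for global attention.

```python
# Shape-level sketch of DeepEncoder's serial design: local attention -> 16x token
# compression -> global attention. Layer choices here are illustrative stand-ins.
import torch
import torch.nn as nn


def window_partition(x: torch.Tensor, win: int) -> torch.Tensor:
    """(B, H, W, C) -> (B * num_windows, win*win, C): attention stays local and cheap."""
    b, h, w, c = x.shape
    x = x.view(b, h // win, win, w // win, win, c)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, win * win, c)


def window_merge(x: torch.Tensor, win: int, h: int, w: int) -> torch.Tensor:
    """Inverse of window_partition: (B * num_windows, win*win, C) -> (B, H, W, C)."""
    c = x.shape[-1]
    x = x.view(-1, h // win, w // win, win, win, c)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, h, w, c)


class DeepEncoderSketch(nn.Module):
    def __init__(self, dim: int = 768, patch: int = 16, win: int = 8):
        super().__init__()
        self.win = win
        # 1) Perception: patchify, then attention restricted to local windows.
        self.patchify = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)  # 1024/16 = 64 -> 4,096 tokens
        self.local_attn = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        # 2) Compression: 16x fewer tokens = 4x downsampling per spatial side (64 -> 16).
        self.compressor = nn.Conv2d(dim, dim, kernel_size=4, stride=4)       # 4,096 -> 256 tokens
        # 3) Cognition: global attention is now affordable over just 256 tokens.
        self.global_attn = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        x = self.patchify(img).permute(0, 2, 3, 1)           # (B, 64, 64, dim)
        b, h, w, _ = x.shape
        x = self.local_attn(window_partition(x, self.win))   # attention over 64-token windows only
        x = window_merge(x, self.win, h, w).permute(0, 3, 1, 2)
        x = self.compressor(x)                                # (B, dim, 16, 16)
        x = x.flatten(2).transpose(1, 2)                      # (B, 256, dim)
        return self.global_attn(x)                            # global reasoning over 256 visual tokens


tokens = DeepEncoderSketch()(torch.randn(1, 3, 1024, 1024))
print(tokens.shape)  # torch.Size([1, 256, 768])
```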
On OmniDocBench, DeepSeek-OCR used fewer than 800 visual tokens and outperformed MinerU2.0, which needed nearly 7,000. That is a practical efficiency win.
The Ultimate Vision: From “Optical Compression” to “Forgetting Mechanism”
DeepSeek-OCR’s value goes beyond document processing. It hints at an architecture that simulates memory and forgetting.
The most thought-provoking part of the research is Figure 13, which aligns three axes:
| Axis | Crystal clear | Almost gone |
|---|---|---|
| Human memory | Just happened | After one year |
| Human vision | 10 cm | 20 metres |
| Text compression | Text tokens | “Tiny” visual mode |
These three curves run in parallel. DeepSeek-AI’s hypothesis: Context Optical Compression can mimic biological forgetting at very low cost.
Imagine a future LLM handling a months-long conversation:
| Recency | Representation | Notes |
|---|---|---|
| Just happened | Text tokens | Maximum clarity |
| One hour ago | High-resolution images (“Large”) | About 10× smaller, minimal loss |
| One week ago | Downsampled images (“Base”/“Small”) | Slight blur, cheaper |
| One year ago | Low-resolution images (“Tiny”) | Faint traces remain |
This yields a human-like forgetting curve. Recent memories stay sharp; distant ones fade naturally; cost remains low.
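A toy policy makes the tiering concrete. The age thresholds and per-tier token budgets below are illustrative assumptions chosen to echo the table, not values from the paper; only the mode names and the ordering “older = coarser = cheaper” come from the idea above.

```python
# Toy "forgetting curve" policy: older conversation spans get coarser, cheaper representations.
# Thresholds and token budgets are assumptions for illustration.
from dataclasses import dataclass

@dataclass
class Tier:
    name: str              # how this span of history is stored
    tokens_per_page: int   # assumed budget per page-equivalent of history

TIERS = [
    (1.0,          Tier("text tokens", 1000)),       # < 1 hour old: keep verbatim text
    (24.0,         Tier("Large image", 400)),         # < 1 day: high-resolution render
    (24.0 * 7,     Tier("Base/Small image", 150)),    # < 1 week: downsampled render
    (float("inf"), Tier("Tiny image", 64)),           # older: faint traces only
]

def tier_for(age_hours: float) -> Tier:
    """Pick the cheapest representation this age still deserves."""
    for max_age, tier in TIERS:
        if age_hours < max_age:
            return tier
    return TIERS[-1][1]

for age in (0.2, 5, 24 * 3, 24 * 400):
    t = tier_for(age)
    print(f"{age:>8.1f} h ago -> {t.name:<16} (~{t.tokens_per_page} tokens/page)")
```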
It points toward a possible path to theoretically unlimited context, dynamically balancing information retention against resource limits.
AI would not need an infinite, costly store. It could maintain a biological-like hierarchy that flows and prioritises information.
The False Problem of Dimensional Limits
Despite its name, DeepSeek-OCR is more than OCR. Its goal is to re-encode one-dimensional language sequences as two-dimensional visual maps.
This is not only a technical shift; it is a reframe of information itself.
The long-context bottleneck, treated for years as a hard limit, may be a dimensional trap. When we step into 2D, the path looks different.
The answer may be in what humans rely on most: vision.