TL;DR
- Long context is not only about longer sequences. Compress text into vision to cut token costs.
- DeepSeek-OCR demonstrates Context Optical Compression: at compression ratios under 10×, the model decodes text with ~97% accuracy; at around 20×, accuracy drops to ~60%.
- DeepEncoder uses windowed attention, a 16× convolutional compressor, then global attention to keep visual tokens small while retaining detail.
- This enables a “forgetting curve”: recent messages stay as text; older ones become lower resolution images. Cost scales gently over time.
- Practical takeaway: treat documents as 2D maps rather than only 1D streams when pushing for longer, cheaper context.
Human memory is a luxurious compromise. We vividly remember what happened minutes ago, yet allow the details from weeks past to fade away. This “gradual forgetting” is not a flaw; it is a vital mechanism that keeps our minds functional.
We often push large language models (LLMs) in the opposite direction, expecting “perfect memory” over infinite context. That collides with physical limits: as sequence length grows, computation rises quadratically.
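For a sense of scale, here is a back-of-the-envelope sketch of that quadratic growth. The hidden size and sequence lengths are illustrative assumptions, not figures from the paper.

```python
# Back-of-the-envelope: self-attention FLOPs grow quadratically with sequence length.
# The hidden size (d_model = 4096) and lengths are illustrative assumptions only.

def attention_flops(seq_len: int, d_model: int = 4096) -> int:
    # QK^T and the attention-weighted sum over V each cost ~seq_len^2 * d_model multiply-adds.
    return 2 * (seq_len ** 2) * d_model

for n in (4_096, 32_768, 131_072):
    print(f"{n:>7} tokens -> {attention_flops(n):.2e} attention FLOPs per layer")

# Going from 4k to 128k tokens (32x more text) costs roughly 1,000x more attention compute.
```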
Maybe the question is wrong. What if the bottleneck is not memory size but the assumption that models must read to remember? DeepSeek-AI proposes Context Optical Compression (COC), challenging our bias toward text-only thinking.
Escaping the Chains of Sequence: When “Reading” Becomes “Compression”
We treat language as a one-dimensional stream of tokens. Humans, however, often perceive documents as two-dimensional layouts.
- DeepSeek-AI’s question: can vision serve as an efficient medium to compress text context?
- Example: an A4 page might be ~1,000 text tokens, yet ~100 visual tokens can capture its information at similar fidelity.
- DeepSeek-OCR demonstrates this. At compression ratios under 10×, decoding accuracy is ~97%; at ~20×, it falls to ~60%.
- Conclusion: compact language models can learn to decode highly compressed visual representations.
- Implication: long context may be easier in 2D visual space than in 1D text space.
OCR is a first beachhead for this approach. It quantifies a compression–decompression mapping between vision and text and shows the path is practical.
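A minimal sketch of the token-budget arithmetic behind those numbers. The first visual-token count (~100 per page) comes from the example above; the second (50) is an assumed point chosen only to produce the stated ~20× ratio.

```python
# Token-budget arithmetic for one A4 page, using the article's rough figures.
# The accuracy strings echo the reported ~97% (<10x) and ~60% (~20x) results.

TEXT_TOKENS_PER_PAGE = 1_000  # rough figure for an A4 page of text

def compression_ratio(visual_tokens: int, text_tokens: int = TEXT_TOKENS_PER_PAGE) -> float:
    return text_tokens / visual_tokens

for visual_tokens, reported_accuracy in [(100, "~97%"), (50, "~60%")]:
    ratio = compression_ratio(visual_tokens)
    print(f"{visual_tokens} visual tokens -> {ratio:.0f}x compression, reported accuracy {reported_accuracy}")
```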
The Alchemy of Efficiency: How DeepEncoder “Sees More, Eats Less”
This idea needs a different engine. Traditional VLM encoders struggle with high-resolution documents: dual-tower and tiling designs are hard to deploy and inflate the visual token count, while other designs produce large activations that overload GPU memory.
The heart of DeepSeek-OCR is DeepEncoder, a clever architectural innovation that balances high resolution with low computational cost through a two-step serial design:
| Stage | What it does |
|---|---|
| Perception (window attention) | High-resolution input is handled with local windows that keep activation memory low. |
| Compression (convolutional compressor) | Thousands of patch tokens (for example, 4,096 from 1024×1024) are reduced to 256 via a 16× conv compressor. |
| Cognition (global attention) | The 256 tokens feed a global attention module (based on CLIP-large) for understanding, now computationally manageable. |
Separating perception from cognition is the key. Local attention processes pixels; the compressor keeps only the essence; global attention reasons over a compact set.
This architecture allows the model to handle high-resolution input efficiently: low activation, minimal visual tokens, and strong performance.
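To make the three stages concrete, here is a shape-level sketch in PyTorch. Only the token arithmetic (1024×1024 input → 4,096 patch tokens → 256 compressed tokens) follows the design described above; the layers themselves are generic stand-ins, not the released DeepEncoder, which pairs a windowed-attention vision backbone with CLIP-large for global attention.

```python
# Shape-level sketch of DeepEncoder's serial design: local attention -> 16x token
# compression -> global attention. Layer choices here are illustrative stand-ins.
import torch
import torch.nn as nn


def window_partition(x: torch.Tensor, win: int) -> torch.Tensor:
    """(B, H, W, C) -> (B * num_windows, win*win, C): attention stays local and cheap."""
    b, h, w, c = x.shape
    x = x.view(b, h // win, win, w // win, win, c)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, win * win, c)


def window_merge(x: torch.Tensor, win: int, h: int, w: int) -> torch.Tensor:
    """Inverse of window_partition: (B * num_windows, win*win, C) -> (B, H, W, C)."""
    c = x.shape[-1]
    x = x.view(-1, h // win, w // win, win, win, c)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, h, w, c)


class DeepEncoderSketch(nn.Module):
    def __init__(self, dim: int = 768, patch: int = 16, win: int = 8):
        super().__init__()
        self.win = win
        # 1) Perception: patchify, then attention restricted to local windows.
        self.patchify = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)  # 1024/16 = 64 -> 4,096 tokens
        self.local_attn = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        # 2) Compression: 16x fewer tokens = 4x downsampling per spatial side (64 -> 16).
        self.compressor = nn.Conv2d(dim, dim, kernel_size=4, stride=4)       # 4,096 -> 256 tokens
        # 3) Cognition: global attention is now affordable over just 256 tokens.
        self.global_attn = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        x = self.patchify(img).permute(0, 2, 3, 1)           # (B, 64, 64, dim)
        b, h, w, _ = x.shape
        x = self.local_attn(window_partition(x, self.win))   # attention over 64-token windows only
        x = window_merge(x, self.win, h, w).permute(0, 3, 1, 2)
        x = self.compressor(x)                                # (B, dim, 16, 16)
        x = x.flatten(2).transpose(1, 2)                      # (B, 256, dim)
        return self.global_attn(x)                            # global reasoning over 256 visual tokens


tokens = DeepEncoderSketch()(torch.randn(1, 3, 1024, 1024))
print(tokens.shape)  # torch.Size([1, 256, 768])
```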
On OmniDocBench, DeepSeek-OCR used fewer than 800 visual tokens and outperformed MinerU2.0, which needed nearly 7,000. That is a practical efficiency win.
The Ultimate Vision: From “Optical Compression” to “Forgetting Mechanism”
DeepSeek-OCR’s value goes beyond document processing. It hints at an architecture that simulates memory and forgetting.
The most thought-provoking part of the research is Figure 13, which aligns three axes:
| Axis | Crystal clear | Almost gone |
|---|---|---|
| Human memory | Just happened | After one year |
| Human vision | 10 cm | 20 metres |
| Text compression | Text tokens | “Tiny” visual mode |
These three curves run in parallel. DeepSeek-AI’s hypothesis: Context Optical Compression can mimic biological forgetting at very low cost.
Imagine a future LLM handling a months-long conversation:
| Recency | Representation | Notes |
|---|---|---|
| Just happened | Text tokens | Maximum clarity |
| One hour ago | High-resolution images (“Large”) | About 10× smaller, minimal loss |
| One week ago | Downsampled images (“Base”/“Small”) | Slight blur, cheaper |
| One year ago | Low-resolution images (“Tiny”) | Faint traces remain |
This yields a human-like forgetting curve. Recent memories stay sharp; distant ones fade naturally; cost remains low.
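A toy policy makes the tiering concrete. The age thresholds and per-tier token budgets below are illustrative assumptions chosen to echo the table, not values from the paper; only the mode names and the ordering “older = coarser = cheaper” come from the idea above.

```python
# Toy "forgetting curve" policy: older conversation spans get coarser, cheaper representations.
# Thresholds and token budgets are assumptions for illustration.
from dataclasses import dataclass

@dataclass
class Tier:
    name: str              # how this span of history is stored
    tokens_per_page: int   # assumed budget per page-equivalent of history

TIERS = [
    (1.0,          Tier("text tokens", 1000)),       # < 1 hour old: keep verbatim text
    (24.0,         Tier("Large image", 400)),         # < 1 day: high-resolution render
    (24.0 * 7,     Tier("Base/Small image", 150)),    # < 1 week: downsampled render
    (float("inf"), Tier("Tiny image", 64)),           # older: faint traces only
]

def tier_for(age_hours: float) -> Tier:
    """Pick the cheapest representation this age still deserves."""
    for max_age, tier in TIERS:
        if age_hours < max_age:
            return tier
    return TIERS[-1][1]

for age in (0.2, 5, 24 * 3, 24 * 400):
    t = tier_for(age)
    print(f"{age:>8.1f} h ago -> {t.name:<16} (~{t.tokens_per_page} tokens/page)")
```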
It points toward a possible path to theoretically unlimited context, dynamically balancing information retention against resource limits.
AI would not need an infinite, costly store. It could maintain a biological-like hierarchy that flows and prioritises information.
The False Problem of Dimensional Limits
Despite its name, DeepSeek-OCR is more than OCR. Its goal is to re-encode one-dimensional language sequences as two-dimensional visual maps.
This is not only a technical shift; it is a reframe of information itself.
The long-context bottleneck, treated for years as a hard limit, may be a dimensional trap. When we step into 2D, the path looks different.
The answer may be in what humans rely on most: vision.