The Illusion of Infinite Memory
Even with 1M+ token windows, long-context models degrade once you fill a large fraction of the window. The middle is a dead zone; information survives best at the edges.
Key Signals
- "Lost in the Middle": Liu et al. show long-context LLMs peak at the start and end; accuracy drops sharply in the middle of the window.
- Industry Warnings: Vendors of 1M+ models still recommend retrieval and chunking. Infinite windows are theoretical, not operational.
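A minimal sketch of that kind of probe, not the paper's exact harness: place one known fact (a "needle") at different depths of a long prompt and check recall at each position. `call_model`, `build_prompt`, and `position_sweep` are hypothetical names; `call_model` stands in for whatever completion API you use.

```python
# Minimal "lost in the middle" probe. `call_model` is a hypothetical stand-in
# for a completion API (prompt string in, completion string out).

def build_prompt(needle: str, filler_docs: list[str], depth: float) -> str:
    """Insert the needle at a relative depth in the context (0.0 = start, 1.0 = end)."""
    docs = list(filler_docs)
    docs.insert(int(depth * len(docs)), needle)
    return "\n\n".join(docs) + "\n\nQuestion: What is the secret code?"

def position_sweep(call_model, needle: str, answer: str,
                   filler_docs: list[str], steps: int = 11) -> dict[float, bool]:
    """Check whether the answer is recovered when the needle sits at each depth."""
    results = {}
    for i in range(steps):
        depth = i / (steps - 1)
        reply = call_model(build_prompt(needle, filler_docs, depth))
        results[round(depth, 2)] = answer.lower() in reply.lower()
    return results

# Typical pattern: recall holds near depths 0.0 and 1.0 and dips in the middle.
```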
Compression as Alignment
A shorter program that explains data is a better hypothesis. Compression gives us an objective signal for model quality and truth-seeking.
We build systems that minimize entropic footprint instead of maximizing token count, so the representation stays causal and legible.
Key Signals
- Kolmogorov Prior: Prefer hypotheses with minimal description length; this keeps models grounded in the simplest world model that fits the evidence (a toy scoring example follows this list).
- Operational Simplicity: Compression pressure discourages prompt sprawl and rewards structure, hierarchy, and retrieval that actually matters.
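As a toy illustration of the minimum-description-length idea, not a method from any cited work: score each candidate hypothesis by the bits needed to state it plus the bits needed to state what it fails to explain, and keep the cheapest total. The residual code below is a crude quantization chosen only for illustration, and `two_part_mdl` is an assumed helper name.

```python
import numpy as np

def two_part_mdl(n_params: int, residuals: np.ndarray,
                 bits_per_param: float = 32.0, resolution: float = 1e-3) -> float:
    """Two-part code length in bits: cost of stating the hypothesis plus the cost
    of stating what it fails to explain (crude quantized-residual code)."""
    model_bits = n_params * bits_per_param
    data_bits = float(np.sum(np.log2(np.abs(residuals) / resolution + 1.0) + 1.0))
    return model_bits + data_bits

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 200)
y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=x.size)   # data generated by a simple law

for degree in (0, 1, 5):
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    print(degree, round(two_part_mdl(coeffs.size, residuals), 1))

# The degree-1 hypothesis minimizes total description length: degree 0 pays heavily
# for unexplained residuals, degree 5 pays for parameters it does not need.
```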
5 Drivers of Decay
Why long-context performance collapses as you fill the window.
Positional Bias
Accuracy peaks at the start and end of the window. Middle tokens are forgotten; adding more context can hurt.
Capacity Dilution
Same parameters, 100× more tokens. Facts compress into overlapping features; interference rises as parameters saturate.
Noise & Distractors
Long prompts carry duplication and off-topic drift. Attention spreads over junk, hurting multi-hop reasoning.
Lack of Hierarchical Summarization
Flat attention has no built-in ladder: no “summarize pages 1–10 then forget raw tokens”. Lossy implicit summaries drop what matters.
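One way to make that ladder explicit, sketched below under assumptions: `summarize` is a hypothetical LLM call (text in, shorter text out), and `ladder` and the toy stand-in summarizer are illustrative names, not a fixed API.

```python
# Sketch of an explicit summarization ladder: compress groups of chunks, then
# compress the summaries, keeping only the top rung in context.

from typing import Callable

def ladder(chunks: list[str], summarize: Callable[[str], str], fanout: int = 4) -> str:
    """Recursively fold chunks into one summary, dropping raw tokens at each rung."""
    level = chunks
    while len(level) > 1:
        level = [summarize("\n\n".join(level[i:i + fanout]))
                 for i in range(0, len(level), fanout)]
    return level[0]

def toy_summarize(text: str) -> str:
    """Stand-in summarizer: keep each chunk's leading sentence, drop the rest."""
    return " | ".join(part.split(". ")[0] for part in text.split("\n\n"))

pages = [f"Page {i}: key fact {i}. Filler prose that should not survive." for i in range(1, 11)]
print(ladder(pages, toy_summarize))   # the key facts survive; the filler does not
```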
Training Coverage
Very long contexts sit in the training distribution's long tail and are under-trained, so degradation appears sooner than the advertised window suggests. Infinite windows remain a marketing fiction.
"After some fraction of the window, adding more tokens often degrades performance on non-trivial tasks rather than helping."
Structured World Models
We don't feed the raw stream. We build models that compress observations into a structured latent space.
Flat Attention
[Token 1] → [Token 2] → … → [Token 1M]
Attention(Q, K, V) is O(N²)
Noise accumulates.
Reasoning fails.
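For reference, a minimal single-head scaled dot-product attention in NumPy; this is an illustration of where the quadratic cost lives, not any production kernel. The N × N score matrix is the O(N²) term.

```python
import numpy as np

def attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Single-head scaled dot-product attention. The score matrix is N x N,
    so compute and memory grow quadratically with sequence length N."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                        # (N, N) <- the O(N^2) term
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ V                                   # (N, d)

N, d = 1024, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
print(attention(Q, K, V).shape)   # (1024, 64); the hidden score matrix already holds ~1M entries
```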
Hierarchical Compression
Raw Data → [Compressor] → Latent Node A
Raw Data → [Compressor] → Latent Node B
Reasoning(Node A, Node B)
Context becomes effectively unbounded via structure.
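A sketch of that flow under assumptions: `compress` and `reason` are hypothetical stand-ins for a summarization/embedding step and an LLM call, and `LatentNode`, `build_nodes`, and `answer` are illustrative names. Only the bounded latent nodes ever enter the reasoning context, so context cost scales with the number of nodes rather than the length of the raw stream.

```python
# Sketch of the compress-then-reason pipeline above.

from dataclasses import dataclass
from typing import Callable

@dataclass
class LatentNode:
    source: str         # where the raw stream came from
    summary: str        # compressed, legible representation
    token_budget: int   # explicit cap on what this node costs in context

def build_nodes(streams: dict[str, str],
                compress: Callable[[str, int], str],
                token_budget: int = 200) -> list[LatentNode]:
    """Compress each raw stream into a bounded latent node; raw tokens are discarded."""
    return [LatentNode(name, compress(raw, token_budget), token_budget)
            for name, raw in streams.items()]

def answer(question: str, nodes: list[LatentNode], reason: Callable[[str], str]) -> str:
    """Reason over latent nodes only, never over the raw stream."""
    context = "\n".join(f"[{node.source}] {node.summary}" for node in nodes)
    return reason(f"{context}\n\nQuestion: {question}")
```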