The compression company

|C(concat(X, Y))| ≤ |C(X)| + |C(Y)| + O(1)

We are chasing infinite context windows but hitting a wall. Even in 1M+ token models, performance degrades once you fill the window.

01 — The Observation

The Illusion of Infinite Memory

Even with 1M+ token windows, long-context models degrade once you fill a large fraction of the window. The middle is a dead zone; information survives best at the edges.

Key Signals

  • "Lost in the Middle": Liu et al. show long-context LLMs peak at the start and end; accuracy drops sharply in the middle of the window.
  • Industry Warnings: Vendors of 1M+ models still recommend retrieval and chunking. Infinite windows are theoretical, not operational.
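
The effect is easy to probe yourself. Below is a minimal sketch of a needle-at-depth test in the spirit of Liu et al.; query_model is a hypothetical stand-in for whatever LLM call you have, and the filler, needle, and depth grid are illustrative choices, not a fixed benchmark.

FILLER = "The sky was grey and nothing of note happened. "
NEEDLE = "The vault code is 4817."
QUESTION = "What is the vault code?"

def build_prompt(depth: float, n_filler: int = 2000) -> str:
    """Insert the needle at a relative depth in [0, 1] of the context."""
    sentences = [FILLER] * n_filler
    sentences.insert(int(depth * n_filler), NEEDLE)
    return "".join(sentences) + "\n" + QUESTION

def probe(query_model) -> dict:
    """Sweep needle depth; expect the accuracy dip near depth 0.5."""
    return {d: "4817" in query_model(build_prompt(d))
            for d in (0.0, 0.25, 0.5, 0.75, 1.0)}
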
02 — The Principle

Compression as Alignment

A shorter program that explains data is a better hypothesis. Compression gives us an objective signal for model quality and truth-seeking.

We build systems that minimize entropic footprint instead of maximizing token count, so the representation stays causal and legible.

Key Signals

  • Kolmogorov Prior: Prefer hypotheses with minimal description length; this keeps models grounded in the simplest world model that fits the evidence.
  • Operational Simplicity: Compression pressure discourages prompt sprawl and rewards structure, hierarchy, and retrieval that actually matters.
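
The inequality at the top of this page is checkable with any off-the-shelf compressor. A minimal sketch, assuming zlib as a crude, computable proxy for description length (true Kolmogorov complexity is uncomputable):

import zlib

def C(data: bytes) -> int:
    """Compressed length: a computable stand-in for description length."""
    return len(zlib.compress(data, 9))

X = ("Performance degrades past half the window. " * 50).encode()
Y = ("Performance degrades past half the window, badly. " * 50).encode()

# Subadditivity: the concatenation never costs much more than the parts,
# and when X and Y share structure it costs far less.
assert C(X + Y) <= C(X) + C(Y) + 64   # small slack for stream headers
print(C(X), C(Y), C(X + Y))

On related inputs, C(X + Y) lands well under C(X) + C(Y): shared structure is exactly what a good world model should exploit.
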
03 — The Frictions

5 Drivers of Decay

Why long-context performance collapses as you fill the window.

Positional Bias

Accuracy peaks at the start and end of the window. Middle tokens are forgotten; adding more context can hurt.

Capacity Dilution

Same parameters, 100× more tokens. Facts compress into overlapping features; interference rises as parameters saturate.

Noise & Distractors

Long prompts carry duplication and off-topic drift. Attention spreads over junk, hurting multi-hop reasoning.
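
The dilution is mechanical, not mystical. Hold one relevant token's attention score fixed and grow the distractor count; softmax arithmetic alone starves the signal. A toy calculation (the score values are illustrative):

import numpy as np

def needle_mass(n_distractors: int, needle_score: float = 3.0) -> float:
    """Softmax weight on the one relevant token among uniform noise."""
    scores = np.zeros(n_distractors + 1)
    scores[0] = needle_score
    w = np.exp(scores - scores.max())
    return float(w[0] / w.sum())

for n in (10, 1_000, 100_000, 1_000_000):
    print(f"{n:>9} distractors -> needle weight {needle_mass(n):.6f}")
# With a fixed score gap, the needle's share decays roughly as 1/N:
# more context mechanically starves the signal of attention mass.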

Lack of Hierarchical Summarization

Flat attention has no built-in ladder: no “summarize pages 1–10, then forget the raw tokens” step. Its lossy, implicit summaries drop what matters.

Training Cutoff

Context lengths near the limit are long-tail in training, so they are under-trained; degradation sets in sooner than advertised. Infinite windows remain a marketing fiction.

"After some fraction of the window, adding more tokens often degrades performance on non-trivial tasks rather than helping."

04 — The Solution

Structured World Models

We don't feed the raw stream to the model. We build models that compress observations into a structured latent space.

Flat Attention

[Token 1] → [Token 2] → … → [Token 1M]

Attention(Q, K, V) is O(N²)

Noise accumulates.

Reasoning fails.
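
The quadratic term is concrete. Materialized naively in fp16, the score matrix alone is N² entries per head per layer; real kernels avoid storing it, but compute still scales the same way.

def score_matrix_tb(n_tokens: int, bytes_per_entry: int = 2) -> float:
    """fp16 attention scores for one head in one layer, in terabytes."""
    return n_tokens * n_tokens * bytes_per_entry / 1e12

for n in (8_192, 128_000, 1_000_000):
    print(f"N = {n:>9}: {score_matrix_tb(n):.4f} TB of scores")
# N = 1M -> 2 TB for a single head in a single layer, before values,
# gradients, or the other heads.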

Hierarchical Compression

Raw Data → [Compressor] → Latent Node A

Raw Data → [Compressor] → Latent Node B

Reasoning(Node A, Node B)

Context becomes effectively infinite via structure.
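
A minimal sketch of the pattern, assuming a generic summarize step; any abstractive summarizer or learned encoder could fill that role, and the names here are illustrative, not a fixed API:

from dataclasses import dataclass

@dataclass
class LatentNode:
    """A compressed stand-in for a span of the raw stream."""
    summary: str        # lossy, structured digest of the span
    source_span: tuple  # offsets back into the raw stream

def compress_stream(tokens: list, chunk: int, summarize) -> list:
    """Fold raw tokens into latent nodes; the raw stream can be dropped."""
    return [LatentNode(summarize(tokens[i:i + chunk]),
                       (i, min(i + chunk, len(tokens))))
            for i in range(0, len(tokens), chunk)]

def reason(nodes: list, answer_over) -> str:
    """Reasoning runs over the small node set, not the million raw tokens."""
    return answer_over([n.summary for n in nodes])

Each node costs O(1) attention; depth in the hierarchy, not width of the window, bounds what the system can hold.
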
