Recursive Language Models

Field Notes: Investigating the "Infinite Context" Hypothesis

The industry claims we have solved the context window problem. They claim we can now feed millions of tokens—entire libraries of code or literature—into a model and get perfect recall.

The perfect-recall claim is likely overstated, and the mechanism behind the newest "long context" systems is not what you think. It is not "memory." It is a trick.

1. The Problem: Context Rot

Before examining the solution, we must audit the failure mode. Standard Large Language Models (LLMs) have a "Context Window." As you fill this window, performance does not hold steady. It degrades. This is "Context Rot."

In the simulation below, we run a standard "Needle in a Haystack" test. We hide a password inside a pile of random text.
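For readers who want to reproduce the setup offline, here is a minimal sketch of how such a haystack can be built. The filler vocabulary, the password `SWORDFISH-42`, and the function names `build_haystack` and `score` are all invented for illustration; the call to an actual model is deliberately left out.

```python
import random

def build_haystack(needle: str, n_filler: int, depth: float, seed: int = 0) -> str:
    """Return n_filler random filler words with `needle` inserted at a relative depth (0.0 to 1.0)."""
    rng = random.Random(seed)  # seeded so the haystack is reproducible
    vocab = ["alpha", "bravo", "charlie", "delta", "echo", "foxtrot"]  # hypothetical filler
    words = [rng.choice(vocab) for _ in range(n_filler)]
    words.insert(int(depth * n_filler), needle)
    return " ".join(words)

def score(model_answer: str, password: str) -> bool:
    """Retrieval counts as correct if the password appears in the model's answer."""
    return password in model_answer

# Hypothetical needle, hidden halfway through 1,000 filler words.
needle = "The password is SWORDFISH-42."
prompt = build_haystack(needle, n_filler=1000, depth=0.5)
question = "What is the password?"
```

Sweeping `n_filler` and `depth` and recording `score` for each run gives the accuracy-versus-length curve that the hypothesis below is about.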

Experiment A: Attention Decay

Hypothesis: As context length ($L$) increases, retrieval accuracy ($A$) approaches random chance.

[Interactive demo, initial state: Context Load 1,000 tokens; Signal Integrity 99.8%.]

Observation: At low context, attention is sharp.

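Why does this happen? Self-Attention compares every token with every other token, so its cost is quadratic ($O(N^2)$) in context length. That cost is a compute problem; the decay itself comes from the needle having to compete with every other token for a share of attention. The noise drowns out the signal. The model isn't "reading"; it is statistically guessing based on weighted associations.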
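A back-of-the-envelope way to see the signal drowning: a toy model of a single attention head, assuming one needle token with a fixed score and every other token scoring zero. Real models have many heads and learned scores, so treat this as an illustration of the dilution effect only. The function name and the score of 10.0 are arbitrary choices.

```python
import math

def needle_attention_weight(n_tokens: int, needle_logit: float) -> float:
    """Softmax weight on one needle token when the other n_tokens - 1 tokens all score 0."""
    needle = math.exp(needle_logit)
    distractors = (n_tokens - 1) * math.exp(0.0)  # each distractor contributes e^0 = 1
    return needle / (needle + distractors)

# The needle's score never changes; only the amount of surrounding noise does.
for n in (1_000, 10_000, 100_000, 1_000_000):
    weight = needle_attention_weight(n, needle_logit=10.0)
    print(f"{n:>9} tokens -> weight on needle: {weight:.4f}")
```

In this toy setting the needle's share of attention shrinks steadily as the context grows, even though nothing about the needle itself has changed.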

2. The Failed Fix: Compaction

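The industry's first reaction was "Compaction," usually implemented as summarization. (RAG is often lumped in here, but retrieval selects passages rather than compressing them; the argument below is about summarization.) If the book is too long, write a summary. If the summary is too long, summarize the summary.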

This creates a lossy compression artifact. You are trading resolution for length.

Experiment B: Semantic Entropy

Method: Recursive summarization of a specific narrative detail.

Generation 0 (original text):

"The suspect was wearing a vintage 1980s red Casio digital watch with a scratched screen."

Compaction works for "the gist." It fails catastrophically for Code Auditing, where a single line of code (the needle) might crash the system.
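The loss compounds with every generation. The sketch below makes that visible with a deliberately crude stand-in for a summarizer: it simply keeps the first 60% of the words. A real model would paraphrase rather than truncate, so this is an assumption-laden toy, not a measurement; `crude_summarize` and `keep_percent` are invented names.

```python
def crude_summarize(text: str, keep_percent: int = 60) -> str:
    """Stand-in for a model summarizer: keep only the first keep_percent of the words."""
    words = text.split()
    keep = max(1, len(words) * keep_percent // 100)  # integer math, always keep one word
    return " ".join(words[:keep])

original = ("The suspect was wearing a vintage 1980s red Casio "
            "digital watch with a scratched screen.")

# Summarize the summary, three times over.
generations = [original]
for _ in range(3):
    generations.append(crude_summarize(generations[-1]))

for i, text in enumerate(generations):
    print(f"Generation {i}: {text}")
```

Under this toy rule the brand name survives the first pass and is gone by the second. The gist ("a suspect was wearing something") lasts longest; the detail an auditor needs goes first.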

3. The "Recursive" Workaround

Researchers now propose Recursive Language Models (RLMs). The marketing says "Infinite Context."

The reality? It's Scaffolding. They treat the prompt not as tokens to be ingested, but as an external file system to be queried via code.

Cost = Steps × (Input + Compute)

Instead of reading the book, the model writes a Python script to `grep` the book. If `grep` returns too many results, it writes another script to filter those results. It recurses.
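A minimal sketch of that narrowing loop, under loud assumptions: the patterns are hard-coded here, whereas in an RLM the model would write each next filter itself; the log lines are made-up sample data; and a real 10GB file would be streamed from disk rather than held in a list. `grep` below is a plain Python helper, not the shell utility.

```python
import re
from typing import Iterable, List, Tuple

def grep(lines: Iterable[str], pattern: str) -> List[str]:
    """Return the lines that match the regular expression."""
    rx = re.compile(pattern)
    return [line for line in lines if rx.search(line)]

def recursive_filter(lines: Iterable[str], patterns: List[str],
                     max_hits: int) -> Tuple[List[str], int]:
    """Apply one filter per step until the hits fit in max_hits or patterns run out."""
    hits = list(lines)
    steps = 0
    for pattern in patterns:
        if len(hits) <= max_hits:
            break  # small enough to hand to the model directly
        hits = grep(hits, pattern)
        steps += 1
    return hits, steps

# Made-up sample log for illustration only.
log = [
    "2024-01-01 00:00:01 INFO  boot ok",
    "2024-01-01 00:00:02 ERROR db timeout",
    "2024-01-01 00:00:03 ERROR db timeout",
    "2024-01-01 00:00:04 ERROR disk full on /var",
    "2024-01-01 00:00:05 WARN  retrying",
]
hits, steps = recursive_filter(log, [r"ERROR", r"disk"], max_hits=1)
print(steps, hits)
```

Every pass through the loop is one more "Step" in the cost equation above, which is where the latency comes from.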

Let's test this. You act as the Supervisor. We have a massive 10GB log file. You cannot read it. You must direct the RLM to find the error.

Experiment C: The RLM Environment

[Interactive terminal demo, initial state: Python REPL environment; context 'server_logs.txt' (10GB) mounted as external drive; awaiting prompt.]

The Skeptical Takeaway: Notice the delay? Notice the "Steps" in the equation above? The RLM isn't smarter. It's just more persistent.

If the task is simple (RegEx), it works. If the task requires understanding the relationship between two distant lines, the model often enters a "trajectory loop"—writing code, failing, rewriting, failing—until it hits a timeout.
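The practical defense against a trajectory loop is a hard step budget. Here is a sketch of one, with the cost equation from earlier applied to whatever steps were actually spent. The `attempt` callback stands in for "the model writes and runs one script"; it, the function names, and the cost units are hypothetical.

```python
from typing import Callable, Optional, Tuple

def run_with_budget(attempt: Callable[[int], Optional[str]], max_steps: int,
                    input_cost: float, compute_cost: float
                    ) -> Tuple[Optional[str], int, float]:
    """Call attempt(step) until it returns a result or the step budget is exhausted."""
    steps = 0
    result = None
    while steps < max_steps and result is None:
        result = attempt(steps)  # one write-run-inspect cycle
        steps += 1
    # Cost = Steps x (Input + Compute), in arbitrary units
    total_cost = steps * (input_cost + compute_cost)
    return result, steps, total_cost

def always_fails(step: int) -> Optional[str]:
    """Models a trajectory loop: every rewrite fails."""
    return None

def succeeds_on_third(step: int) -> Optional[str]:
    """Models a task that is solved on the third attempt."""
    return "found" if step == 2 else None

print(run_with_budget(succeeds_on_third, max_steps=10, input_cost=2.0, compute_cost=1.0))
print(run_with_budget(always_fails, max_steps=10, input_cost=2.0, compute_cost=1.0))
```

The looping case burns the whole budget and returns nothing. The budget does not make the model smarter; it only caps what persistence is allowed to cost.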

Conclusion: No Free Lunch

Recursive Language Models solve the memory problem by converting it into a compute problem.

Standard LLM: High Memory, Fast Inference, Low Accuracy at Scale.
RLM: Low Memory, High Latency, High Accuracy (if the code works).

We haven't solved intelligence. We've just given the model a file system and a Python interpreter. Proceed with caution.