[Paper] Recursive Language Models: Using code + recursion to make LLMs read 10M+ tokens without huge context windows
TL;DR: New CSAIL/MIT paper proposes Recursive Language Models (RLMs). Instead of scaling context windows forever, the LLM treats the prompt as data inside a REPL (e.g., Python) and generates code to slice, search, and recursively call itself on sub-chunks.
Results:
Handles 10+ million tokens (way beyond GPT-5’s native context)
Outperforms context-only baselines, summarization agents, retrieval agents, and code agents
Median query cost roughly on par with standard GPT-5 calls
The problem:
LLMs hit two walls on long tasks:
Hard context limit (finite window, even if large)
Context decay (performance drops inside the window as context grows)
For real workloads (large codebases, 100k+ docs, deep research), both are fatal.
The idea — Recursive Language Models (RLMs):
Instead of stuffing everything into the LLM at once, an RLM:
launches a REPL (usually Python)
loads the raw long text as a variable
asks the LLM to write code that:
peeks into the text programmatically (slicing, searching)
splits the task into subproblems
recursively spawns LLM sub-calls on demand
Basically:
It treats the prompt as a data structure, not as a single giant context blob.
To the user, it still looks like a normal LLM call (minimal sketch below).
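For intuition, here is a minimal sketch of what that root loop could look like, assuming a callable llm(prompt) -> str. The names and structure are my own illustration, not the paper's actual implementation:

```python
import io, contextlib

def rlm_answer(question: str, long_text: str, llm, max_steps: int = 20) -> str:
    """Answer `question` about `long_text` without ever putting the full text
    into the root model's context. `llm(prompt) -> str` is an assumed callable."""
    env = {"ctx": long_text, "llm": llm}   # REPL state: the prompt is data, not context
    transcript = []
    for _ in range(max_steps):
        step = llm(
            "You are in a Python REPL. The variable `ctx` holds a very long string.\n"
            "You may print slices of it, search it, or call llm(sub_prompt)\n"
            "recursively on chunks. Reply with either a Python snippet or\n"
            "'FINAL: <answer>'.\n"
            f"Question: {question}\n"
            "Transcript:\n" + "\n".join(transcript)
        )
        if step.startswith("FINAL:"):
            return step[len("FINAL:"):].strip()
        buf = io.StringIO()
        try:
            with contextlib.redirect_stdout(buf):
                exec(step, env)            # run the generated snippet in the REPL
            output = buf.getvalue()
        except Exception as e:
            output = f"Error: {e!r}"
        transcript.append(f">>> {step}\n{output[:2000]}")  # truncate long outputs
    return "No answer produced within the step budget."
```

The key point of the design is that long_text only ever reaches sub-calls in slices the generated code chooses; the root model sees just the code-and-output transcript.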
Benchmarks:
Evaluated on long-context + high-density tasks:
S-NIAH (needle-in-haystack)
BrowseComp-Plus (deep research)
OOLONG & OOLONG-Pairs (high-density reasoning)
LongBench-v2 CodeQA (huge codebases)
Models tested:
GPT-5
Qwen3-Coder-480B
Key results:
Scales to 10M+ input tokens (2-3 orders of magnitude beyond the native window size)
Large accuracy gains vs:
context-only
summarization agents
retrieval agents
code agents
Median cost ≈ a standard GPT-5 call
Tail costs are higher on some runs (variance from recursion)
Even a non-recursive RLM variant (REPL-only) beats direct context runs, but recursion gives the big boost on dense tasks.
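For intuition on why recursion helps on dense tasks, this is the kind of snippet the root model might emit inside the REPL: a chunked map-reduce where each chunk gets a fresh sub-call (illustrative only; `ctx` and `llm` as in the sketch above):

```python
# Illustrative snippet the root model might generate inside the REPL
# (assumes `ctx` and `llm` already exist in the environment).
chunk_size = 200_000  # characters per sub-call, chosen to fit a sub-model's window
chunks = [ctx[i:i + chunk_size] for i in range(0, len(ctx), chunk_size)]

# Map: each chunk is read by a recursive call with a clean, small context.
partials = [
    llm(f"Extract anything relevant to the question from this excerpt:\n{c}")
    for c in chunks
]

# Reduce: a final call combines only the distilled findings.
print(llm("Combine these findings into one answer:\n" + "\n---\n".join(partials)))
```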
Why this matters:
It shifts the scaling bottleneck from:
“How big can we make the context window?”
to
“How efficiently can we programmatically access the prompt?”
Which is much closer to how humans navigate large codebases, PDFs, docs, etc.
Big picture takeaways:
Long-context scaling doesn’t need infinite windows
LLMs can offload structure to external execution
Recursion lets them “selectively read” instead of “store everything”
Opens the door to real LLM-OS / agentic / Research-At-Scale systems
If this works out in practice, it’s arguably more transformative than just pushing context from 200k → 1M → 10M windows.