Single-pass LLM analysis is shallow. Ask a model to review a 200-turn agent trace and you get a surface-level summary: "the agent failed to resolve the billing dispute." You don't get why, you don't get quantitative patterns, and you don't get evidence you can act on.
The Recursive Language Models (RLM) paper from Zhang, Kraska, and Khattab introduces an elegant solution: give the LLM a Python REPL and let it write code to analyze the data.
We adapted this pattern to build Kayba's Recursive Reflector, a trace analysis system that produces deeper, evidence-backed insights than any single-pass approach.
The RLM Pattern
The core insight from the RLM paper is simple: instead of feeding massive inputs directly into the context window, load them into a Python REPL environment as variables. The LLM then writes code to examine, filter, and query the data programmatically.
The REPL provides two primitives:
- context: a string variable containing the full input data
- llm_query(): a function that recursively calls the LLM on smaller chunks
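A minimal sketch of that environment, assuming a helper like the hypothetical make_repl_env below (the paper defines only the two primitives; the model backend here is a stand-in so the example runs standalone):

```python
# Minimal sketch of the RLM REPL environment. `context` holds the full
# input; `llm_query` recursively calls the LLM on a smaller prompt.
# `make_repl_env` and the stubbed backend are illustrative, not the
# paper's implementation.

def make_repl_env(full_input: str, model_call=None):
    """Build the namespace exposed to the code-writing LLM."""
    def llm_query(prompt: str) -> str:
        # In a real system this dispatches to the model; here we stub it
        # so the sketch is self-contained.
        backend = model_call or (lambda p: f"[LLM answer to {len(p)}-char prompt]")
        return backend(prompt)

    return {"context": full_input, "llm_query": llm_query}

env = make_repl_env("turn 1: user asks about billing\nturn 2: agent replies")
# The model's generated code runs against this namespace, e.g.:
exec("first_line = context.splitlines()[0]", env)
print(env["first_line"])  # → turn 1: user asks about billing
```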
This turns the model from a passive reader into an active investigator. Instead of trying to hold everything in attention at once, it can:
# Filter 1000+ traces to just the failures
failures = [t for t in traces if t["outcome"] == "failure"]
# Count error types
from collections import Counter
error_types = Counter(t["error_type"] for t in failures)
print(error_types.most_common(5))
The numbers are striking. On BrowseComp-Plus (multi-document QA over 1K documents), base GPT-5 scores 0% because the input simply doesn't fit. RLM-equipped GPT-5 scores 91.33%. On OOLONG-Pairs, base models achieve < 0.1%; RLMs reach 23–58%.
The key finding: the REPL environment alone enables context scaling. Recursive sub-calls via llm_query() then add 10–59% gains on information-dense tasks.
Emergent Behaviors
The paper documents four patterns that RLMs develop spontaneously, with no explicit instruction required:
1. Intelligent Filtering. The model uses regex and keyword queries to narrow the search space before doing expensive semantic analysis. It doesn't read everything; it finds what matters first.
2. Recursive Decomposition. For large inputs, the model chunks by newline or section, delegates semantic reasoning to llm_query() sub-calls, and aggregates results programmatically.
3. Answer Verification. Instead of trusting its first analysis, the model generates verification code that cross-references claims against the raw data to catch its own errors.
4. Variable Stitching. For outputs that exceed the model's generation limit, it constructs the final answer incrementally across multiple sub-calls, stored in Python variables.
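Pattern 2 is the workhorse. A sketch of it, with llm_query stubbed so the example runs standalone (a real sub-call would invoke the model):

```python
# Sketch of recursive decomposition: chunk a large input by blank-line
# sections, delegate each chunk to a (stubbed) llm_query sub-call, then
# aggregate the per-chunk answers programmatically.

def llm_query(prompt: str) -> str:
    # Stand-in for a real recursive LLM call: reports prompt size.
    return f"summary of {len(prompt)} chars"

def decompose_and_aggregate(big_input: str, question: str) -> list[str]:
    chunks = [c for c in big_input.split("\n\n") if c.strip()]
    return [llm_query(f"{question}\n\n{chunk}") for chunk in chunks]

doc = "section one text\n\nsection two text\n\nsection three text"
answers = decompose_and_aggregate(doc, "Summarize this section.")
print(len(answers))  # → 3
```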
These behaviors emerge because code execution gives the LLM capabilities that text-based reasoning can't match: precise filtering, exact counting, structured comparison, and iterative refinement.
From Long Context to Trace Analysis
The RLM paper targets general long-context tasks. We saw a direct application for agent improvement: analyzing execution traces.
A production agent generates traces with hundreds of turns, tool calls, API responses, and decision points. A single-pass reflector reads the trace and produces a qualitative summary. A recursive reflector writes code to investigate the trace:
- How many turns did the agent spend on the wrong approach before correcting?
- Did the agent contradict itself between turn 12 and turn 47?
- Which skills from the current library were applied vs. ignored?
- Across 50 traces, what are the top 5 failure modes by frequency?
These questions require counting, filtering, and cross-referencing: exactly the operations code execution enables.
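The first of those questions, as code. The trace schema here (a list of turn dicts with an "approach" field) is a hypothetical stand-in for whatever structure a real trace store provides:

```python
# Sketch: "How many turns did the agent spend on the wrong approach
# before correcting?" The turn schema is illustrative.

turns = [
    {"n": 1, "approach": "refund_via_api"},
    {"n": 2, "approach": "refund_via_api"},
    {"n": 3, "approach": "refund_via_api"},
    {"n": 4, "approach": "escalate_to_human"},  # correction point
]

# Count turns that used anything other than the final approach.
final = turns[-1]["approach"]
wasted = sum(1 for t in turns if t["approach"] != final)
print(wasted)  # → 3
```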
How Kayba's Recursive Reflector Works
Our implementation adapts the RLM pattern specifically for agent trace analysis:
Sandbox, not context. The prompt receives only metadata (trace count, token sizes, section names). The full trace data is injected into a sandboxed Python namespace as structured objects. This keeps the prompt small and the LLM focused on what to analyze, not drowning in raw data.
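The split looks roughly like this; the variable names and metadata fields are illustrative, not Kayba's actual API:

```python
# Sketch of the "sandbox, not context" split: the prompt sees only
# metadata, while the full traces live in a Python namespace that the
# generated analysis code runs against.

traces = [
    {"id": "t1", "turns": 42, "outcome": "failure"},
    {"id": "t2", "turns": 7, "outcome": "success"},
]

# What goes into the LLM prompt: a small metadata summary.
prompt_metadata = {
    "trace_count": len(traces),
    "total_turns": sum(t["turns"] for t in traces),
    "sections": ["turns", "tool_calls", "outcomes"],
}

# What goes into the sandbox: the full data, as variables.
sandbox = {"traces": traces}
exec("failed = [t['id'] for t in traces if t['outcome'] == 'failure']", sandbox)
print(prompt_metadata["trace_count"], sandbox["failed"])  # → 2 ['t1']
```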
Structured access. A TraceContext object provides typed access to traces, turns, tool calls, and outcomes. The LLM explores the data through code rather than scanning text:
# Find the most common errors across all traces
from collections import Counter
errors = Counter(t.error_message for t in traces.all() if t.failed)
for error, count in errors.most_common(3):
    print(f"{count}x: {error}")
# Output:
# 14x: Failed to compile: missing semicolon
# 9x: API timeout after 30s
# 6x: Schema validation error on field 'email'
Bounded execution. Configurable timeout (30s default) and a hard cap of 20 llm_query() calls prevent runaway costs. The RLM paper notes that cost variance is the main risk. Our bounds keep it predictable.
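The call cap can be sketched as a counting wrapper around llm_query; the 20-call default matches the text, but the wrapper itself (and its exception type) is illustrative, and the timeout side is omitted for brevity:

```python
# Sketch of bounded execution: cap the number of llm_query sub-calls so
# a runaway analysis fails fast instead of burning budget.

class QueryBudgetExceeded(RuntimeError):
    pass

def make_bounded_llm_query(backend, max_calls=20):
    calls = {"n": 0}
    def llm_query(prompt: str) -> str:
        if calls["n"] >= max_calls:
            raise QueryBudgetExceeded(f"llm_query cap of {max_calls} reached")
        calls["n"] += 1
        return backend(prompt)
    return llm_query

llm_query = make_bounded_llm_query(lambda p: "ok", max_calls=2)
print(llm_query("a"), llm_query("b"))  # → ok ok
try:
    llm_query("c")
except QueryBudgetExceeded as e:
    print(e)  # → llm_query cap of 2 reached
```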
REPL loop. The reflector runs in iterations. Each iteration: the LLM reads prior outputs, generates analysis code, executes it, and decides whether to continue or synthesize. Most analyses converge in 2–3 iterations.
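The loop's shape, with the LLM step and the sandbox executor stubbed as plain functions (both are illustrative stand-ins, not the real interfaces):

```python
# Sketch of the iterate-until-synthesis loop: each iteration the model
# sees prior outputs, emits analysis code, and decides whether to stop.

def reflect(generate_code, execute, max_iterations=5):
    outputs = []
    for _ in range(max_iterations):
        code, done = generate_code(outputs)   # LLM step (stubbed)
        outputs.append(execute(code))         # run in the sandbox
        if done:
            break
    return outputs

# Stub planner: stops after two iterations, as most analyses converge.
def planner(prior):
    return f"step_{len(prior)}", len(prior) >= 1

result = reflect(planner, execute=lambda code: f"ran {code}")
print(result)  # → ['ran step_0', 'ran step_1']
```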
Three modes. ReflectorMode.SIMPLE for fast single-pass analysis, ReflectorMode.RECURSIVE for deep investigation, and ReflectorMode.AUTO which routes based on trace complexity.
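AUTO routing might look like the following; the mode names come from the text, but the complexity measure and threshold are guesses for illustration:

```python
# Sketch of AUTO routing on trace complexity. Total turn count stands in
# for whatever complexity signal the real router uses.
from enum import Enum

class ReflectorMode(Enum):
    SIMPLE = "simple"
    RECURSIVE = "recursive"
    AUTO = "auto"

def resolve_mode(mode, traces, turn_threshold=100):
    if mode is not ReflectorMode.AUTO:
        return mode
    total_turns = sum(len(t["turns"]) for t in traces)
    return ReflectorMode.RECURSIVE if total_turns > turn_threshold else ReflectorMode.SIMPLE

small = [{"turns": list(range(10))}]
print(resolve_mode(ReflectorMode.AUTO, small))  # → ReflectorMode.SIMPLE
```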
What Changes With Code Execution
The difference between single-pass and recursive reflection is the difference between a summary and an investigation.
Single-pass reflection on a customer support trace:
The agent failed to process the return correctly.
Recursive reflection on the same trace:
The agent mentioned the return policy 4 times across turns 7–19 without acknowledging the customer's email confirmation as valid proof of purchase. 0 existing skills address digital proof of purchase. The agent's response length increased 40% per turn, indicating escalating repetition rather than progress.
The second version includes specific turn numbers, counts, a gap analysis against the skill library, and a quantitative signal (response length growth). All derived from code execution against the trace data.
This precision flows downstream. Better analysis → more specific skills → more effective prompts → fewer agent failures.
Resources
- Recursive Language Models (arXiv:2512.24601) by Zhang, Kraska, and Khattab
- Kayba's Recursive Reflector implementation: ace/reflector/recursive.py
- Agentic Context Engineering (arXiv:2510.04618): the ACE framework paper