What is Context Engineering for AI Agents?

Context engineering is the discipline of building the right context for every AI agent step. Learn how it works, why it matters, and how Kayba automates it.

March 11, 2026
Educational · Context Engineering · ACE · Agentic Context Engineering

What is Context Engineering?

Context engineering is the discipline of constructing the right context for each LLM call. It determines what information, examples, constraints, and learned patterns surround every instruction you give to a language model.

The term was popularized by Andrej Karpathy, who described it as "the delicate art and science of filling the context window with just the right information for the next step."

This is different from prompt engineering. Prompt engineering focuses on how you phrase the instruction. Context engineering focuses on what surrounds the instruction: the examples, the retrieved documents, the conversation history, the system-level constraints, and the learned strategies that shape how the model interprets and executes that instruction.

A well-engineered prompt with poor context will underperform. A mediocre prompt with excellent context will often succeed. Context is the bigger lever.
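The distinction can be made concrete. Below is a minimal sketch (all names and the section layout are illustrative, not a real API) of a context-engineered call: the instruction stays fixed, and the engineering effort goes into choosing what surrounds it.

```python
def build_context(instruction: str,
                  constraints: list[str],
                  examples: list[str],
                  retrieved_docs: list[str],
                  history: list[str]) -> str:
    """Assemble the full context around an unchanged instruction.

    Prompt engineering would reword `instruction`; context engineering
    chooses what goes in the other four arguments.
    """
    sections = []
    if constraints:
        sections.append("Constraints:\n" + "\n".join(f"- {c}" for c in constraints))
    if examples:
        sections.append("Examples:\n" + "\n\n".join(examples))
    if retrieved_docs:
        sections.append("Reference material:\n" + "\n\n".join(retrieved_docs))
    if history:
        sections.append("Conversation so far:\n" + "\n".join(history))
    sections.append("Task:\n" + instruction)
    return "\n\n".join(sections)

prompt = build_context(
    instruction="Classify the ticket priority.",
    constraints=["Answer with one of: low, medium, high."],
    examples=["Ticket: 'Site is down' -> high"],
    retrieved_docs=[],
    history=[],
)
```

Two calls with the identical `instruction` can behave very differently depending on what the other four arguments contain; that difference is the lever the rest of this article is about.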

Why Context Engineering Matters for Agents

For a single LLM call, context engineering is important. For AI agents, it is essential.

Agents operate across multiple steps. At each step, the model makes a decision based on what is in its context window. The context changes after every action: new tool results arrive, previous outputs accumulate, the task state evolves. What context the agent has at step 7 determines whether it recovers from the mistake it made at step 3.

This creates challenges that don't exist in single-call scenarios:

  • Context accumulates. Every tool call, observation, and intermediate result adds tokens. By step 20, the context window may be bloated with irrelevant history while critical early information has been pushed out of effective attention range.
  • Attention degrades with length. Research shows that LLM accuracy drops from 90% to 51% in longer conversations (Microsoft, 2024). Information in the middle of the context is retrieved less reliably than information at the start or end.
  • One bad step compounds. If the agent lacks the right context at a decision point, it takes a wrong action. That wrong action produces misleading results that pollute context for all subsequent steps.

The implication: for agents, context engineering is not a one-time setup. It is a continuous process that must adapt at every step of execution.
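To make "adapt at every step" concrete, here is a deliberately naive sketch of one per-step policy: protect the system message, then keep only the most recent turns that fit a token budget. (The crude `len // 4` token estimate and the message schema are assumptions; real systems also summarize dropped history rather than discarding it.)

```python
def prune_step_context(messages: list[dict], max_tokens: int,
                       count_tokens=lambda m: len(m["content"]) // 4) -> list[dict]:
    """Keep the system message plus the newest turns within a token budget.

    A naive per-step pruning policy: the first (system) message is always
    kept; remaining messages are added newest-first until the budget runs out.
    """
    system, rest = messages[0], messages[1:]
    budget = max_tokens - count_tokens(system)
    kept = []
    for msg in reversed(rest):          # walk newest-first
        cost = count_tokens(msg)
        if cost > budget:
            break                       # oldest history falls off here
        kept.append(msg)
        budget -= cost
    return [system] + list(reversed(kept))

msgs = [{"role": "system", "content": "s" * 40}] \
     + [{"role": "tool", "content": str(i) * 200} for i in range(5)]
pruned = prune_step_context(msgs, max_tokens=120)
```

Even this toy policy illustrates the trade-off: recency is preserved, but anything important in the dropped turns is gone, which is exactly why smarter, learned context selection matters.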

The Problem: Manual Context Engineering Does Not Scale

Most teams start with manual context engineering. They write system prompts, create CLAUDE.md or CURSOR_RULES files, maintain example libraries, and hand-craft few-shot demonstrations.

This works initially. Then it stops working.

The maintenance problem. Static context goes stale. As your agent encounters new edge cases, someone has to identify the gap, write the new instruction, and add it to the prompt. This requires an engineer who understands both the agent's behavior and the domain well enough to write effective guidance.

The bloat problem. Over time, system prompts accumulate instructions for every edge case. Context windows fill with rules that are irrelevant to the current task. Each irrelevant element actively degrades performance -- research shows that even a single irrelevant distractor in context measurably reduces LLM accuracy.

The discovery problem. The most valuable context often comes from patterns that are invisible in individual interactions but clear across hundreds of traces. "When the user mentions a return and a replacement in the same message, always handle the return first" -- that kind of insight only emerges from data. No one writes it into a system prompt proactively.

The scaling problem. If you have one agent doing one task, manual context engineering is feasible. If you have dozens of agents across multiple domains, each needing domain-specific strategies that evolve weekly, it is not.

The Research: Agentic Context Engineering

In October 2025, researchers at Stanford University and SambaNova Systems published a paper that formalized context engineering for agents as a machine learning problem. The paper, Agentic Context Engineering (arXiv:2510.04618), introduced the ACE framework -- a three-agent architecture that builds and refines agent context automatically from execution feedback.

The core insight: instead of manually writing system prompts, let the agent learn what context it needs from its own experience.

ACE introduced several key concepts:

The Skillbook (Dynamic Playbook)

A structured repository of learned strategies, organized by task category. Each entry (called a "skill" or "bullet") includes the strategy itself, metadata about when it was learned, and counters tracking how often it has been helpful or harmful in practice.

Unlike a static prompt, the Skillbook evolves. New skills are added when the system discovers effective patterns. Existing skills are refined or removed based on ongoing feedback.
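One way to represent such an entry is sketched below. The field names and the `score` heuristic are illustrative, not the paper's exact schema; the point is that each skill carries its own provenance and track record.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Skill:
    """A single Skillbook entry: a learned strategy plus its track record."""
    skill_id: str
    category: str           # task category used for retrieval
    strategy: str           # the actionable guidance itself
    learned_on: date        # when the system discovered this pattern
    helpful_count: int = 0  # times this skill contributed to success
    harmful_count: int = 0  # times this skill contributed to failure

    def score(self) -> float:
        """Net usefulness in [-1, 1]; negative scores suggest removal."""
        total = self.helpful_count + self.harmful_count
        return 0.0 if total == 0 else (self.helpful_count - self.harmful_count) / total

skill = Skill("ret-001", "returns",
              "Handle the return before the replacement when both appear.",
              date(2025, 10, 7), helpful_count=8, harmful_count=1)
```

Because each entry is discrete and scored, the Skillbook can be edited surgically -- a property the delta updates described below depend on.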

Three-Agent Architecture

ACE separates the learning process into three specialized roles:

  1. Generator -- Executes tasks using strategies retrieved from the Skillbook
  2. Reflector -- Analyzes execution outcomes to identify what worked and what failed
  3. Curator -- Updates the Skillbook based on the Reflector's analysis, using structured delta operations

This separation ensures that the agent doing the work is not also responsible for evaluating its own performance -- a design choice that improves the quality of learning.
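The control flow of one learning pass can be sketched as below. The three roles are stubbed as plain callables so the loop structure is visible; in ACE each would be an LLM-backed component, and the `Skillbook` class here is a minimal stand-in.

```python
class Skillbook:
    """Minimal in-memory skillbook: skill_id -> strategy, with delta support."""
    def __init__(self):
        self.skills = {"s1": "Confirm the order id before acting."}
    def retrieve(self, category):
        return list(self.skills.values())
    def apply(self, deltas):
        for op, sid, strategy in deltas:
            if op == "add":
                self.skills[sid] = strategy
            elif op == "remove":
                self.skills.pop(sid, None)

def ace_iteration(task, skillbook, generator, reflector, curator):
    """One pass of the ACE-style loop with the three roles kept separate."""
    skills = skillbook.retrieve(task["category"])   # context for this task
    trace = generator(task, skills)                 # 1. Generator executes
    reflection = reflector(trace)                   # 2. Reflector evaluates
    deltas = curator(reflection)                    # 3. Curator proposes updates
    skillbook.apply(deltas)
    return trace

# Stub roles standing in for LLM-backed components:
generator = lambda task, skills: {"task": task, "used": skills, "ok": True}
reflector = lambda trace: "succeeded" if trace["ok"] else "failed"
curator = lambda reflection: ([("add", "s2", "Log the resolution code.")]
                              if reflection == "succeeded" else [])

book = Skillbook()
trace = ace_iteration({"category": "returns"}, book, generator, reflector, curator)
```

Note that the Generator never touches the Skillbook directly: it only consumes retrieved skills, while evaluation and curation happen in separate roles.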

Delta Updates

A critical technical detail: when updating context, ACE never asks the LLM to rewrite the entire Skillbook. LLMs exhibit brevity bias when rewriting -- they compress and lose detail. Instead, ACE uses delta operations (add, remove, modify) that make surgical changes to specific entries. This preserves the accumulated knowledge while incorporating new learnings.
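In code, a delta update is just a small structured operation applied to one entry. The schema below is a simplified illustration (the paper's format is richer), but it shows why deltas avoid brevity bias: the LLM only emits the change, never the whole Skillbook.

```python
def apply_delta(skillbook: dict, delta: dict) -> None:
    """Apply one surgical update instead of rewriting the whole Skillbook.

    `skillbook` maps skill_id -> strategy text; `delta` is one of:
      {"op": "add",    "id": ..., "strategy": ...}
      {"op": "remove", "id": ...}
      {"op": "modify", "id": ..., "strategy": ...}
    """
    op = delta["op"]
    if op == "add":
        skillbook[delta["id"]] = delta["strategy"]
    elif op == "remove":
        skillbook.pop(delta["id"], None)       # removing a missing id is a no-op
    elif op == "modify":
        if delta["id"] in skillbook:           # modify only touches existing entries
            skillbook[delta["id"]] = delta["strategy"]
    else:
        raise ValueError(f"unknown delta op: {op}")
```

Every entry not named by a delta is untouched, so accumulated detail survives each update cycle by construction.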

Results

The ACE paper demonstrated significant improvements across benchmarks:

  • +10.6 percentage points on the AppWorld agent benchmark vs. strong baselines
  • +17.1 percentage points vs. base LLM performance
  • 86.9% lower adaptation latency compared to existing context-adaptation methods

These improvements compound over time. As the Skillbook accumulates more validated strategies, the agent encounters fewer novel failure modes.

How Kayba Automates Context Engineering

Kayba is the open-source implementation of automated context engineering for AI agents. It takes the concepts from the ACE paper and combines them with additional research to create a production-ready system.

The pipeline works in four stages:

1. Trace Analysis (Recursive Reflector)

Kayba's analysis goes beyond single-pass LLM review. The Recursive Reflector, inspired by the Recursive Language Models (RLM) paper, uses a REPL-based approach where the LLM writes code to programmatically analyze traces. Instead of asking the model to "summarize what went wrong," it lets the model filter, count, cross-reference, and verify patterns across traces.

This produces quantitative, evidence-backed insights rather than surface-level summaries.
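The kind of query such a reflector might write for itself looks like the sketch below: filter to failures, group by tool, count. The trace schema (`steps`, `tool`, `ok`) is an assumption for illustration, not Kayba's actual format.

```python
from collections import Counter

def failure_counts_by_tool(traces: list[dict]) -> Counter:
    """Count failed tool calls per tool across many traces.

    A programmatic analysis an LLM could emit in a REPL: it yields an
    exact, verifiable tally rather than an impressionistic summary.
    """
    return Counter(
        step["tool"]
        for trace in traces
        for step in trace["steps"]
        if not step["ok"]
    )

traces = [
    {"steps": [{"tool": "search", "ok": False},
               {"tool": "search", "ok": True},
               {"tool": "db", "ok": False}]},
    {"steps": [{"tool": "search", "ok": False}]},
]
counts = failure_counts_by_tool(traces)
```

A statement like "search failed 2 times across these traces" is grounded in the data itself, which is what makes the resulting skills evidence-backed.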

2. Skill Extraction

From the analysis, Kayba extracts discrete skills -- specific, actionable strategies that can be applied to future tasks. Each skill is tied to evidence from the traces that produced it, so you can verify why the system believes this strategy matters.

Skills are automatically deduplicated using semantic similarity, preventing the Skillbook from accumulating redundant entries.
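The deduplication idea can be sketched as follows. As a toy stand-in for real embeddings, this uses bag-of-words cosine similarity; production dedup would compare embedding vectors the same way, and the 0.8 threshold is an arbitrary choice for illustration.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def dedupe_skills(strategies: list[str], threshold: float = 0.8) -> list[str]:
    """Drop near-duplicate strategies before they enter the Skillbook.

    Each candidate is kept only if it is sufficiently dissimilar from
    everything already kept (first occurrence wins).
    """
    kept: list[tuple[str, Counter]] = []
    for s in strategies:
        vec = Counter(s.lower().split())
        if all(cosine(vec, kv) < threshold for _, kv in kept):
            kept.append((s, vec))
    return [s for s, _ in kept]

unique = dedupe_skills([
    "always confirm the order id first",
    "Always confirm the order id first.",
    "escalate refunds over 500 dollars",
])
```

The two near-identical phrasings collapse to one entry while the genuinely different strategy survives, which is exactly the behavior that keeps the Skillbook from bloating.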

3. Human Review

Extracted skills go through a review workflow. You can approve, edit, or reject each skill before it becomes part of the active Skillbook. This human-in-the-loop step ensures that automated learning doesn't introduce harmful patterns.

Every skill shows its helpful/harmful counters, so you can see at a glance which strategies are working and which need attention.

4. Prompt Generation

Approved skills are compiled into system prompts, organized by section. The generated prompts contain only the strategies relevant to the task at hand -- just-in-time context rather than everything-at-once context.

This is where context engineering becomes concrete: instead of a hand-written system prompt that tries to cover every case, you get a curated, evidence-backed prompt that reflects what has actually worked for your specific agent in your specific domain.
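A minimal sketch of that compilation step, assuming a simple skill record with `approved`, `category`, `section`, and `strategy` fields (illustrative names, not Kayba's actual schema):

```python
def compile_prompt(skills: list[dict], task_category: str, base: str) -> str:
    """Compile only approved, task-relevant skills into a system prompt.

    Just-in-time context: filter by approval status and task category,
    then group the surviving strategies into named sections.
    """
    relevant = [s for s in skills
                if s["approved"] and s["category"] == task_category]
    sections: dict[str, list[str]] = {}
    for s in relevant:
        sections.setdefault(s["section"], []).append(s["strategy"])
    parts = [base]
    for name, strategies in sections.items():
        parts.append(f"## {name}\n" + "\n".join(f"- {x}" for x in strategies))
    return "\n\n".join(parts)

skills = [
    {"approved": True, "category": "returns", "section": "Ordering",
     "strategy": "Handle the return before the replacement."},
    {"approved": False, "category": "returns", "section": "Ordering",
     "strategy": "Rejected idea."},
    {"approved": True, "category": "billing", "section": "Refunds",
     "strategy": "Escalate large refunds."},
]
prompt = compile_prompt(skills, "returns", "You are a support agent.")
```

Rejected skills and skills from other task categories never reach the prompt, so each call pays only for the context it needs.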

Context Engineering vs. Prompt Engineering vs. Fine-Tuning

These three approaches to improving LLM behavior are complementary but operate at different levels:

| Aspect | Prompt Engineering | Context Engineering | Fine-Tuning |
|---|---|---|---|
| What changes | The instruction phrasing | The information surrounding the instruction | The model weights |
| Scope | Single call | Single call or multi-step session | All future calls |
| Maintenance | Manual, per use case | Can be automated | Requires retraining |
| Adaptability | Static until manually updated | Evolves with new data | Static until retrained |
| Cost | Engineer time only | Inference costs for analysis | $10K+ per training run |
| Interpretability | Readable | Readable and auditable | Black box |
| Reversibility | Edit the prompt | Edit or remove specific skills | Requires retraining |
| Time to improve | Minutes (manual) | Hours (automated analysis) | Days to weeks |
| Best for | Initial setup, simple tasks | Ongoing improvement, complex agents | Fundamental capability changes |

Prompt engineering is where you start. Context engineering is how you scale. Fine-tuning is for cases where the model fundamentally lacks a capability.

For most agent teams, the highest-leverage investment is moving from manual prompt engineering to automated context engineering -- it addresses the maintenance, bloat, and discovery problems simultaneously.

Getting Started

Try Kayba

Install the open-source framework:

pip install ace-framework

Or use the hosted dashboard for a visual interface with team collaboration features.

Join the Community

  • GitHub -- Source code, issues, and discussions
  • Discord -- Community support and conversation
  • Book a demo -- See Kayba in action on your own agent traces