What is Context Engineering for AI Agents?

Context engineering is the discipline of building the right context for every AI agent step. Learn how it works, why it matters, and how Kayba automates it.

March 11, 2026
Educational · Context Engineering · ACE · Agentic Context Engineering

What is Context Engineering?

Context engineering is the discipline of constructing the right context for each LLM call. It determines what information, examples, constraints, and learned patterns surround every instruction you give to a language model.

The term was popularized by Andrej Karpathy, who described it as "the delicate art and science of filling the context window with just the right information for the next step."

This is different from prompt engineering. Prompt engineering focuses on how you phrase the instruction. Context engineering focuses on what surrounds the instruction: the examples, the retrieved documents, the conversation history, the system-level constraints, and the learned strategies that shape how the model interprets and executes that instruction.

A well-engineered prompt with poor context will underperform. A mediocre prompt with excellent context will often succeed. Context is the bigger lever.
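The distinction can be made concrete. Below is a minimal sketch (all names and the section layout are illustrative, not a real API) of a context-engineered call: the instruction stays fixed, and the engineering effort goes into choosing what surrounds it.

```python
def build_context(instruction: str,
                  constraints: list[str],
                  examples: list[str],
                  retrieved_docs: list[str],
                  history: list[str]) -> str:
    """Assemble the full context around an unchanged instruction.

    Prompt engineering would reword `instruction`; context engineering
    chooses what goes in the other four arguments.
    """
    sections = []
    if constraints:
        sections.append("Constraints:\n" + "\n".join(f"- {c}" for c in constraints))
    if examples:
        sections.append("Examples:\n" + "\n\n".join(examples))
    if retrieved_docs:
        sections.append("Reference material:\n" + "\n\n".join(retrieved_docs))
    if history:
        sections.append("Conversation so far:\n" + "\n".join(history))
    sections.append("Task:\n" + instruction)
    return "\n\n".join(sections)

prompt = build_context(
    instruction="Classify the ticket priority.",
    constraints=["Answer with one of: low, medium, high."],
    examples=["Ticket: 'Site is down' -> high"],
    retrieved_docs=[],
    history=[],
)
```

Two calls with the identical `instruction` can behave very differently depending on what the other four arguments contain; that difference is the lever the rest of this article is about.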

Why Context Engineering Matters for Agents

For a single LLM call, context engineering is important. For AI agents, it is essential.

Agents operate across multiple steps. At each step, the model makes a decision based on what is in its context window. The context changes after every action: new tool results arrive, previous outputs accumulate, the task state evolves. What context the agent has at step 7 determines whether it recovers from the mistake it made at step 3.

This creates challenges that don't exist in single-call scenarios:

  • Context accumulates. Every tool call, observation, and intermediate result adds tokens. By step 20, the context window may be bloated with irrelevant history while critical early information has been pushed out of effective attention range.
  • Attention degrades with length. Research shows that LLM accuracy drops from 90% to 51% in longer conversations (Microsoft, 2024). Information in the middle of the context is retrieved less reliably than information at the start or end.
  • One bad step compounds. If the agent lacks the right context at a decision point, it takes a wrong action. That wrong action produces misleading results that pollute context for all subsequent steps.

The implication: for agents, context engineering is not a one-time setup. It is a continuous process that must adapt at every step of execution.
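To make "adapt at every step" concrete, here is a deliberately naive sketch of one per-step policy: protect the system message, then keep only the most recent turns that fit a token budget. (The crude `len // 4` token estimate and the message schema are assumptions; real systems also summarize dropped history rather than discarding it.)

```python
def prune_step_context(messages: list[dict], max_tokens: int,
                       count_tokens=lambda m: len(m["content"]) // 4) -> list[dict]:
    """Keep the system message plus the newest turns within a token budget.

    A naive per-step pruning policy: the first (system) message is always
    kept; remaining messages are added newest-first until the budget runs out.
    """
    system, rest = messages[0], messages[1:]
    budget = max_tokens - count_tokens(system)
    kept = []
    for msg in reversed(rest):          # walk newest-first
        cost = count_tokens(msg)
        if cost > budget:
            break                       # oldest history falls off here
        kept.append(msg)
        budget -= cost
    return [system] + list(reversed(kept))

msgs = [{"role": "system", "content": "s" * 40}] \
     + [{"role": "tool", "content": str(i) * 200} for i in range(5)]
pruned = prune_step_context(msgs, max_tokens=120)
```

Even this toy policy illustrates the trade-off: recency is preserved, but anything important in the dropped turns is gone, which is exactly why smarter, learned context selection matters.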

The Problem: Manual Context Engineering Does Not Scale

Most teams start with manual context engineering. They write system prompts, create CLAUDE.md or CURSOR_RULES files, maintain example libraries, and hand-craft few-shot demonstrations.

This works initially. Then it stops working.

The maintenance problem. Static context goes stale. As your agent encounters new edge cases, someone has to identify the gap, write the new instruction, and add it to the prompt. This requires an engineer who understands both the agent's behavior and the domain well enough to write effective guidance.

The bloat problem. Over time, system prompts accumulate instructions for every edge case. Context windows fill with rules that are irrelevant to the current task. Each irrelevant element actively degrades performance -- research shows that even a single irrelevant distractor in context measurably reduces LLM accuracy.

The discovery problem. The most valuable context often comes from patterns that are invisible in individual interactions but clear across hundreds of traces. "When the user mentions a return and a replacement in the same message, always handle the return first" -- that kind of insight only emerges from data. No one writes it into a system prompt proactively.

The scaling problem. If you have one agent doing one task, manual context engineering is feasible. If you have dozens of agents across multiple domains, each needing domain-specific strategies that evolve weekly, it is not.

The Research: Agentic Context Engineering

In October 2025, researchers at Stanford University and SambaNova Systems published a paper that formalized context engineering for agents as a machine learning problem. The paper, Agentic Context Engineering (arXiv:2510.04618), introduced the ACE framework -- a three-agent architecture that builds and refines agent context automatically from execution feedback.

The core insight: instead of manually writing system prompts, let the agent learn what context it needs from its own experience.

ACE introduced several key concepts:

The Skillbook (Dynamic Playbook)

A structured repository of learned strategies, organized by task category. Each entry (called a "skill" or "bullet") includes the strategy itself, metadata about when it was learned, and counters tracking how often it has been helpful or harmful in practice.

Unlike a static prompt, the Skillbook evolves. New skills are added when the system discovers effective patterns. Existing skills are refined or removed based on ongoing feedback.
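One way to represent such an entry is sketched below. The field names and the `score` heuristic are illustrative, not the paper's exact schema; the point is that each skill carries its own provenance and track record.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Skill:
    """A single Skillbook entry: a learned strategy plus its track record."""
    skill_id: str
    category: str           # task category used for retrieval
    strategy: str           # the actionable guidance itself
    learned_on: date        # when the system discovered this pattern
    helpful_count: int = 0  # times this skill contributed to success
    harmful_count: int = 0  # times this skill contributed to failure

    def score(self) -> float:
        """Net usefulness in [-1, 1]; negative scores suggest removal."""
        total = self.helpful_count + self.harmful_count
        return 0.0 if total == 0 else (self.helpful_count - self.harmful_count) / total

skill = Skill("ret-001", "returns",
              "Handle the return before the replacement when both appear.",
              date(2025, 10, 7), helpful_count=8, harmful_count=1)
```

Because each entry is discrete and scored, the Skillbook can be edited surgically -- a property the delta updates described below depend on.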

Three-Agent Architecture

ACE separates the learning process into three specialized roles:

  1. Generator -- Executes tasks using strategies retrieved from the Skillbook
  2. Reflector -- Analyzes execution outcomes to identify what worked and what failed
  3. Curator -- Updates the Skillbook based on the Reflector's analysis, using structured delta operations

This separation ensures that the agent doing the work is not also responsible for evaluating its own performance -- a design choice that improves the quality of learning.
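The control flow of one learning pass can be sketched as below. The three roles are stubbed as plain callables so the loop structure is visible; in ACE each would be an LLM-backed component, and the `Skillbook` class here is a minimal stand-in.

```python
class Skillbook:
    """Minimal in-memory skillbook: skill_id -> strategy, with delta support."""
    def __init__(self):
        self.skills = {"s1": "Confirm the order id before acting."}
    def retrieve(self, category):
        return list(self.skills.values())
    def apply(self, deltas):
        for op, sid, strategy in deltas:
            if op == "add":
                self.skills[sid] = strategy
            elif op == "remove":
                self.skills.pop(sid, None)

def ace_iteration(task, skillbook, generator, reflector, curator):
    """One pass of the ACE-style loop with the three roles kept separate."""
    skills = skillbook.retrieve(task["category"])   # context for this task
    trace = generator(task, skills)                 # 1. Generator executes
    reflection = reflector(trace)                   # 2. Reflector evaluates
    deltas = curator(reflection)                    # 3. Curator proposes updates
    skillbook.apply(deltas)
    return trace

# Stub roles standing in for LLM-backed components:
generator = lambda task, skills: {"task": task, "used": skills, "ok": True}
reflector = lambda trace: "succeeded" if trace["ok"] else "failed"
curator = lambda reflection: ([("add", "s2", "Log the resolution code.")]
                              if reflection == "succeeded" else [])

book = Skillbook()
trace = ace_iteration({"category": "returns"}, book, generator, reflector, curator)
```

Note that the Generator never touches the Skillbook directly: it only consumes retrieved skills, while evaluation and curation happen in separate roles.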

Delta Updates

A critical technical detail: when updating context, ACE never asks the LLM to rewrite the entire Skillbook. LLMs exhibit brevity bias when rewriting -- they compress and lose detail. Instead, ACE uses delta operations (add, remove, modify) that make surgical changes to specific entries. This preserves the accumulated knowledge while incorporating new learnings.
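In code, a delta update is just a small structured operation applied to one entry. The schema below is a simplified illustration (the paper's format is richer), but it shows why deltas avoid brevity bias: the LLM only emits the change, never the whole Skillbook.

```python
def apply_delta(skillbook: dict, delta: dict) -> None:
    """Apply one surgical update instead of rewriting the whole Skillbook.

    `skillbook` maps skill_id -> strategy text; `delta` is one of:
      {"op": "add",    "id": ..., "strategy": ...}
      {"op": "remove", "id": ...}
      {"op": "modify", "id": ..., "strategy": ...}
    """
    op = delta["op"]
    if op == "add":
        skillbook[delta["id"]] = delta["strategy"]
    elif op == "remove":
        skillbook.pop(delta["id"], None)       # removing a missing id is a no-op
    elif op == "modify":
        if delta["id"] in skillbook:           # modify only touches existing entries
            skillbook[delta["id"]] = delta["strategy"]
    else:
        raise ValueError(f"unknown delta op: {op}")
```

Every entry not named by a delta is untouched, so accumulated detail survives each update cycle by construction.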

Results

The ACE paper demonstrated significant improvements across benchmarks:

  • +10.6 percentage points on the AppWorld agent benchmark vs. strong baselines
  • +17.1 percentage points vs. base LLM performance
  • 86.9% lower adaptation latency compared to existing context-adaptation methods

These improvements compound over time. As the Skillbook accumulates more validated strategies, the agent encounters fewer novel failure modes.

How Kayba Automates Context Engineering

Kayba is the open-source implementation of automated context engineering for AI agents. It takes the concepts from the ACE paper and combines them with additional research to create a production-ready system.

The pipeline works in four stages:

1. Trace Analysis (Recursive Reflector)

Kayba's analysis goes beyond single-pass LLM review. The Recursive Reflector, inspired by the Recursive Language Models (RLM) paper, uses a REPL-based approach where the LLM writes code to programmatically analyze traces. Instead of asking the model to "summarize what went wrong," it lets the model filter, count, cross-reference, and verify patterns across traces.

This produces quantitative, evidence-backed insights rather than surface-level summaries.
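The kind of query such a reflector might write for itself looks like the sketch below: filter to failures, group by tool, count. The trace schema (`steps`, `tool`, `ok`) is an assumption for illustration, not Kayba's actual format.

```python
from collections import Counter

def failure_counts_by_tool(traces: list[dict]) -> Counter:
    """Count failed tool calls per tool across many traces.

    A programmatic analysis an LLM could emit in a REPL: it yields an
    exact, verifiable tally rather than an impressionistic summary.
    """
    return Counter(
        step["tool"]
        for trace in traces
        for step in trace["steps"]
        if not step["ok"]
    )

traces = [
    {"steps": [{"tool": "search", "ok": False},
               {"tool": "search", "ok": True},
               {"tool": "db", "ok": False}]},
    {"steps": [{"tool": "search", "ok": False}]},
]
counts = failure_counts_by_tool(traces)
```

A statement like "search failed 2 times across these traces" is grounded in the data itself, which is what makes the resulting skills evidence-backed.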

2. Skill Extraction

From the analysis, Kayba extracts discrete skills -- specific, actionable strategies that can be applied to future tasks. Each skill is tied to evidence from the traces that produced it, so you can verify why the system believes this strategy matters.

Skills are automatically deduplicated using semantic similarity, preventing the Skillbook from accumulating redundant entries.
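The deduplication idea can be sketched as follows. As a toy stand-in for real embeddings, this uses bag-of-words cosine similarity; production dedup would compare embedding vectors the same way, and the 0.8 threshold is an arbitrary choice for illustration.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def dedupe_skills(strategies: list[str], threshold: float = 0.8) -> list[str]:
    """Drop near-duplicate strategies before they enter the Skillbook.

    Each candidate is kept only if it is sufficiently dissimilar from
    everything already kept (first occurrence wins).
    """
    kept: list[tuple[str, Counter]] = []
    for s in strategies:
        vec = Counter(s.lower().split())
        if all(cosine(vec, kv) < threshold for _, kv in kept):
            kept.append((s, vec))
    return [s for s, _ in kept]

unique = dedupe_skills([
    "always confirm the order id first",
    "Always confirm the order id first.",
    "escalate refunds over 500 dollars",
])
```

The two near-identical phrasings collapse to one entry while the genuinely different strategy survives, which is exactly the behavior that keeps the Skillbook from bloating.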

3. Human Review

Extracted skills go through a review workflow. You can approve, edit, or reject each skill before it becomes part of the active Skillbook. This human-in-the-loop step ensures that automated learning doesn't introduce harmful patterns.

Every skill shows its helpful/harmful counters, so you can see at a glance which strategies are working and which need attention.

4. Prompt Generation

Approved skills are compiled into system prompts, organized by section. The generated prompts contain only the strategies relevant to the task at hand -- just-in-time context rather than everything-at-once context.

This is where context engineering becomes concrete: instead of a hand-written system prompt that tries to cover every case, you get a curated, evidence-backed prompt that reflects what has actually worked for your specific agent in your specific domain.
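A minimal sketch of that compilation step, assuming a simple skill record with `approved`, `category`, `section`, and `strategy` fields (illustrative names, not Kayba's actual schema):

```python
def compile_prompt(skills: list[dict], task_category: str, base: str) -> str:
    """Compile only approved, task-relevant skills into a system prompt.

    Just-in-time context: filter by approval status and task category,
    then group the surviving strategies into named sections.
    """
    relevant = [s for s in skills
                if s["approved"] and s["category"] == task_category]
    sections: dict[str, list[str]] = {}
    for s in relevant:
        sections.setdefault(s["section"], []).append(s["strategy"])
    parts = [base]
    for name, strategies in sections.items():
        parts.append(f"## {name}\n" + "\n".join(f"- {x}" for x in strategies))
    return "\n\n".join(parts)

skills = [
    {"approved": True, "category": "returns", "section": "Ordering",
     "strategy": "Handle the return before the replacement."},
    {"approved": False, "category": "returns", "section": "Ordering",
     "strategy": "Rejected idea."},
    {"approved": True, "category": "billing", "section": "Refunds",
     "strategy": "Escalate large refunds."},
]
prompt = compile_prompt(skills, "returns", "You are a support agent.")
```

Rejected skills and skills from other task categories never reach the prompt, so each call pays only for the context it needs.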

Context Engineering vs. Prompt Engineering vs. Fine-Tuning

These three approaches to improving LLM behavior are complementary but operate at different levels:

| Aspect | Prompt Engineering | Context Engineering | Fine-Tuning |
|---|---|---|---|
| What changes | The instruction phrasing | The information surrounding the instruction | The model weights |
| Scope | Single call | Single call or multi-step session | All future calls |
| Maintenance | Manual, per use case | Can be automated | Requires retraining |
| Adaptability | Static until manually updated | Evolves with new data | Static until retrained |
| Cost | Engineer time only | Inference costs for analysis | $10K+ per training run |
| Interpretability | Readable | Readable and auditable | Black box |
| Reversibility | Edit the prompt | Edit or remove specific skills | Requires retraining |
| Time to improve | Minutes (manual) | Hours (automated analysis) | Days to weeks |
| Best for | Initial setup, simple tasks | Ongoing improvement, complex agents | Fundamental capability changes |

Prompt engineering is where you start. Context engineering is how you scale. Fine-tuning is for cases where the model fundamentally lacks a capability.

For most agent teams, the highest-leverage investment is moving from manual prompt engineering to automated context engineering -- it addresses the maintenance, bloat, and discovery problems simultaneously.

Getting Started

Try Kayba

Install the open-source framework:

pip install ace-framework

Or use the hosted dashboard for a visual interface with team collaboration features.

Join the Community

  • GitHub -- Source code, issues, and discussions
  • Discord -- Community support and conversation
  • Book a demo -- See Kayba in action on your own agent traces