The Short Answer
Manual prompt engineering works at small scale. If you have a handful of conversations per week and one person reading every trace, you can iterate by hand. But once your agent handles hundreds or thousands of interactions, manual prompting becomes the bottleneck — and it breaks silently.
Kayba automates the trace-to-prompt loop. It analyzes execution traces, extracts reusable skills, and generates better system prompts. No more reading traces at 2am trying to figure out why your agent keeps giving wrong refunds.
The Manual Prompt Engineering Loop
Every team shipping AI agents knows this cycle:
- Deploy agent with a system prompt
- Users interact with it, some conversations fail
- Someone reads through traces to find failure patterns
- They edit the system prompt to address those failures
- Redeploy and hope it works
- Repeat
This works initially. But it has fundamental problems:
You can't read every trace. At 100+ conversations per day, reviewing each one is physically impossible. Failures slip through — policy violations, hallucinations, missed escalations — and nobody notices until a customer complains.
Pattern recognition is inconsistent. Different engineers spot different patterns. One person might catch that the agent forgets to check inventory before confirming orders. Another might miss it. There's no systematic way to ensure all failure modes are captured.
Prompt edits conflict. Fix one failure mode, accidentally break another. Without a structured way to track what the agent has learned and why, every edit is a gamble. System prompts grow into thousands of tokens of accumulated patches, with no clear provenance.
It doesn't compound. Each prompt fix is a one-off. There's no mechanism for the agent to build on previous learnings across tasks and time periods. You're starting from scratch every iteration.
How Kayba Replaces the Manual Loop
Kayba sits on top of your existing agent — whatever framework you use (LangChain, CrewAI, OpenAI Agents SDK, browser-use). It replaces the manual loop with an automated pipeline:
| Step | Manual | Kayba |
|---|---|---|
| Trace review | Engineer reads conversations one by one | Recursive Reflector programmatically analyzes traces via REPL-based code execution |
| Pattern extraction | Engineer mentally notes failure patterns | Skills are extracted as atomic, reusable strategies with helpful/harmful counters |
| Knowledge storage | Scattered in docs, Slack, or the engineer's memory | Skillbook — a transparent, auditable collection of learned behaviors with provenance |
| Prompt updates | Engineer manually edits system prompt | Prompt generation from approved skills, organized by section |
| Quality tracking | Hope and spot-checking | Delta updates with usage counters — every skill tracks its impact |
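The automated pipeline in the table can be sketched in miniature. This is a toy illustration of the trace-to-skills-to-prompt flow, not Kayba's actual API — the `Skill` class and the `extract_skills`/`generate_prompt` functions are invented here for clarity:

```python
from dataclasses import dataclass

# Hypothetical sketch of the trace-to-prompt loop; names are illustrative,
# not Kayba's real interface.

@dataclass
class Skill:
    description: str
    helpful: int = 0   # validation counters, as in the table above
    harmful: int = 0

def extract_skills(traces):
    # Stand-in for the Recursive Reflector: turn each failing trace
    # into a candidate skill.
    return [Skill(f"Avoid: {t['failure']}") for t in traces if t.get("failure")]

def generate_prompt(skills):
    # Skills whose counters look net-positive become prompt lines.
    return "\n".join(f"- {s.description}" for s in skills if s.harmful <= s.helpful)

traces = [
    {"failure": "confirmed order without checking inventory"},
    {"failure": None},
]
print(generate_prompt(extract_skills(traces)))
# → - Avoid: confirmed order without checking inventory
```

The point of the sketch is the shape of the loop: traces in, structured skills in the middle, prompt text out — with counters deciding what survives.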
The Recursive Reflector
The key difference is how trace analysis works. Manual prompt engineering relies on a human reading conversations. Kayba's Recursive Reflector, inspired by Recursive Language Models research, uses a Python REPL sandbox to programmatically explore traces — iterating through inspect, analyze, query, and refine steps via code execution. This catches patterns that surface-level reading misses.
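To make "programmatic exploration" concrete, here is a toy version of what analysis-as-code buys you over reading conversations one by one. The trace structure and field names below are invented for illustration; real traces come from your agent's logs:

```python
import collections

# Toy traces; field names are assumptions for this example.
traces = [
    {"id": 1, "steps": ["lookup_policy", "issue_refund"], "outcome": "fail"},
    {"id": 2, "steps": ["issue_refund"], "outcome": "fail"},
    {"id": 3, "steps": ["lookup_policy", "issue_refund"], "outcome": "ok"},
]

# One query across every trace at once: how often does a refund happen
# without a preceding policy lookup, and what are the outcomes?
missing_lookup = [t for t in traces
                  if "issue_refund" in t["steps"] and "lookup_policy" not in t["steps"]]
fail_rate = collections.Counter(t["outcome"] for t in missing_lookup)
print(len(missing_lookup), dict(fail_rate))  # → 1 {'fail': 1}
```

A human skimming transcripts might catch this pattern in trace 2 and miss it elsewhere; a query runs over all traces every time, which is what lets code execution surface patterns that surface-level reading misses.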
The Skillbook
Instead of a growing, opaque system prompt, Kayba maintains a Skillbook — a structured collection of skills where each entry has:
- A clear description of the learned behavior
- Helpful/harmful/neutral counters from validation
- Provenance tracking (which trace produced this skill)
- Section tags for organized prompt generation
You review and approve skills before they're deployed. The system learns, but you stay in control.
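A Skillbook entry might be modeled roughly like this. This is a sketch, not Kayba's actual schema — the field names are assumptions based on the properties listed above:

```python
from dataclasses import dataclass

@dataclass
class SkillEntry:
    description: str          # the learned behavior
    helpful: int = 0          # validation counters
    harmful: int = 0
    neutral: int = 0
    source_trace: str = ""    # provenance: which trace produced this skill
    section: str = "general"  # tag used when assembling the prompt
    approved: bool = False    # human review gate before deployment

entry = SkillEntry(
    description="Check inventory before confirming an order",
    helpful=4,
    source_trace="trace-0017",  # hypothetical trace ID
    section="order-handling",
)
entry.approved = True  # a human signs off before the skill reaches the prompt
```

The structure is the point: every behavior the agent has learned is a discrete, inspectable record rather than a sentence buried somewhere in a 3,000-token system prompt.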
When Manual Prompting Still Makes Sense
Let's be honest: manual prompting is fine if:

- Your agent handles fewer than ~50 conversations per week
- You have one person who understands every edge case
- Your failure modes are simple and predictable
- You're prototyping, not in production
When to Switch to Kayba
Consider Kayba when:
- You're spending hours reading traces to find failure patterns
- Prompt fixes keep breaking other behaviors
- Multiple people are editing the same system prompt
- Your agent handles 100+ conversations per week
- Failures are costing you money (wrong refunds, missed escalations, bad code suggestions)
- You want a systematic record of what your agent has learned and why
Getting Started
Kayba is open-source (MIT licensed, 2k+ GitHub stars). You can start with the framework alone or use the hosted dashboard for a visual workflow.
```shell
pip install ace-framework
```
The framework analyzes your existing agent traces — markdown, JSON, or plain text — regardless of which LLM provider or agent framework you use. No code changes to your agent required.
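Because traces can arrive as markdown, JSON, or plain text, ingestion only needs to normalize them to text before analysis. A minimal, framework-agnostic sketch of that idea (not Kayba's actual ingestion code):

```python
import json
from pathlib import Path

def load_trace(path: str) -> str:
    """Normalize a trace file to plain text for analysis (illustrative only)."""
    raw = Path(path).read_text(encoding="utf-8")
    if path.endswith(".json"):
        # Pretty-print JSON traces so they read like a transcript.
        return json.dumps(json.loads(raw), indent=2)
    # Markdown and plain text pass through unchanged.
    return raw
```

This is also why no code changes to your agent are required: analysis consumes the logs your agent already produces, whatever their format.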
- Documentation — Setup guides and API reference
- GitHub — Source code and examples
- Dashboard — Hosted version with visual Skillbook management