
Alternatives to Manual Agent Tuning

A complete guide to agent improvement approaches: manual prompt engineering, fine-tuning, DSPy, evolutionary optimization, and trace-based learning. Compare all options.

March 11, 2026
Educational · Alternatives · Agent Improvement · Comparison

The Problem: Manual Tuning Does Not Scale

Every team building AI agents hits the same wall. The agent works well on the first ten scenarios. Then edge cases appear. You rewrite the prompt. It fixes one failure and introduces another. You add more instructions, more examples, more guardrails. The prompt grows to 3,000 tokens, then 5,000, then 8,000. Nobody remembers why half the instructions are there. Every change is a gamble.

This is manual prompt tuning, and it is how most teams improve their agents today. It works at small scale. It falls apart when you have dozens of failure modes, multiple agent types, and a team that needs to ship improvements without breaking what already works.

The good news: manual tuning is not the only option. A growing ecosystem of tools and techniques exists for making agents better over time. The bad news: these approaches differ dramatically in what they require, what they produce, and which problems they actually solve.

This guide covers every major approach to agent improvement. We describe each one fairly, explain the tradeoffs, and help you figure out which fits your situation. Kayba is one option among many here. We think it is the right choice for a specific set of teams, but we will explain why rather than assume it.

The Approaches

1. Manual Prompt Engineering

The baseline that everyone starts with. You run your agent, observe where it fails, and rewrite the system prompt or few-shot examples to fix those failures. This is iterative, human-driven, and entirely manual.

Manual prompt engineering is powerful because it requires no infrastructure, no additional tools, and no training pipeline. You can start immediately with nothing more than an LLM API and a text editor. For simple agents with narrow scope, this is often all you need.

The problems emerge at scale. Every prompt change is a judgment call made by whichever engineer happens to be debugging that day. There is no systematic way to know which failures are most common, whether a fix actually improved overall performance, or whether it regressed something else. Prompt engineering knowledge lives in people's heads, not in a structured system. When the engineer who wrote the prompt leaves the team, the reasoning behind each instruction leaves with them.

Pros:

  • Zero setup cost. Works immediately with any LLM.
  • Full human control over every change.
  • No additional dependencies or infrastructure.

Cons:

  • Does not scale beyond a handful of failure modes.
  • No systematic failure analysis. You fix what you happen to notice.
  • Changes are untraceable. Prompt archaeology becomes a real problem.
  • Requires deep domain expertise for every edit.
  • Regression risk increases with every modification.

Best for: Early-stage agents, simple single-task systems, teams with one dedicated prompt engineer and a narrow use case.

2. Fine-Tuning

Fine-tuning adjusts the model's weights directly by training on curated datasets of input-output examples. Instead of telling the model what to do through prompts, you change the model itself. OpenAI, Google, and most open-weight model ecosystems offer fine-tuning APIs or tooling, and some providers support it through cloud platforms.
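To make "curated datasets" concrete, here is a minimal sketch of training data preparation. It uses the chat-format JSONL layout that OpenAI's fine-tuning API accepts; other providers use similar but not identical formats, and the triage examples are invented:

```python
import json

# Invented triage examples in the chat-format JSONL that OpenAI's fine-tuning
# API accepts; other providers use similar but not identical layouts.
examples = [
    {
        "messages": [
            {"role": "system", "content": "Classify the support ticket into a queue."},
            {"role": "user", "content": "My card was charged twice for one order."},
            {"role": "assistant", "content": "billing/duplicate_charge"},
        ]
    },
    {
        "messages": [
            {"role": "system", "content": "Classify the support ticket into a queue."},
            {"role": "user", "content": "Tracking says delivered but nothing arrived."},
            {"role": "assistant", "content": "shipping/lost_package"},
        ]
    },
]

# One JSON object per line: the standard fine-tuning upload format.
with open("train.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```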

Fine-tuning is the most powerful approach when it works. A well-tuned model can internalize patterns that are difficult to express in a prompt. It can learn formatting conventions, domain-specific reasoning chains, and nuanced decision boundaries that would require thousands of tokens to describe as instructions. Fine-tuned models are also faster at inference because they need shorter prompts.

The cost is significant. You need high-quality labeled training data, typically hundreds to thousands of examples. Collecting and curating this data is expensive. Every time the task changes, you need new data and a new training run. Fine-tuned models can overfit to training data patterns and fail on novel inputs. They are opaque: you cannot inspect what the model learned or audit its reasoning. And you are locked to a specific model version. When the base model is updated, your fine-tune may need to be redone.

Pros:

  • Highest potential performance ceiling. The model genuinely learns.
  • Shorter prompts, lower latency, reduced token costs at inference.
  • Can encode patterns that are impractical to express in prompts.

Cons:

  • Requires substantial labeled training data (hundreds to thousands of examples).
  • Training is expensive and slow. Iteration cycles are measured in hours or days.
  • Opaque. You cannot inspect what the model learned or why it behaves a certain way.
  • Overfitting risk. Performance can degrade on out-of-distribution inputs.
  • Model lock-in. Switching providers means retraining from scratch.
  • Catastrophic forgetting. Improving one capability can degrade others.

Best for: High-volume, stable tasks where the input/output patterns are well-defined and you have the data to support training. Classification, extraction, and formatting tasks where consistent outputs matter more than flexible reasoning.

3. DSPy (Programmatic Prompt Optimization)

DSPy, developed at Stanford, takes a programming-languages approach to prompt optimization. Instead of writing prompts by hand, you define your agent as a program with typed signatures (input/output specifications), and DSPy's optimizers automatically search for the best prompt instructions and few-shot examples.

The core idea is compelling: treat prompt engineering as a compilation problem. You write the logic, define a metric, provide training examples, and the optimizer figures out the right prompt. DSPy supports multiple optimization strategies, including bootstrapped few-shot (generating demonstrations and keeping those that pass your metric), MIPRO (instruction proposal and selection), and MIPROv2 (multi-stage search using Bayesian optimization over instruction and example candidates).
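Here is a minimal sketch of the shape of a DSPy program, using an invented ticket-triage task. Exact APIs vary across DSPy versions, so treat this as illustrative rather than copy-paste ready:

```python
import dspy

# Illustrative model choice; assumes an OpenAI API key is configured.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

class TicketTriage(dspy.Signature):
    """Route a support ticket to the right queue."""
    ticket: str = dspy.InputField()
    queue: str = dspy.OutputField(desc="one of: billing, shipping, technical")

triage = dspy.Predict(TicketTriage)

# A toy trainset; real optimization needs more examples than this.
trainset = [
    dspy.Example(ticket="My card was charged twice.", queue="billing").with_inputs("ticket"),
    dspy.Example(ticket="The package never arrived.", queue="shipping").with_inputs("ticket"),
]

def exact_match(example, prediction, trace=None):
    return example.queue == prediction.queue

# The optimizer searches for instructions/demonstrations that maximize the metric.
optimizer = dspy.BootstrapFewShot(metric=exact_match)
compiled_triage = optimizer.compile(triage, trainset=trainset)
```

The point to notice is that the prompt itself never appears in the code; you specify the signature and the metric, and the optimizer produces the prompt.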

DSPy works well for pipelines where you can define a clear metric and have labeled data to optimize against. It is particularly strong at few-shot example selection, which is one of the highest-leverage prompt engineering techniques. The programmatic approach also makes prompt changes reproducible and version-controllable.

The limitations are real. DSPy requires you to restructure your agent as a DSPy program, which means adopting its abstractions (Modules, Signatures, Optimizers). If your agent is built on LangChain, CrewAI, or a custom framework, integration is nontrivial. Optimization runs can be expensive, requiring many LLM calls to search the prompt space. And DSPy optimizes individual modules, not the end-to-end behavior of multi-step agents where failures cascade across steps.

Pros:

  • Automated prompt optimization. Removes the human bottleneck.
  • Reproducible. Prompt changes are versioned and deterministic.
  • Strong few-shot example selection.
  • Active research community and Stanford backing.

Cons:

  • Requires restructuring your agent into DSPy's programming model.
  • Optimization runs are expensive (many LLM calls per run).
  • Metric-dependent. You need a well-defined, computable evaluation metric.
  • Optimizes individual modules, not holistic agent behavior.
  • Learning curve. The abstraction layer adds complexity.

Best for: Teams comfortable with a programmatic approach who have clear metrics and are building modular pipelines. Particularly strong for retrieval-augmented generation and multi-hop QA systems.

4. Evolutionary and Population-Based Optimization

Approaches like EvoPrompt and GEPA (Genetic-Pareto prompt evolution) apply evolutionary algorithms to prompt optimization. The idea: generate a population of prompt variants, evaluate them against a fitness function, and use selection, crossover, and mutation to evolve better prompts over generations.

GEPA specifically uses LLMs as the mutation and crossover operators. Instead of random string manipulation, it asks the LLM to combine successful elements from high-performing prompts and introduce variations. This leverages the LLM's understanding of language to make intelligent modifications.
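The skeleton of the loop is straightforward. Below is a toy sketch in which evaluate() and llm_combine() are placeholders standing in for a real fitness function (running the agent against a labeled eval set) and an LLM-driven crossover operator:

```python
import random

# Toy stand-ins: evaluate() would run the agent on an eval set and score it;
# llm_combine() would ask an LLM to merge the strongest elements of two parents.
def evaluate(prompt: str) -> float:
    return -abs(len(prompt) - 120) / 120  # placeholder fitness: prefer ~120-char prompts

def llm_combine(parent_a: str, parent_b: str) -> str:
    return parent_a[: len(parent_a) // 2] + parent_b[len(parent_b) // 2 :]  # crude crossover

def evolve(seeds: list[str], generations: int = 10, pop_size: int = 8) -> str:
    population = list(seeds)
    for _ in range(generations):
        survivors = sorted(population, key=evaluate, reverse=True)[: pop_size // 2]  # selection
        children = [
            llm_combine(random.choice(survivors), random.choice(survivors))
            for _ in range(pop_size - len(survivors))  # refill the population via crossover
        ]
        population = survivors + children
    return max(population, key=evaluate)

best = evolve(["You are a helpful support agent.", "Answer concisely and cite policy."])
```

The expense shows up in evaluate(): every individual in every generation requires a full evaluation run, which in a real system means many agent executions per candidate prompt.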

Evolutionary approaches have a theoretical advantage: they explore a much larger space of prompt variations than any human could. They can discover non-obvious phrasings and instruction combinations. And they are naturally parallel, making them suitable for batch evaluation.

In practice, evolutionary optimization is computationally expensive. Each generation requires evaluating an entire population of prompts, and you typically need many generations. The fitness function must be fully automated, which means you need either labeled test data or a reliable automated evaluator. The resulting prompts can be difficult to interpret, containing phrases that work but that no human would have written. This makes them hard to maintain and debug.

Pros:

  • Explores a vast prompt space. Can find non-obvious improvements.
  • Naturally parallel. Scales with compute.
  • Does not require human prompt engineering expertise.

Cons:

  • Very expensive. Many evaluations across many generations.
  • Requires a fully automated evaluation function.
  • Resulting prompts may be unreadable and unmaintainable.
  • Slow iteration cycles. Evolution takes time.
  • Limited to prompt-level changes. Cannot discover new agent architectures or workflows.

Best for: Research settings or high-value production systems where you can afford extensive compute for optimization and have a robust automated evaluation pipeline.

5. LLM-as-Judge Evaluation

LLM-as-Judge is less of an improvement method and more of an evaluation layer, but many teams use it as the foundation for their improvement process. The approach: use a separate LLM to evaluate your agent's outputs against criteria like correctness, helpfulness, safety, or adherence to instructions.

Frameworks like Braintrust, Langfuse, and Patronus provide infrastructure for running LLM-based evaluations at scale. You define evaluation criteria, run your agent on test cases, and the judge LLM scores each output. Over time, you build a dataset of scored examples that reveals systematic patterns.
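A minimal judge can be a single scoring call. The sketch below uses the OpenAI client with JSON-mode output; the criteria, model choice, and prompt wording are illustrative assumptions, not a recommendation:

```python
import json
from openai import OpenAI  # assumes the openai package is installed and OPENAI_API_KEY is set

client = OpenAI()

JUDGE_TEMPLATE = """You are grading an AI agent's answer. Score 1-5 for each criterion.
Respond with JSON only: {{"correctness": int, "helpfulness": int, "adherence": int}}

Question: {question}
Agent answer: {answer}"""

def judge(question: str, answer: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o-mini",                       # illustrative judge model choice
        temperature=0,                             # damp run-to-run inconsistency
        response_format={"type": "json_object"},   # constrain output to parseable JSON
        messages=[{"role": "user", "content": JUDGE_TEMPLATE.format(question=question, answer=answer)}],
    )
    return json.loads(response.choices[0].message.content)

scores = judge("How do I reset my password?", "Click 'Forgot password' on the login page.")
```

Aggregating these scores over hundreds of test cases is what turns a judge into a regression detector.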

LLM-as-Judge is valuable because it scales evaluation beyond what humans can review manually. A human might review 50 agent outputs per day. An LLM judge can evaluate thousands. This makes it practical to run evaluations on every deployment, catching regressions that manual review would miss.

The limitation is that LLM-as-Judge tells you what is wrong but does not fix it. It is a diagnostic tool, not a treatment. You still need a human (or another system) to translate evaluation results into prompt improvements. LLM judges also have well-documented biases: they prefer longer outputs, are sensitive to output order, and can be inconsistent across runs. Calibrating a judge to match human preferences is its own engineering challenge.

Pros:

  • Scales evaluation far beyond human review capacity.
  • Systematic. Every output is evaluated against consistent criteria.
  • Catches regressions on deployment.
  • Integrates with CI/CD pipelines.

Cons:

  • Diagnostic only. Identifies problems but does not fix them.
  • Judge biases (verbosity preference, position bias, inconsistency).
  • Requires careful calibration against human judgments.
  • Expensive at scale (every evaluation is an LLM call).
  • Does not produce actionable improvement steps.

Best for: Teams that need systematic evaluation and already have a process for translating evaluation results into improvements. Excellent as a complement to other approaches on this list.

6. Memory-Based Improvement

Memory systems like Mem0, Zep, and Letta give agents the ability to store and retrieve information across sessions. While primarily designed for personalization and context continuity, some teams use memory as an improvement mechanism: store successful approaches, retrieve them in similar future situations, and hope the agent performs better.

The idea has intuitive appeal. If the agent successfully resolved a complex refund case last Tuesday, storing that interaction and retrieving it when a similar case appears should help. And for personalization tasks, memory is genuinely the right tool. An agent that remembers a customer's preferences will outperform one that starts fresh every time.
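Mechanically, memory-as-improvement is a store-and-retrieve loop. The library-agnostic sketch below uses naive string similarity where production systems like Mem0 or Zep would use embedding search, scoping, and persistent storage:

```python
from difflib import SequenceMatcher

# Library-agnostic sketch: a flat in-memory store plus naive retrieval.
memory: list[dict] = []

def store(task: str, resolution: str, succeeded: bool) -> None:
    memory.append({"task": task, "resolution": resolution, "succeeded": succeeded})

def recall(task: str, k: int = 3) -> list[dict]:
    successes = [m for m in memory if m["succeeded"]]
    successes.sort(
        key=lambda m: SequenceMatcher(None, m["task"], task).ratio(), reverse=True
    )
    return successes[:k]  # retrieved examples get spliced into the next prompt

store("refund for damaged item", "verified order, issued refund, logged case", True)
similar = recall("refund request, item arrived broken")
```

Note what is missing: nothing here explains why the stored resolution worked, and nothing prunes the store as it grows. Those gaps are the subject of the next paragraph.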

Where memory falls short as an improvement strategy is in the gap between recall and learning. Storing a successful interaction gives the agent an example, but it does not teach the agent why that approach worked or how to adapt it to variations. Memory also grows without curation. After thousands of interactions, the memory store contains successful examples, failed examples, outdated information, and contradictory approaches. Without a mechanism to distinguish signal from noise, more memory can actually degrade performance by filling the context window with marginally relevant information.

For a deeper exploration of this distinction, see our guide on memory vs learning for AI agents.

Pros:

  • Enables personalization and context continuity.
  • Simple to implement. Many plug-and-play options.
  • Valuable for customer-facing agents that need session history.

Cons:

  • Stores what happened, not what to do differently.
  • Memory grows without curation. Signal-to-noise ratio degrades over time.
  • No mechanism for extracting generalizable lessons.
  • Can fill context windows with marginally relevant information.
  • Does not analyze failures or identify improvement patterns.

Best for: Personalization, context continuity across sessions, and agents that need to recall user-specific information. Not a substitute for a genuine improvement system.

7. Trace-Based Learning (Kayba)

Kayba takes a different approach from all of the above. Instead of optimizing prompts directly (like DSPy or evolutionary methods) or storing raw interactions (like memory systems), Kayba analyzes execution traces to extract structured, transferable skills and generates improved prompts from those skills.

The pipeline works in three stages. First, trace analysis examines how your agent handled real interactions, identifying patterns of success and failure across many executions. Second, skill extraction distills those patterns into structured Skillbook entries: reusable knowledge about how to succeed at specific types of tasks, complete with the reasoning behind each skill. Third, prompt generation translates Skillbook entries into improved system prompts that change your agent's behavior.
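In pseudocode terms, the pipeline looks roughly like the sketch below. The function and field names are hypothetical illustrations of the three stages, not Kayba's actual API (see the ace-framework documentation for that):

```python
from dataclasses import dataclass

# Hypothetical names sketching the three-stage pipeline described above.
@dataclass
class Skill:
    guidance: str             # what to do
    rationale: str            # why it works
    source_traces: list[str]  # trace IDs that contributed this pattern

def extract_skills(traces: list[dict]) -> list[Skill]:
    # Toy stand-in: real trace analysis is LLM-driven pattern extraction.
    return [
        Skill(
            guidance=f"For '{t['task']}' tasks, avoid: {t['error']}",
            rationale="Recurring failure observed across production traces",
            source_traces=[t["id"]],
        )
        for t in traces
        if not t["success"]
    ]

def generate_prompt(base_prompt: str, skillbook: list[Skill]) -> str:
    lessons = "\n".join(f"- {s.guidance} ({s.rationale})" for s in skillbook)
    return f"{base_prompt}\n\nLearned skills:\n{lessons}"

skillbook = extract_skills([
    {"id": "t1", "task": "refund", "error": "skipping order verification", "success": False},
    {"id": "t2", "task": "refund", "error": "", "success": True},
])
improved_prompt = generate_prompt("You are a support agent.", skillbook)
```

The structural point is that the learned artifact is the Skillbook, not the prompt: the prompt is regenerated from skills, so every line of it traces back to evidence.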

What makes this approach distinct is transparency. Every improvement is traceable. You can inspect the Skillbook to see exactly what the system learned, why it learned it, and which traces contributed to each skill. Skills are human-readable and editable. If the system extracts a skill you disagree with, you can modify or remove it. This is fundamentally different from fine-tuning (where learned behavior is locked in opaque weights) or evolutionary optimization (where the resulting prompts may be uninterpretable).

Kayba is framework-agnostic. It works with any agent framework (LangChain, CrewAI, custom implementations) because it operates on traces, not on framework-specific abstractions. It does not require labeled training data, a custom evaluation metric, or restructuring your agent. You point it at your traces, and it learns.

The research foundation is the Reflective LLM Meta-Strategy (RLM), which demonstrated that agents using structured trace analysis and skill extraction consistently outperform those using raw experience or no learning at all. The tau2-bench benchmark results showed improvements of 38-95% across different LLM backends.

Pros:

  • Automated improvement from real production traces. No labeled data required.
  • Fully transparent. Every skill is inspectable, editable, and auditable.
  • Framework-agnostic. Works with any agent architecture.
  • No fine-tuning. Improvements live in prompts, not model weights.
  • Research-backed with published benchmark results.
  • Open-source core. No vendor lock-in.

Cons:

  • Requires sufficient trace volume for meaningful pattern extraction.
  • Prompt-level improvements have a ceiling compared to fine-tuning.
  • Newer tool. Smaller community than established alternatives.
  • Skill quality depends on trace quality. Garbage in, garbage out.

Best for: Teams running agents in production who want automated, transparent improvement without fine-tuning or restructuring their agent architecture. Particularly strong for complex, multi-step agents where failure patterns are hard to identify manually.

Comparison Table

| Dimension | Manual Prompt Engineering | Fine-Tuning | DSPy | Evolutionary (GEPA) | LLM-as-Judge | Memory-Based | Kayba (Trace-Based) |
|---|---|---|---|---|---|---|---|
| Setup cost | None | High (data + training) | Medium (restructure) | Medium (eval pipeline) | Medium (judge setup) | Low | Low (point at traces) |
| Ongoing cost | High (human time) | High (retraining) | Medium (optimization runs) | Very high (compute) | Medium (eval calls) | Low | Low (automated) |
| Labeled data needed | No | Yes (hundreds+) | Yes (some) | Yes (eval set) | Yes (calibration set) | No | No |
| Transparency | Full (human-written) | None (opaque weights) | Partial (selected examples) | Low (evolved prompts) | High (scored outputs) | High (stored interactions) | Full (auditable Skillbook) |
| Framework lock-in | None | Model-specific | DSPy framework | Eval framework | Eval framework | Memory provider | None |
| Improvement type | Prompt edits | Weight changes | Prompt optimization | Prompt evolution | Diagnosis only | Context augmentation | Skill extraction + prompts |
| Handles novel failures | If noticed by human | If in training data | If captured by metric | If in eval set | Yes (detects) | Retrieves similar past cases | Yes (from trace patterns) |
| Scales with agent count | Poorly | Poorly (per-model) | Moderately | Moderately | Well | Well | Well |
| Multi-step agent support | Manual analysis | Limited | Module-level | Prompt-level | End-to-end scoring | Session-level | Trace-level (end-to-end) |
| Time to first improvement | Minutes | Days | Hours | Hours to days | Hours (diagnosis only) | Minutes | Hours |
| Maintenance burden | High (prompt archaeology) | High (data + retraining) | Medium (metric upkeep) | Medium (eval upkeep) | Medium (judge calibration) | Low to medium | Low (automated curation) |

Which Approach Fits Which Situation

You should use manual prompt engineering if you have a simple agent with a narrow scope, your team has one person who understands the domain deeply, and you are not yet at the scale where systematic improvement matters. Start here. Move to something else when you feel the pain.

You should use fine-tuning if you have a high-volume, stable task with well-defined input/output patterns and the resources to collect and maintain training data. Classification, extraction, and formatting tasks are the sweet spot. If your task definition changes frequently, fine-tuning will be a treadmill.

You should use DSPy if you are building modular LLM pipelines, you are comfortable adopting a new programming framework, and you have clear metrics to optimize against. DSPy is a strong choice for retrieval-augmented generation systems and structured multi-hop reasoning.

You should use evolutionary optimization if you are in a research setting or have a high-value production system where the compute cost of optimization is justified by the performance gains. You need a robust automated evaluation pipeline.

You should use LLM-as-Judge if you need systematic evaluation at scale and already have a process for acting on evaluation results. It pairs well with every other approach on this list. It is a diagnostic tool, not a standalone improvement method.

You should use memory-based tools if your agent needs personalization, session continuity, or user-specific context. Memory solves a real problem. Just do not expect it to make your agent smarter at its core tasks.

You should use Kayba if you are running agents in production, you want automated improvement without fine-tuning, and you care about transparency. Kayba is particularly strong when you have complex agents with many failure modes and you need a systematic way to identify, understand, and fix them. It works alongside your existing framework and does not require labeled data or restructuring.

Combining Approaches

These approaches are not mutually exclusive. The most effective agent improvement stacks combine multiple layers:

  • LLM-as-Judge + Kayba: Use LLM judges to score outputs systematically, then use Kayba to analyze the failure patterns and generate improvements. The judge identifies problems. Kayba fixes them.
  • Memory + Kayba: Use memory for personalization and session context. Use Kayba for behavioral improvement. They solve different problems and complement each other cleanly.
  • DSPy + LLM-as-Judge: Use DSPy to optimize individual pipeline modules and LLM judges to evaluate end-to-end performance.
  • Manual engineering + any automation: Start with manual prompts, then layer on automated improvement once you understand the problem space. No tool replaces understanding your domain.

Why Kayba for the Middle Ground

Many teams are stuck between two extremes. Manual prompt engineering works but does not scale. Fine-tuning scales but requires resources and expertise that most teams do not have. DSPy and evolutionary approaches require restructuring your agent or building evaluation infrastructure.

Kayba occupies the middle ground. It automates the improvement process without requiring training data, without locking you into a framework, and without hiding what it learned behind opaque weights. The Skillbook gives you a transparent, auditable record of your agent's accumulated expertise. Every skill is traceable back to the real interactions that produced it.

For teams that want to move beyond manual tuning but are not ready for (or do not need) fine-tuning, trace-based learning offers a practical path. You keep your existing agent architecture, point Kayba at your traces, and get systematic improvements with full visibility into what changed and why.

Getting Started

Kayba is open-source and available on PyPI:

```
pip install ace-framework
```

For a hosted experience with a visual dashboard, trace storage, and team collaboration features, visit the Kayba Console.

If you want to see how Kayba compares on real benchmarks, the tau2-bench results show performance improvements across multiple LLM backends. For a deeper understanding of the trace analysis and skill extraction process, the context engineering guide covers the research foundations.

Whichever approach you choose, the important thing is to move beyond ad-hoc prompt editing. Your agents deserve a systematic improvement process. The right tool depends on your constraints, your scale, and your team. This guide should help you make that choice with clarity.