Fix your agent.
Know it worked.

Most fixes ship on a hope. Kayba catches failures, answers when you ask, and proves the fix actually worked — all from your real failure patterns.

# agent-alerts
Kayba App · 9:41
Caught a new failure mode
refund_lookup fails for guest checkout. Wrote an eval.
See the eval · Open PR
Kayba App
route=guest skips the customer_id lookup, so the policy check never runs. Here's the trace:
trace · tr_8a2
06 route · guest
07 customer_id · skipped
08 refund_policy · never ran
Kayba App · 9:44
Fix merged. Eval pass rate: 32% → 73% (↑ 41 pts)
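
To make "wrote an eval" concrete, here is a minimal Python sketch of the kind of check that could be derived from trace tr_8a2. The trace shape (a list of step dicts with `name` and `status`) and the function name `refund_policy_ran` are assumptions for illustration, not Kayba's actual eval format.

```python
from typing import Dict, List

Step = Dict[str, str]  # hypothetical trace step: {"name": ..., "status": ...}


def refund_policy_ran(trace: List[Step]) -> bool:
    """Pass only if both the customer_id lookup and refund_policy ran."""
    ran = {step["name"] for step in trace if step["status"] == "ran"}
    return {"customer_id", "refund_policy"} <= ran


# Steps 06-08 of the failing trace, as shown in the alert.
failing_trace = [
    {"name": "route", "status": "ran"},  # route = guest
    {"name": "customer_id", "status": "skipped"},
    {"name": "refund_policy", "status": "never ran"},
]

# Hypothetical trace after a fix that runs the lookup for guest checkouts too.
fixed_trace = [
    {"name": "route", "status": "ran"},
    {"name": "customer_id", "status": "ran"},
    {"name": "refund_policy", "status": "ran"},
]

print(refund_policy_ran(failing_trace))
print(refund_policy_ran(fixed_trace))
```

The check fails on the original guest-checkout trace and passes once both steps run, which is what lets the same eval confirm a fix later.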

Every change comes with proof it worked.

An error lands in your observability stack. Kayba turns it into a custom eval, proposes a fix, and tracks how that fix performs after it ships. Want to dig in? Just ask. Kayba surfaces failure modes you'd never think to write an eval for, and answers with the cited trace.

Kayba App · 9:41 AM
Fix verified

Retry timed-out tool calls instead of failing the run

When a tool call timed out, the agent stopped the whole run instead of retrying. Kayba's patch retries the call. Custom eval from this failure: passing. Existing suite: 16/16.

agent/tools.py · eval 16/16
Review the fix · Open PR
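
For readers who want to picture the change, a retry-on-timeout wrapper might look roughly like the sketch below. It is not the patch from the PR: `call_with_retry`, its parameters, and the `flaky_lookup` stand-in tool are all hypothetical, and it assumes the tool signals a timeout by raising Python's built-in `TimeoutError`.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(tool: Callable[[], T], max_attempts: int = 3,
                    backoff_s: float = 0.0) -> T:
    """Retry a tool call on TimeoutError instead of failing the run."""
    for attempt in range(1, max_attempts + 1):
        try:
            return tool()
        except TimeoutError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the timeout to the caller
            time.sleep(backoff_s * attempt)  # linear backoff between tries
    raise ValueError("max_attempts must be at least 1")


# Stand-in tool that times out twice, then succeeds.
calls = {"n": 0}


def flaky_lookup() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("tool call timed out")
    return "ok"


result = call_with_retry(flaky_lookup)
print(result, calls["n"])
```

Capping the attempts keeps a persistently failing tool from stalling the run forever; the final timeout is still raised so the failure stays visible in traces.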
The problem

You fix your agent blind.

01

You ship fixes on a hope

You tweak a prompt, change a tool, redeploy, then wait to see if the error comes back. No proof the fix worked, just a gut feeling.

02

Dashboards show charts, not answers

Error rate spiked. You can see that something broke. You still can’t see why, or whether your last change is what broke it.

03

Evals only catch what you defined

Semantic failures, edge-case regressions, unknown unknowns. None of them show up in a suite you wrote in advance. You find out when a customer hits them.

How it works

From failure to verified fix in 8 minutes.

Connect Kayba once. After that, every failure becomes an eval, every fix is verified against your suite, and every change is tracked over time. Here's one real session, start to finish.

setup · 1 min

One minute to set up

Point Kayba at wherever your traces and errors already live: Sentry, PostHog, OpenTelemetry. Then it just listens.

t+00:00

Every failure becomes an eval

When an error lands, Kayba pulls out all the context (the trace, the error, the code) and turns the failure into a custom eval, a reproducible test grounded in your real failure patterns.

error + trace + code → kayba → eval
t+00:42

Ask why. Get the cited trace.

Dig into any failure in plain English, from Slack or your terminal. Kayba answers with the root cause and the trace to prove it, so you’re never guessing.

t+03:12

Kayba proposes a fix. You decide.

Every fix lands as a PR with the trace and the error attached. Merge it, change it, or write your own. Either way, the eval keeps checking whether the failure comes back.

Retry timed-out tool calls #142
kayba:fix/tool-retry → main
open
agent/tools/router.py
eval tool_timeout · running live
you decide
Change it · Write your own
t+08:00

Proof it worked

As new traces come in, the fix keeps getting checked against its eval. Watch the pass rate climb and spot regressions the moment they happen. From failure to verified fix: under eight minutes.

eval pass rate · May 07 to May 14 · fix shipped ↑
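
The arithmetic behind a pass-rate change is simple enough to sketch. The snippet below is an illustration only: `pass_rate` is a hypothetical helper, and the sample results are made up to match the 32% and 73% figures in the alert at the top of this page.

```python
from typing import List


def pass_rate(results: List[bool]) -> float:
    """Percentage of eval runs that passed (0.0 for an empty list)."""
    if not results:
        return 0.0
    return 100.0 * sum(results) / len(results)


# Illustrative data chosen to match the 32% and 73% figures in the alert.
before_fix = [True] * 32 + [False] * 68
after_fix = [True] * 73 + [False] * 27

# Difference in percentage points, not relative percent change.
delta_pts = pass_rate(after_fix) - pass_rate(before_fix)
print(f"{pass_rate(before_fix):.0f}% -> {pass_rate(after_fix):.0f}% "
      f"({delta_pts:+.0f} pts)")
```

Reporting the change in percentage points keeps it unambiguous: 32% to 73% is a gain of 41 points.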
Plug into your stack

Use the trace storage you already have.

Your existing tools already store your traces and errors. Kayba reads from them, turns failures into evals, and tells you whether the fix held.

Langfuse · LangSmith · MLflow · OpenTelemetry · Sentry · PostHog

Ready to fix your agents?

Frequently asked questions.