For AI agent companies serving many customers

Custom for every customer. Better every week.

Kayba finds what broke for each customer before they complain, drafts the fix, and tests it against every other customer. You approve what ships into that customer’s playbook.

The problem

Your customers find the bugs before you do.

You run one agent for many customers. Issues hide in traces nobody can read, every customer wants different behaviour, and your best people end up debugging the agent instead of growing the business.

01
Trace log · Tue · 200k lines today

What broke is buried in traces

One person on the team can find an issue in ten minutes. For everyone else it takes hours, so mostly nobody looks.

02
Customer asks · week 23
ACME · Never disclose it’s an AI · live
BOLT · Always disclose it’s an AI · live
CORAL · No refunds w/o sign-off · queued
+18 exception requests this week

Every customer wants a different agent

Conflicting demands, exception lists, hand-tuned prompts per tenant. Today this scales one way: hiring.

03
Blast radius · 23 customers

“I fix this, something else breaks”

A tweak for one customer regresses another. So uncertain changes don’t ship, and known issues stay.

The solution

Kayba finds it, explains it, drafts the fix.
You keep the pen.

The same loop on every trace, for every customer. Approved fixes land in that customer’s playbook and compound.

01

Know what broke before the customer does

Today

Issues surface as customer complaints. Then someone reconstructs what happened from 200,000 lines of logs.

With Kayba

Kayba reads every trace as it lands and flags failures with the evidence attached: which customer, how often, trending or not. Sentry, but for agent behaviour.

Live issues
All customers
  • Acme · refund promised outside policy · found by Kayba · 12×
  • Northpeak · AI status disclosed against policy · your check
  • CoLearn · same answer looping · found by Kayba
  • Relay · outdated rate card quoted · your check
Found by Kayba · needs your review
Refunds promised on orders past the 30-day window
38 traces · Acme Logistics · first seen 2h ago
02

Get the fix drafted, never auto-applied

Today

Diagnosis is bottlenecked on the one person who can read the traces. Everyone else queues behind them.

With Kayba

Kayba pinpoints the root cause and drafts the fix in plain English. It never edits your agent on its own. Your team keeps the pen.

kayba-triage
Acme Logistics
Sarah · CSM · 2:14 PM

@kayba Acme says the agent promised a full refund on a 60-day-old order. Can you scope?

Kayba · App · 2:15 PM
Scoping

Found 4 traces in the last 24h matching the complaint. Three ended with an out-of-policy promise.

Root cause

The agent checks the refund amount, but never the order date.

Proposed fix

New rule for Acme’s playbook: confirm the order is within the 30-day window before promising a refund. Ready to test against every customer’s checks.
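
The proposed rule amounts to a date check that runs before any refund promise. Below is a minimal sketch of that idea, not Kayba's implementation: the function name, the `REFUND_WINDOW_DAYS` constant, and the choice to express a playbook rule as Python are all assumptions made for illustration.

```python
from datetime import date, timedelta

# Hypothetical constant: the 30-day window named in the proposed rule.
REFUND_WINDOW_DAYS = 30


def may_promise_refund(order_date: date, today: date) -> bool:
    """Return True only if the order is still inside the refund window."""
    age = today - order_date
    # Reject future-dated orders as well as orders older than the window.
    return timedelta(0) <= age <= timedelta(days=REFUND_WINDOW_DAYS)


# The complaint in the thread: a refund promised on a 60-day-old order.
today = date(2025, 1, 31)  # illustrative date, not from the source
print(may_promise_refund(today - timedelta(days=60), today))  # False
print(may_promise_refund(today - timedelta(days=10), today))  # True
```

The agent in the complaint checked the refund amount only; a guard like this adds the missing order-date condition.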

03

Ship it without fearing the rest

Today

“I fix this, and something else gets broken.” So uncertain changes don’t ship, and known issues stay.

With Kayba

Every expectation becomes a named check. Each fix runs against the customer’s checks, and every other customer’s, before it ships.

Named checks
Acme Logistics
  • Follows the 30-day refund policy · 0/24 traces
  • Answers only from the knowledge base · 0/24 traces
  • Offers a human when asked · 0/24 traces
  • Confirms identity before account changes · 0/24 traces
  • Never invents discounts · 0/24 traces
Pre-ship run · Acme + 22 others · 0 of 5 passing
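
A pre-ship run of this kind can be pictured as a loop over every customer's named checks. The sketch below is illustrative only: the trace fields, the `pre_ship_run` and `safe_to_ship` helpers, and the sample data are hypothetical and do not describe Kayba's API. It assumes the traces were produced by replaying the agent with the candidate fix applied.

```python
from typing import Callable, Dict, List, Tuple

Trace = Dict[str, object]        # hypothetical trace shape
Check = Callable[[Trace], bool]  # a named check returns True on pass


def pre_ship_run(
    checks: Dict[str, Dict[str, Check]],  # customer -> check name -> check
    traces: Dict[str, List[Trace]],       # customer -> replayed traces
) -> Dict[str, Dict[str, Tuple[int, int]]]:
    """Run each customer's named checks over that customer's traces."""
    report: Dict[str, Dict[str, Tuple[int, int]]] = {}
    for customer, named in checks.items():
        customer_traces = traces.get(customer, [])
        report[customer] = {}
        for name, check in named.items():
            passed = sum(1 for t in customer_traces if check(t))
            report[customer][name] = (passed, len(customer_traces))
    return report


def safe_to_ship(report: Dict[str, Dict[str, Tuple[int, int]]]) -> bool:
    """A fix is safe only if every check passes on every trace."""
    return all(
        passed == total
        for named in report.values()
        for passed, total in named.values()
    )


# Hypothetical sample data for two customers.
checks = {
    "Acme": {
        "Follows the 30-day refund policy":
            lambda t: not t["refund_promised"] or t["order_age_days"] <= 30,
    },
    "Bolt": {
        "Always discloses it's an AI": lambda t: t["disclosed_ai"],
    },
}
traces = {
    "Acme": [
        {"refund_promised": False, "order_age_days": 60},
        {"refund_promised": True, "order_age_days": 10},
    ],
    "Bolt": [{"disclosed_ai": True}],
}

report = pre_ship_run(checks, traces)
print(safe_to_ship(report))  # True
```

The design point is that a fix for one customer is gated on every other customer's checks too, so a regression elsewhere blocks the ship.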
04

Every approved fix compounds into the playbook

Today

Per-customer behaviour lives in hand-tuned prompts and tribal knowledge. Scaling it means hiring.

With Kayba

Approved fixes become rules in that customer’s playbook. Every Friday, Kayba drafts the report you send them as proof: issues caught, rules added, score up.

Customer playbook
Acme Logistics
  • Confirm order date before refunds · from 38 traces
  • Never quote delivery dates from memory · from 12 traces
  • Escalate after 2 failed answers · from 47 traces
  • Offer a human handoff on frustration · from 9 traces
  • Address customers by first name · requested by Acme
Eval score since April · 60% → 78%
Weekly report · ready
Tuesday, 2:14 PM

A complaint becomes a shipped fix in five minutes.

The same complaint used to mean hours of log-digging by the one person who can read them. Here’s the whole loop, timestamped.

2:14 PM
Complaint lands

“The agent promised a refund we don’t offer.” Acme, in your Slack.

2:15 PM
Kayba scopes it

4 matching traces found, evidence attached.

2:16 PM
Root cause, in English

The agent never checks the order date.

2:18 PM
You approve the fix

Suggested, never auto-applied, and tested against every customer first.

2:19 PM
Shipped to Acme only

One new rule in Acme’s playbook. Nobody else’s behaviour moves.

On Friday, Acme’s weekly report shows the issue, the rule, and the eval score going up. The playbook isn’t just your tooling. It’s the proof you show each customer that their agent is getting better.

0%
eval pass rate, up from 60%
a customer measured it themselves, on their own evals
5 min
from complaint to shipped fix
finding the trace alone used to take 30 min to 2 hours
0%
of traces checked against every playbook
not the 2% a human can spot-check
0
changes applied without your approval
Kayba suggests; your team keeps the pen
Research

Built by researchers. Verified on benchmarks and production.

The team behind Kayba published the agent-learning research this product is built on, and benchmarked it in the open on τ2-bench before asking anyone to trust it with their customers.

Works with your stack

Kayba reads traces from Langfuse, MLflow, Logfire, or any OpenTelemetry source. No SDK rewrite, no exports. Your data stays where it is.

Compliance
SOC 2 · Underway
Benchmark
τ2-bench · Claude Haiku 4.5
Metric    Baseline    Kayba    Improvement
pass^1    41.2%       55.3%    +34.2%
pass^2    28.3%       44.2%    +56.2%
pass^3    22.5%       41.2%    +83.1%
pass^4    20.0%       40.0%    +100.0%

τ2-bench is a real-world agent benchmark by Sierra Research.
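
The Improvement figures appear to be relative gains over the baseline rather than differences in percentage points. That reading is an inference from the numbers, not something the page states. A short sketch that recomputes the column under that assumption:

```python
# Baseline and Kayba pass^k scores in percent, copied from the benchmark table.
scores = {1: (41.2, 55.3), 2: (28.3, 44.2), 3: (22.5, 41.2), 4: (20.0, 40.0)}


def relative_improvement(baseline: float, improved: float) -> float:
    """Relative gain over the baseline, in percent."""
    return (improved - baseline) / baseline * 100


for k, (baseline, kayba) in scores.items():
    print(f"pass^{k}: +{relative_improvement(baseline, kayba):.1f}%")
# pass^1: +34.2%
# pass^2: +56.2%
# pass^3: +83.1%
# pass^4: +100.0%
```

In percentage points the gains are smaller (for example, 14.1 points at pass^1), so it matters which measure is being quoted.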

The payoff

Stop debugging your agent.
Get back to growing the business.

“Every day I get at least 200,000 lines of logging. It’s impossible to go through all that.”
CTO building Agents

Kayba takes the trace-reading, the regression anxiety, and the per-customer tuning. Your customers see proof every Friday that their agent is improving, and your week goes to onboarding the next customer instead of debugging the last one. Bring a week of traces and we’ll show you. 20 minutes, no slides.