The Problem with Browser Agents

Browser agents — tools that navigate websites, fill forms, click buttons, and complete tasks on behalf of users — are one of the most promising and most frustrating categories of AI agents.

They fail in ways that are hard to predict and harder to fix:

- Navigation errors: clicking the wrong element, getting lost in multi-step flows, failing to find buttons that moved after a UI update
- Form-filling mistakes: entering data in the wrong fields, missing required fields, using the wrong date format
- State management failures: losing context during multi-page workflows, not recognizing when a page has changed
- Timing issues: acting before a page loads, missing dynamic content, failing on slow connections
- Recovery failures: failing to recover from error states, popup dialogs, or unexpected redirects

These failures are consistent. The same agent makes the same navigation mistake on the same website every time, because it has no mechanism to learn from previous attempts.

Measured Results

τ2-bench (Sierra Research)

These results were measured on τ2-bench, a benchmark that challenges agents to coordinate with users across complex enterprise domains. Kayba ran offline trace analysis on past runs, extracted strategies, and appended them to the agent's policy. The relative improvement grows as the consistency requirement gets stricter:

Metric	Baseline	With Kayba	Relative improvement
pass^1	41.2%	52.5%	+27.4%
pass^2	28.3%	44.2%	+56.2%
pass^3	22.5%	41.2%	+83.1%
pass^4	20.0%	40.0%	+100.0%

The pattern is clear: the more consistently an agent needs to perform, the bigger Kayba's impact. pass^k counts a task as solved only if all k independent trials succeed, which is why scores fall as k grows. At pass^4 (all 4 trials must succeed), Kayba doubles the baseline from 20.0% to 40.0%.
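The last column of the table is a relative change, not a difference in percentage points. The sketch below recomputes it from the reported scores and also shows the estimator usually used for an "all k trials succeed" metric (written pass^k in the τ-bench paper). Whether Kayba's evaluation computed the metric in exactly this way is an assumption; the function names are illustrative only.

```python
from math import comb


def pass_hat_k(num_trials: int, num_successes: int, k: int) -> float:
    """Estimate the chance that k independent trials of one task all succeed,
    given num_successes observed successes out of num_trials runs."""
    if not 1 <= k <= num_trials:
        raise ValueError("k must be between 1 and the number of trials")
    # math.comb returns 0 when k > num_successes, so the estimate drops to 0.
    return comb(num_successes, k) / comb(num_trials, k)


def relative_improvement(baseline: float, improved: float) -> float:
    """Relative change in percent, e.g. 20.0 -> 40.0 gives 100.0."""
    return (improved - baseline) / baseline * 100.0


# Scores copied from the table above: k -> (baseline %, with Kayba %).
REPORTED = {1: (41.2, 52.5), 2: (28.3, 44.2), 3: (22.5, 41.2), 4: (20.0, 40.0)}

if __name__ == "__main__":
    for k, (baseline, with_kayba) in REPORTED.items():
        print(f"k={k}: {relative_improvement(baseline, with_kayba):+.1f}%")
```

A task that succeeds in 3 of 4 trials contributes 0.75 at k=1 but 0.0 at k=4, which is why stricter consistency requirements expose flaky behaviour that a single-trial score hides.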

Browser Agent Field Results

In real-world browser agent deployments:

Success rate: 30% → 100% (from failing most tasks to completing all of them)
Efficiency: 82% fewer steps to complete tasks
Cost: 65% lower token costs

These gains come from the Skillbook accumulating site-specific navigation strategies, form-filling patterns, and failure recovery procedures that the agent reuses across sessions.

How Kayba Improves Browser Agents

Kayba adds a learning layer on top of your browser agent. It analyzes execution traces — the full record of what the agent saw, what it clicked, what happened next — and extracts reusable skills that prevent the same failures from recurring.

The Pipeline

1. Trace collection: Your browser agent logs its actions, including screenshots, DOM states, clicks, form inputs, navigation events, and success/failure outcomes.
2. Recursive analysis: The Recursive Reflector programmatically analyzes these traces via REPL-based code execution. For browser agents, this means examining sequences of actions, identifying where decisions went wrong, and understanding the relationship between page state and agent actions.
3. Skill extraction: Navigation patterns and failure avoidance strategies are extracted as skills:
   - "On checkout pages, always wait for the payment form to fully render before interacting; look for the card number field as the ready signal"
   - "When a cookie consent dialog appears, dismiss it before attempting any other interaction"
   - "For date pickers on this site, click the calendar icon rather than typing directly; the text input rejects programmatic input"
   - "After form submission, verify the confirmation page loaded before marking the task as complete"
4. Skillbook growth: Skills accumulate across browser sessions. The Skillbook becomes a growing knowledge base of navigation strategies, site-specific quirks, and failure recovery patterns.
5. Prompt generation: The updated system prompt gives the browser agent pre-loaded knowledge about how to handle common scenarios, site-specific patterns, and recovery strategies.
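To make the last two stages concrete, here is a minimal sketch of how accumulated skills could be folded into a system prompt. It is not Kayba's implementation or API: the `Skill` and `Skillbook` classes, their fields, and the prompt layout are all hypothetical and exist only to illustrate the idea.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Skill:
    """Hypothetical skill record: where it applies and what to do."""
    scope: str   # e.g. a site or page type
    advice: str  # the strategy text extracted from traces


@dataclass
class Skillbook:
    """Hypothetical container that grows across sessions."""
    skills: list[Skill] = field(default_factory=list)

    def add(self, skill: Skill) -> bool:
        """Store a skill unless an identical one is already present."""
        if skill in self.skills:
            return False
        self.skills.append(skill)
        return True

    def render_prompt(self, base_prompt: str) -> str:
        """Append the stored skills to the base system prompt."""
        if not self.skills:
            return base_prompt
        lines = [f"- [{s.scope}] {s.advice}" for s in self.skills]
        return base_prompt + "\n\nLearned strategies:\n" + "\n".join(lines)


if __name__ == "__main__":
    book = Skillbook()
    book.add(Skill("checkout", "Wait for the card number field before using the payment form."))
    book.add(Skill("any page", "Dismiss cookie consent dialogs before any other interaction."))
    print(book.render_prompt("You are a browser agent."))
```

Keeping a scope on each skill is one possible design choice: it leaves room to filter the prompt down to the skills relevant to the site being visited instead of sending every skill on every run.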

What Changes in Practice

Before Kayba: Your browser agent encounters a cookie consent popup on a new site. It doesn't know what to do, clicks the wrong element, and the rest of the task fails. Next time, same failure.

After Kayba: The Skillbook contains a skill about handling consent dialogs — learned from previous traces where this was a failure mode. The agent's system prompt now includes instructions for detecting and dismissing common popup patterns before proceeding.

Browser Agent Failure Categories

Kayba's analysis has identified common failure categories across browser agents:

Navigation Failures

The agent can't find the right element to click, navigates to the wrong page, or gets stuck in loops. These are often caused by dynamic UIs, A/B tests, or layout changes.

Kayba's approach: Extract site-specific navigation skills: "Use the search bar to find products rather than navigating category menus, which change frequently."

State Management Failures

The agent loses track of where it is in a multi-step process, or doesn't recognize whether a page has changed state (e.g., it clicked "add to cart" but the cart count did not update).

Kayba's approach: Build verification skills: "After adding an item to cart, verify the cart count increased before proceeding to checkout."
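A verification skill like this one amounts to "act, then poll until the expected change is visible or a timeout expires". The sketch below shows that pattern in plain Python. The `read_cart_count` and `click_add_to_cart` callables are hypothetical stand-ins for whatever your automation library provides; nothing here is part of Kayba.

```python
import time
from typing import Callable


def wait_until(condition: Callable[[], bool],
               timeout_s: float = 5.0,
               interval_s: float = 0.1) -> bool:
    """Poll condition() until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


def add_to_cart_verified(read_cart_count: Callable[[], int],
                         click_add_to_cart: Callable[[], None],
                         timeout_s: float = 5.0) -> bool:
    """Click 'add to cart' and confirm the cart count actually increased."""
    before = read_cart_count()
    click_add_to_cart()
    return wait_until(lambda: read_cart_count() > before, timeout_s=timeout_s)
```

If the function returns False, the agent knows the click did not take effect and can retry or escalate instead of proceeding to checkout with an empty cart.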

Input Handling Failures

Form fields are filled incorrectly — wrong format, wrong field, or missed validation. Date pickers, dropdowns, and auto-complete fields are especially problematic.

Kayba's approach: Site-specific input skills: "This site's phone number field requires country code prefix. Use +1 format for US numbers."
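An input skill of this kind can often be backed by a small normalization helper that the agent calls before typing. The following sketch converts common US phone formats to a `+1` prefixed string. The exact format a given site accepts (spaces, dashes, parentheses) is an assumption here, and the function name is illustrative.

```python
import re


def to_us_plus_one(raw: str) -> str:
    """Normalize a US phone number to '+1' followed by 10 digits.

    Accepts 10 digits, or 11 digits starting with the country code 1,
    with any punctuation or spacing. Raises ValueError otherwise.
    """
    digits = re.sub(r"\D", "", raw)  # drop everything except digits
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]          # strip an existing country code
    if len(digits) != 10:
        raise ValueError(f"not a US phone number: {raw!r}")
    return "+1" + digits
```

For example, `to_us_plus_one("(415) 555-0132")` and `to_us_plus_one("+1 415 555 0132")` both return `"+14155550132"`.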

Recovery Failures

When something goes wrong — an error message, a timeout, an unexpected redirect — the agent doesn't know how to recover and either fails or makes the situation worse.

Kayba's approach: Recovery skills: "If the payment page shows a 'session expired' error, navigate back to the cart page and restart checkout rather than refreshing."
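One way to picture a recovery skill is as a rule that maps an observed error to an ordered list of actions. The sketch below encodes the "session expired" example that way. The rule structure, the action strings, and the `/cart` path are hypothetical, chosen for illustration rather than taken from Kayba.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryRule:
    error_marker: str          # lowercase text to look for on the page
    actions: tuple[str, ...]   # ordered recovery steps for the agent


# Hypothetical rule set derived from the skill quoted above.
RULES = (
    RecoveryRule("session expired", ("navigate:/cart", "restart_checkout")),
)
DEFAULT_ACTIONS = ("report_failure",)


def plan_recovery(page_text: str, rules: tuple[RecoveryRule, ...] = RULES) -> tuple[str, ...]:
    """Return the actions of the first rule whose marker appears in the page text
    (case-insensitive), or DEFAULT_ACTIONS if no rule matches."""
    lowered = page_text.lower()
    for rule in rules:
        if rule.error_marker in lowered:
            return rule.actions
    return DEFAULT_ACTIONS
```

Returning an explicit default keeps the agent from improvising (for example, refreshing the page) when it meets an error that no skill covers yet.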

Integration

Kayba works with any browser agent framework:

- browser-use: Direct integration, the framework Kayba was tested with on τ2-bench
- Playwright-based agents: Any agent using Playwright for browser automation
- Puppeteer-based agents: Same approach, different automation library
- Computer-use APIs: Claude's computer use, GPT-4o with vision, or custom implementations
- Custom browser agents: If your agent produces action logs, Kayba can analyze them

pip install ace-framework

Kayba analyzes traces offline. It doesn't intercept your browser agent's runtime — it learns from completed sessions and improves the system prompt for future runs.
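Because the analysis happens offline, the main integration work on the agent side is writing a trace to disk. The source does not specify the trace format Kayba expects, so the sketch below is only an assumption: it appends one JSON object per step to a JSON Lines file, with hypothetical field names, using the Python standard library.

```python
import json
import time
from pathlib import Path


def log_step(trace_path: str, action: str, target: str,
             outcome: str, url: str = "") -> dict:
    """Append one agent step to a JSON Lines trace file and return the record."""
    record = {
        "ts": time.time(),   # wall-clock timestamp of the step
        "action": action,    # e.g. "click", "type", "navigate"
        "target": target,    # e.g. a selector or element description
        "outcome": outcome,  # e.g. "ok", "timeout", "element_not_found"
        "url": url,          # page the step ran on
    }
    with Path(trace_path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record


def load_trace(trace_path: str) -> list[dict]:
    """Read every step of a completed session back for offline analysis."""
    with Path(trace_path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

An append-only file per session keeps logging cheap at runtime and leaves the completed session intact for later analysis.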

Why This Vertical Matters

Browser agents are where AI agents meet the messiest, most unpredictable environments — real websites built by humans, full of edge cases, dark patterns, and inconsistencies. No amount of upfront prompt engineering can anticipate every site's quirks.

This is exactly the scenario where learning from experience matters most. A browser agent that has navigated a particular site ten times and built a Skillbook of site-specific strategies will dramatically outperform one running on a generic system prompt.

Kayba's results demonstrate this: from 30% to 100% success rate on real browser tasks, up to 2x consistency improvement on τ2-bench, with 82% fewer steps and 65% lower token costs. The learning layer turns inconsistent browser agents into reliable, efficient ones.

Getting Started

Documentation — Setup guides and API reference
GitHub — Source code, examples, and tau2-bench reproduction
Dashboard — Hosted dashboard for visual Skillbook management