Agent side-effect regression

Freeze. Diff. Gate.

Side-effect regression testing for AI agents.


Trace captured

FAILED 3/6 PASSED 6/6

Trace timeline (11 sec replay)

Tool calls

search_orders   200 OK
get_customer    200 OK
create_refund   500 ERR
send_email      200 OK
update_crm      500 ERR
log_event       200 OK

Side-effect diff

Resource      Expected           Actual                  Diff
refunds       -                  id: rf_123              + CREATED
emails        -                  to: customer@email.com  + SENT
orders        status: open       status: refunded        CHANGED
crm.contacts  last_contacted: -  last_contacted: now     CHANGED
audit_logs    -                  +1 entry                + CREATED
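
A side-effect diff like the one above can be sketched as a comparison between the state a run was expected to leave behind and the state it actually left. This is a minimal illustration, not AgentReplay's implementation: the `ResourceState`, `DiffRow`, and `diffSideEffects` names are hypothetical, and each resource is reduced to a single string for brevity.

```typescript
// Hypothetical shapes for illustration; not AgentReplay's actual types.
type ResourceState = Record<string, string | undefined>;
type DiffStatus = 'CREATED' | 'REMOVED' | 'CHANGED' | 'UNCHANGED';

interface DiffRow {
  resource: string;
  expected: string | undefined;
  actual: string | undefined;
  status: DiffStatus;
}

// Compare the expected state with the state observed after the run.
function diffSideEffects(expected: ResourceState, actual: ResourceState): DiffRow[] {
  // Expected resources first, then any resource that only exists in the actual state.
  const resources = Object.keys(expected).concat(
    Object.keys(actual).filter(resource => !(resource in expected)),
  );

  return resources.map(resource => {
    const before = expected[resource];
    const after = actual[resource];
    let status: DiffStatus = 'UNCHANGED';
    if (before === undefined && after !== undefined) status = 'CREATED';
    else if (before !== undefined && after === undefined) status = 'REMOVED';
    else if (before !== after) status = 'CHANGED';
    return { resource, expected: before, actual: after, status };
  });
}

// Example: an order changes status and a refund appears that nobody expected.
const diffRows = diffSideEffects(
  { orders: 'status: open' },
  { orders: 'status: refunded', refunds: 'id: rf_123' },
);
```

Any row whose status is not `UNCHANGED` is a side effect worth reviewing before the candidate ships.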

The problem

Production failures should not become folklore.

Freeze the exact run, preserve the state boundary, and turn the incident into a gate.

unreproducible

Wrong refund

Tool: stripe.refund
Target: in_original
State: changed

side effect

Email sent

Tool: gmail.send
Approval: missing
Action: delivered

missed approval

CRM drift

System: HubSpot
Order: too early
Replay: missing


A release gate built from real incidents.

Capture the production run, compare the candidate, and block regressions before release.

1. Freeze: capture the run
   agentreplay freeze

2. Diff: compare side effects
   agentreplay diff

3. Gate: block regressions
   agentreplay gate
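
The three steps can be sketched in memory as follows. This is an illustration under stated assumptions rather than the CLI's actual behavior: the `SideEffect` and `FrozenRun` shapes and the `freezeRun`, `diffRuns`, and `gate` functions are hypothetical, and the gate rule (pass only when none of the incident's flagged side effects reappear in the candidate) is one plausible policy.

```typescript
// Hypothetical shapes for illustration; not AgentReplay's actual types.
interface SideEffect { resource: string; action: string; }
interface FrozenRun { readonly id: string; readonly sideEffects: readonly SideEffect[]; }

// Freeze: take a deep, immutable snapshot so later code cannot alter the evidence.
function freezeRun(id: string, sideEffects: SideEffect[]): FrozenRun {
  const copy = sideEffects.map(effect => Object.freeze({ ...effect }));
  return Object.freeze({ id, sideEffects: Object.freeze(copy) });
}

const effectKey = (effect: SideEffect): string => `${effect.resource}:${effect.action}`;

// Diff: split the incident's side effects into those the candidate removed
// and those it still produces.
function diffRuns(incident: FrozenRun, candidate: FrozenRun) {
  const candidateKeys = candidate.sideEffects.map(effectKey);
  const stillPresent = incident.sideEffects.filter(e => candidateKeys.indexOf(effectKey(e)) >= 0);
  const removed = incident.sideEffects.filter(e => candidateKeys.indexOf(effectKey(e)) < 0);
  return { removed, stillPresent };
}

// Gate: block the release if any flagged side effect is reproduced.
function gate(result: { stillPresent: SideEffect[] }): boolean {
  return result.stillPresent.length === 0;
}

// Example: the incident sent an email; the fix drafts it instead.
const incidentRun = freezeRun('bad-run', [{ resource: 'emails', action: 'sent' }]);
const fixedRun = freezeRun('candidate', [{ resource: 'emails', action: 'drafted' }]);
const unfixedRun = freezeRun('candidate-2', [{ resource: 'emails', action: 'sent' }]);

const fixedPasses = gate(diffRuns(incidentRun, fixedRun));
const unfixedPasses = gate(diffRuns(incidentRun, unfixedRun));
```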

Bad trace

FAILED 3/6

No PII in logs
Email sent
Refund within 24h

Fixed trace

PASSED 6/6

No PII in logs
Email drafted
Refund within 24h
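
Checks like these can be thought of as predicates over the recorded tool calls. The sketch below is an assumption about how such checks might be expressed, not AgentReplay's format: `ToolCall`, `Check`, and `gateSummary` are hypothetical names, only two of the six checks are shown, and the `passed/total` reading of the summary is a guess.

```typescript
// Hypothetical shapes for illustration; not AgentReplay's actual types.
interface ToolCall { tool: string; input: string; }
interface Check { name: string; passes(calls: ToolCall[]): boolean; }

// Rough email matcher, used here as a stand-in for real PII detection.
const EMAIL_PATTERN = /[\w.+-]+@[\w-]+\.[\w.]+/;

const sketchChecks: Check[] = [
  {
    name: 'No PII in logs',
    passes: calls =>
      calls.filter(c => c.tool === 'log_event').every(c => !EMAIL_PATTERN.test(c.input)),
  },
  {
    name: 'Email drafted, not sent',
    passes: calls => !calls.some(c => c.tool === 'gmail.send'),
  },
];

// Summarize a trace in the same "VERDICT passed/total" style as the page.
function gateSummary(calls: ToolCall[]): string {
  const passed = sketchChecks.filter(check => check.passes(calls)).length;
  const verdict = passed === sketchChecks.length ? 'PASSED' : 'FAILED';
  return `${verdict} ${passed}/${sketchChecks.length}`;
}

const badSummary = gateSummary([
  { tool: 'gmail.send', input: 'Your refund is on its way' },
  { tool: 'log_event', input: 'refunded customer@email.com' },
]);

const fixedSummary = gateSummary([
  { tool: 'gmail.draft', input: 'Your refund is on its way' },
  { tool: 'log_event', input: 'refunded customer [redacted]' },
]);
```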

Tool diff                 Side-effect diff           Status
gmail.send → gmail.draft  email side effect removed  changed
stripe.refund             refund target changed      verified
hubspot.update            order preserved            pass


agentreplay diff bad_trace.json fixed_trace.json
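
A tool-level diff between two traces can be approximated by walking both call sequences in order and reporting every position where the tool differs. The trace file format is not described here, so this sketch assumes each trace has already been parsed into an array of steps; `TraceStep`, `ToolChange`, and `diffTools` are hypothetical names.

```typescript
// Hypothetical shapes for illustration; the real trace format may differ.
interface TraceStep { tool: string; }
interface ToolChange { index: number; before: string; after: string; }

// Compare two call sequences position by position.
function diffTools(bad: TraceStep[], fixed: TraceStep[]): ToolChange[] {
  const changes: ToolChange[] = [];
  const steps = Math.max(bad.length, fixed.length);
  for (let i = 0; i < steps; i++) {
    // A trace that ends early is reported as '(none)' at the missing positions.
    const before = bad[i]?.tool ?? '(none)';
    const after = fixed[i]?.tool ?? '(none)';
    if (before !== after) changes.push({ index: i, before, after });
  }
  return changes;
}

const toolChanges = diffTools(
  [{ tool: 'stripe.refund' }, { tool: 'gmail.send' }, { tool: 'hubspot.update' }],
  [{ tool: 'stripe.refund' }, { tool: 'gmail.draft' }, { tool: 'hubspot.update' }],
);
```

Positional comparison is the simplest choice; a sequence-alignment diff would cope better with inserted or removed calls.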


The incident becomes the test.

Replay the bad run, compare the candidate, and block the regression before release.


3/6 failed checks
2 changes detected
6/6 passing checks


Wrap the tools your agents already use.

Record inputs, responses, approvals, and side effects without replacing your agent framework.

AgentReplay Harness: Stripe, Gmail, HubSpot, GitHub, Postgres, Slack, OpenAPI


harness.wrapTool('stripe.refund', refundCustomer)
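
To show what wrapping a tool could involve, here is a minimal stand-in recorder. It is not the `agentreplay` harness: `SketchHarness` and `RecordedCall` are hypothetical, the wrapper is synchronous for brevity (real tools would usually be async), and `refundCustomerStub` is a fake tool that makes no network call.

```typescript
// Hypothetical recorder for illustration; not the agentreplay SDK.
interface RecordedCall { tool: string; input: unknown[]; output?: unknown; error?: string; }

class SketchHarness {
  readonly calls: RecordedCall[] = [];

  // Return a function with the same signature that records every call.
  wrapTool<A extends unknown[], R>(tool: string, fn: (...args: A) => R): (...args: A) => R {
    return (...args: A): R => {
      try {
        const output = fn(...args);
        this.calls.push({ tool, input: args, output });
        return output;
      } catch (err) {
        // Failures are recorded too, then rethrown so the agent still sees them.
        this.calls.push({ tool, input: args, error: err instanceof Error ? err.message : String(err) });
        throw err;
      }
    };
  }
}

// Fake tool standing in for a real refund call.
function refundCustomerStub(orderId: string, amountCents: number): { refundId: string } {
  if (amountCents <= 0) throw new Error('amount must be positive');
  return { refundId: `rf_${orderId}` };
}

const sketchHarness = new SketchHarness();
const recordedRefund = sketchHarness.wrapTool('stripe.refund', refundCustomerStub);

const refundResult = recordedRefund('123', 500);
```

Because the wrapper keeps the original signature, the agent calls `recordedRefund` exactly as it would call the unwrapped tool.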


Built for agents that act.

Billing, support, RevOps, and platform teams need proof before the next release.


Billing Ops

refund gate

invoice target   PASS
amount check     PASS
email action     FIXED


Support Automation

draft approval

Approve Reject


AI Platform

CI release gate


Agent Agencies

client proof bundle

Client     Status     Tests
Northwind  delivered  21/24
Atlas      reviewed   18/18

SDK CLI READY

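import { createHarness } from 'agentreplay'

// Create the harness; the project key is read from the environment.
const harness = createHarness({
  projectKey: process.env.AGENTREPLAY_PROJECT_KEY,
  // Categories to redact from captured traces.
  redact: ['pii', 'raw_keys']
})

// refundCustomer is your existing tool function; have the agent call the wrapped version.
const refund = harness.wrapTool(
  'stripe.refund',
  refundCustomer
)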


agentreplay gate traces/billing-bad-run.json
PASSED 6/6

agentreplay diff bad.json fixed.json
DIFF 3

Developer experience

Drop it around the tools. Keep your agent.

AgentReplay works beside your framework: OpenAI Agents, LangGraph, custom MCP servers, or your own loop.

No framework rewrite
Deterministic gates
Redaction by default
CI-ready
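
As an illustration of redaction applied before a trace is stored, the sketch below walks a JSON value, masks email-like strings, and blanks the values of secret-looking keys. How AgentReplay interprets the `pii` and `raw_keys` categories is not specified here, so the `EMAIL` pattern, the `SECRET_KEYS` list, and the `redactValue` function are assumptions made for the example.

```typescript
// Hypothetical redaction pass for illustration; not AgentReplay's implementation.
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

// Rough stand-ins for the 'pii' and 'raw_keys' categories.
const EMAIL = /[\w.+-]+@[\w-]+\.[\w.]+/g;
const SECRET_KEYS = ['api_key', 'authorization', 'password'];

function redactValue(value: Json): Json {
  if (typeof value === 'string') return value.replace(EMAIL, '[REDACTED_EMAIL]');
  if (Array.isArray(value)) return value.map(item => redactValue(item));
  if (value !== null && typeof value === 'object') {
    const out: { [key: string]: Json } = {};
    for (const key of Object.keys(value)) {
      // Secret-looking keys lose their value entirely; everything else is scanned recursively.
      out[key] = SECRET_KEYS.indexOf(key.toLowerCase()) >= 0 ? '[REDACTED]' : redactValue(value[key]);
    }
    return out;
  }
  return value; // numbers, booleans, and null pass through unchanged
}

const redactedCall = redactValue({
  to: 'customer@email.com',
  api_key: 'sk_live_example',
  amount: 500,
  notes: ['cc ops@example.com'],
});
```

Redacting at capture time means the stored trace never holds the raw values, which keeps later diffs and gates free of PII.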


✓ PASSED 6/6

Ship agents with evidence.

Every side effect frozen. Every fix compared. Every release gated.
