VerticalAI docs

Evals

Graded, repeatable tests — scripted conversations scored by an LLM judge — so you can trust a change before it ships.

The test panel lets you talk to the agent yourself. Evals let you do that repeatably and at scale: each eval is a scripted conversation that gets played against the agent and graded by a language-model judge. They are how you move from "it worked when I tried it" to "it works every call".

How an eval works

When a scenario runs, its scripted conversation is played against your agent and the resulting call is handed to an LLM judge. The judge decides whether the agent did what the scenario required: it handled the caller's intent, called the right tools, and reached a satisfactory end.
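The flow above can be sketched in a few lines of Python. This is an illustration only, not the platform's API: `Scenario`, `Transcript`, `run_eval`, `stub_agent`, and `stub_judge` are all hypothetical names, and the stub judge is a plain rule standing in for a real language-model call.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Scenario:
    name: str
    caller_turns: list[str]                  # scripted caller lines, played in order
    required_tools: set[str] = field(default_factory=set)
    rubric: str = ""                         # plain-language pass criteria for the judge


@dataclass
class Transcript:
    turns: list[tuple[str, str]] = field(default_factory=list)  # (speaker, text)
    tools_called: list[str] = field(default_factory=list)


# Hypothetical signatures, assumed for this sketch only.
Agent = Callable[[str], tuple[str, list[str]]]   # caller text -> (reply, tools called)
Judge = Callable[[Scenario, Transcript], bool]   # True means the scenario passed


def run_eval(scenario: Scenario, agent: Agent, judge: Judge) -> bool:
    """Play the scripted caller turns against the agent, then ask the judge."""
    transcript = Transcript()
    for line in scenario.caller_turns:
        reply, tools = agent(line)
        transcript.turns.append(("caller", line))
        transcript.turns.append(("agent", reply))
        transcript.tools_called.extend(tools)
    return judge(scenario, transcript)


def stub_agent(caller_text: str) -> tuple[str, list[str]]:
    # Toy agent: calls a booking tool whenever the caller mentions booking.
    if "book" in caller_text.lower():
        return "Booked. Anything else?", ["create_booking"]
    return "Sure, how can I help?", []


def stub_judge(scenario: Scenario, transcript: Transcript) -> bool:
    # Stand-in for an LLM judge: only checks that every required tool was called.
    return scenario.required_tools <= set(transcript.tools_called)


scenario = Scenario(
    name="simple booking",
    caller_turns=["Hi there.", "I'd like to book a table for two."],
    required_tools={"create_booking"},
    rubric="Caller ends the call with a confirmed booking.",
)
passed = run_eval(scenario, stub_agent, stub_judge)
print(scenario.name, "passed" if passed else "failed")
```

A real judge would also read the transcript against the rubric to assess intent handling and how the call ended. The tool check here is only the part that is easy to show without a model.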

Why you need them

A single green run proves nothing. Voice agents are non-deterministic; the same prompt can pass once and fail the next time. Evals give you:

  • Repeatability — run the same scenario again after a change and compare.
  • Breadth — cover the awkward cases (caller changes their mind, gives a bad detail, asks something out of scope) that you would not think to retry by hand every time.
  • Confidence — trust a result that holds across several runs, never a single pass.

Treat evals as the gate: write a failing scenario for the behaviour you want before you change the prompt or a tool, then make it pass.
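One way to act on "several runs, never a single pass" is to repeat a scenario and gate on its pass rate. The sketch below assumes a hypothetical `run_once` callable that plays the scenario once and returns whether the judge passed it. The threshold of 0.9 is an arbitrary example, not a documented default.

```python
from typing import Callable


def pass_rate(run_once: Callable[[], bool], runs: int = 10) -> float:
    """Run the same scenario `runs` times and return the fraction that passed."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    passes = sum(1 for _ in range(runs) if run_once())
    return passes / runs


def gate(rate: float, threshold: float = 0.9) -> bool:
    """Allow a change to ship only if the pass rate meets the threshold."""
    return rate >= threshold


# Simulated flaky scenario: a fixed sequence of judge verdicts, one per run.
verdicts = iter([True, True, False, True, True])
rate = pass_rate(lambda: next(verdicts), runs=5)

print(f"pass rate: {rate:.0%}")   # pass rate: 80%
print("ship" if gate(rate) else "hold")
```

With one failure in five runs the gate holds the change, even though four of the individual runs were green.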

Reading the results

Beyond pass/fail, the eval dashboards surface the voice-quality metrics that decide whether a call feels right:

  • Latency and time-to-first-byte — how quickly the agent starts responding.
  • Interruptions and double-speak — the agent and caller talking over each other.
  • Filler rate — how often the agent uses filler phrases to mask a wait.

These turn "the call felt off" into a number you can chase.
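To make the metrics concrete, here is a rough sketch of how figures like these could be derived from a timestamped transcript. The definitions are assumptions made for illustration (response gap as a proxy for latency, overlap between adjacent turns as double-speak, share of agent turns containing a filler phrase as filler rate). The dashboards may compute them differently, and the `Turn` type and `FILLERS` list are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    speaker: str   # "caller" or "agent"
    start: float   # seconds from call start
    end: float
    text: str


FILLERS = ("one moment", "let me check", "bear with me")  # illustrative phrases only


def response_gaps(turns: list[Turn]) -> list[float]:
    """Seconds between the end of a caller turn and the start of the agent's reply."""
    return [
        max(0.0, cur.start - prev.end)
        for prev, cur in zip(turns, turns[1:])
        if prev.speaker == "caller" and cur.speaker == "agent"
    ]


def overlap_seconds(turns: list[Turn]) -> float:
    """Total time adjacent turns from different speakers overlap (turns sorted by start)."""
    total = 0.0
    for prev, cur in zip(turns, turns[1:]):
        if prev.speaker != cur.speaker:
            total += max(0.0, min(prev.end, cur.end) - cur.start)
    return total


def filler_rate(turns: list[Turn]) -> float:
    """Fraction of agent turns that contain at least one filler phrase."""
    agent_turns = [t for t in turns if t.speaker == "agent"]
    if not agent_turns:
        return 0.0
    with_filler = sum(1 for t in agent_turns if any(f in t.text.lower() for f in FILLERS))
    return with_filler / len(agent_turns)


call = [
    Turn("caller", 0.0, 2.0, "I'd like to book a table."),
    Turn("agent", 2.5, 5.0, "One moment, let me check availability."),
    Turn("caller", 4.5, 6.0, "For Friday, actually."),
    Turn("agent", 7.0, 9.0, "Friday at seven is free."),
]

print(response_gaps(call))    # [0.5, 1.0]
print(overlap_seconds(call))  # 0.5
print(filler_rate(call))      # 0.5
```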

Billing

Eval scenarios are billed under the same rule as live calls: $0.65 per scenario that resolves and passes the judge. Scenarios that fail or never resolve are free. See Billing and credits for the full table.
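As a worked example of the rule, the sketch below totals the cost of a batch of scenario outcomes. Only the $0.65 price comes from this page. The outcome labels (`"passed"`, `"failed"`, `"unresolved"`) and the `eval_cost` helper are hypothetical.

```python
from decimal import Decimal

PRICE_PER_PASSED_SCENARIO = Decimal("0.65")  # price quoted on this page


def eval_cost(outcomes: list[str]) -> Decimal:
    """Charge only for scenarios that resolved and passed; everything else is free."""
    billable = sum(1 for outcome in outcomes if outcome == "passed")
    return billable * PRICE_PER_PASSED_SCENARIO


outcomes = ["passed", "passed", "failed", "unresolved", "passed"]
print(eval_cost(outcomes))  # 1.95
```

Three of the five scenarios passed, so the batch costs 3 × $0.65 = $1.95. `Decimal` is used instead of `float` so that currency totals stay exact.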
