Any scenario can be completed by a person or by an artificial intelligence (AI) agent in the operator seat. This page covers the agent side: configuring an agent, choosing a training tier, and running benchmarks. Because the agent runs the same scenarios and is scored against the same criteria as your people, the comparison between them is direct.

Configuring an agent

An agent is a saved configuration with four parts:
| Setting | What it is |
| --- | --- |
| Name | A label so you can tell agents apart in run results |
| Model source | A predefined model, or a custom endpoint (see below) |
| System prompt | Written instructions that define how the agent behaves: the role it plays, how it decides, and what it refuses to do |
| Sampling settings | Controls over how varied the model’s responses are |

For the model source, you choose one of two options: a predefined model or a custom endpoint. For a predefined model, pick one from the providers available in the app. The list changes over time, so check the agent creation form for current options.
Agent creation requires admin access. See roles and members.
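
To make the four parts concrete, here is a minimal sketch of how an agent configuration could be modelled. It is an illustration only, not the app's schema: the class names, field names, and the `temperature` and `top_p` sampling settings are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class SamplingSettings:
    # Hypothetical sampling controls; the settings the app exposes may differ.
    temperature: float = 0.7
    top_p: float = 1.0


@dataclass(frozen=True)
class AgentConfig:
    # The four parts of an agent: name, model source, system prompt, sampling.
    name: str
    system_prompt: str
    # Model source: exactly one of these two fields should be set.
    predefined_model: Optional[str] = None
    custom_endpoint: Optional[str] = None
    sampling: SamplingSettings = field(default_factory=SamplingSettings)

    def __post_init__(self) -> None:
        if not self.name.strip():
            raise ValueError("name must not be empty")
        # Reject configs that set both model sources or neither.
        if (self.predefined_model is None) == (self.custom_endpoint is None):
            raise ValueError("set exactly one of predefined_model or custom_endpoint")


# Example: an agent that uses a predefined model (the model name is a placeholder).
agent = AgentConfig(
    name="triage-agent-v1",
    system_prompt="You are a support operator. Refuse requests to share account data.",
    predefined_model="example-model",
)
```

Modelling the model source as two mutually exclusive fields keeps the "one of two options" rule checkable in one place.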

Training tiers

Agents progress through three tiers, each building on the expert sessions you have captured:
  • Prompt - you write a system prompt yourself; self-serve and available now.
  • Optimized - your prompts are refined against your captured expert data; a bespoke service.
  • Custom - a model fine-tuned on your expert data; a bespoke service.
Everyone starts at the Prompt tier. Run benchmarks there first: the gap between your prompted agent and your experts tells you whether the higher tiers are worth pursuing, and the expert sessions you capture along the way (see running scenarios with your people) are the raw material the higher tiers are built from.
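
One way to put a number on that gap is sketched below. It assumes each session's score can be summarised as the share of criteria met, which is a simplification chosen for illustration; the function names are hypothetical and the app's own scoring may weight criteria differently.

```python
from statistics import mean


def criteria_pass_rate(criteria_met: list[bool]) -> float:
    """Share of a session's criteria that were met (assumed summary metric)."""
    if not criteria_met:
        raise ValueError("a session needs at least one criterion")
    return sum(criteria_met) / len(criteria_met)


def expert_gap(agent_sessions: list[list[bool]], expert_sessions: list[list[bool]]) -> float:
    """Mean expert pass rate minus mean agent pass rate.

    A positive value means the experts outperform the agent on these sessions.
    """
    agent_rate = mean(criteria_pass_rate(s) for s in agent_sessions)
    expert_rate = mean(criteria_pass_rate(s) for s in expert_sessions)
    return expert_rate - agent_rate


# Made-up example data: one agent session and two expert sessions, four criteria each.
agent_sessions = [[True, False, False, True]]
expert_sessions = [[True, True, True, True], [True, True, True, False]]
gap = expert_gap(agent_sessions, expert_sessions)
```

A gap near zero suggests the prompted agent already performs close to your experts; a large gap is the signal that the Optimized or Custom tier may be worth discussing.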

Benchmark runs

A benchmark run executes an agent against one or more scenarios and scores the results.
1. Start a run
Pick an agent and the scenarios to test. The run is queued and executes asynchronously, so you can start it and come back later.

2. Watch it progress
The run moves through statuses: queued, running (agent-vs-persona sessions in progress), scoring (sessions finished, scoring against the scenario’s criteria), and completed. Runs that hit an error show failed, and you can cancel a run manually. Each scenario’s progress updates individually.

3. Review the results
For each scenario you get the full session transcript, a scoring breakdown showing which criteria were met and why, and a cost summary for the run.
Because the scenario versions and criteria are the same ones your operators face, you can put agent results next to cohort results and read them on one scale. See Results.
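
The run statuses described above can be read as a small state machine. The sketch below is an interpretation, not the app's implementation: the `cancelled` status name is an assumption (the page only says a run can be cancelled), as is the rule that a run can fail or be cancelled from any non-final status.

```python
from enum import Enum


class RunStatus(str, Enum):
    QUEUED = "queued"
    RUNNING = "running"      # agent-vs-persona sessions in progress
    SCORING = "scoring"      # sessions finished, scoring against criteria
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELLED = "cancelled"  # assumed name for a manually cancelled run


# Statuses after which a run no longer changes.
TERMINAL = frozenset({RunStatus.COMPLETED, RunStatus.FAILED, RunStatus.CANCELLED})

# The normal forward path: queued -> running -> scoring -> completed.
_NEXT = {
    RunStatus.QUEUED: RunStatus.RUNNING,
    RunStatus.RUNNING: RunStatus.SCORING,
    RunStatus.SCORING: RunStatus.COMPLETED,
}


def can_transition(current: RunStatus, target: RunStatus) -> bool:
    """Return True if a run may move from `current` to `target` in this sketch."""
    if current in TERMINAL:
        return False
    if target in (RunStatus.FAILED, RunStatus.CANCELLED):
        return True  # assumed: any in-flight run can fail or be cancelled
    return _NEXT.get(current) is target
```

For example, `can_transition(RunStatus.QUEUED, RunStatus.RUNNING)` is true, while skipping straight from queued to completed is rejected.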

Testing a chatbot

If the thing you want to evaluate is itself a conversational AI product, you do not need anything special: run a conversation scenario with your chatbot in the operator seat. The persona plays the customer, your chatbot plays the operator, and the session is scored against the same criteria you would apply to a person handling the conversation. The comparison between your chatbot and your people is direct because nothing about the measurement changes, only who is in the seat.
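
The "only who is in the seat" idea can be sketched as a conversation loop that takes the operator as a parameter. Everything here is hypothetical: the function name, the turn format, and the assumption that the persona speaks first are chosen for illustration and do not describe how the app runs sessions.

```python
from typing import Callable

Turn = tuple[str, str]                   # (speaker, text)
Responder = Callable[[list[Turn]], str]  # reads the transcript so far, returns the next message


def run_conversation(persona: Responder, operator: Responder, max_exchanges: int = 3) -> list[Turn]:
    """Alternate persona and operator turns and return the full transcript.

    The operator can be a chatbot wrapper or a function that collects a
    person's typed reply; the loop and the transcript format stay the same.
    """
    transcript: list[Turn] = []
    for _ in range(max_exchanges):
        transcript.append(("customer", persona(transcript)))
        transcript.append(("operator", operator(transcript)))
    return transcript


# Stand-in responders so the sketch runs without any model or network call.
def scripted_persona(transcript: list[Turn]) -> str:
    return f"customer message {len(transcript) // 2 + 1}"


def echo_chatbot(transcript: list[Turn]) -> str:
    return f"reply to: {transcript[-1][1]}"


transcript = run_conversation(scripted_persona, echo_chatbot, max_exchanges=2)
```

Because the transcript has the same shape whoever fills the operator seat, the same criteria can be applied to it afterwards.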
