Conversation Scenarios

A conversation scenario takes a decision task and removes one convenience: the inputs. The facts still exist, but they are held by a simulated persona, and the operator has to elicit them, turn by turn, before there is anything to decide. That one change is what makes the shape worth building. In real work, the case file does not arrive complete. Someone has to ask the right follow-up question, notice the hesitation, and build enough trust for the client to mention the thing they were not going to mention. Conversation scenarios measure that layer, with the decision task still sitting at the core.

How the persona holds the facts

The scenario’s state is split by how it can be reached:
  • Known state is what the persona knows about themselves. Each item carries a reveal trigger: some facts are volunteered freely, some surface only under a direct question, and some are shared only after rapport is built. Two operators running the same scenario can walk away with different information because they asked differently.
  • Hidden state is what the persona genuinely cannot know: test results they have not seen, records they cannot access, inconsistencies only a professional would spot. It rewards the operator who investigates rather than accepts.
The persona itself is built from a reusable identity (who they are) and personality (how they communicate), and it reacts to the operator: a dismissive opening produces a more guarded persona, a well-handled moment opens them up. The conversation is generated fresh each session, not scripted.
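
The split between known and hidden state can be pictured as a small data model. The sketch below is illustrative only: the names `Reveal`, `KnownFact`, `HiddenFact`, and `can_surface`, and the numeric rapport threshold, are assumptions made for this example, not the product's actual schema.

```python
from dataclasses import dataclass
from enum import Enum


class Reveal(Enum):
    """Hypothetical reveal triggers, one per disclosure mode described above."""
    VOLUNTEERED = "volunteered"          # shared without prompting
    DIRECT_QUESTION = "direct_question"  # shared only when asked directly
    RAPPORT = "rapport"                  # shared only once trust is established


@dataclass(frozen=True)
class KnownFact:
    """Something the persona knows about themselves."""
    key: str
    text: str
    reveal: Reveal


@dataclass(frozen=True)
class HiddenFact:
    """Something the persona cannot know; only an investigative action surfaces it."""
    key: str
    text: str
    surfaced_by: str  # hypothetical action name, e.g. "request_bank_statements"


def can_surface(fact: KnownFact, asked_directly: bool, rapport: float,
                rapport_threshold: float = 0.6) -> bool:
    """Return True if the persona would share this fact at this point.

    The 0.0 to 1.0 rapport scale and the threshold are assumptions for the sketch.
    """
    if fact.reveal is Reveal.VOLUNTEERED:
        return True
    if fact.reveal is Reveal.DIRECT_QUESTION:
        return asked_directly
    # Rapport-gated facts: assumed to depend on trust, not on how the question is phrased.
    return rapport >= rapport_threshold
```

Under this model, two operators diverge simply because they pass different values for `asked_directly` and arrive at different levels of `rapport`.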

What you configure

| Component | Role in a conversation |
| --- | --- |
| Persona | Identity, personality, and how readily they disclose |
| State | Known state with reveal triggers; hidden state for what only investigation surfaces |
| Briefing | What the operator knows walking in, and who speaks first |
| Outputs | The decision and artifacts the conversation should produce, same as a decision task |
| Criteria | Success and failure metrics, plus scope boundaries (what the operator can do, must escalate, must never do) and the terminology they should use correctly |

Scoring covers both layers: did the operator surface what was there to surface, and was the resulting judgment right. A confident decision built on facts the operator never elicited scores differently from the same decision built on a complete picture.
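
One way to picture two-layer scoring is to compute elicitation coverage and decision correctness separately, then combine them. This is a rough sketch under stated assumptions: `SessionResult`, `score_session`, and the `well_founded` flag are hypothetical names, and the real scoring rules are not specified here.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class SessionResult:
    elicited: frozenset[str]  # keys of the facts the operator actually surfaced
    decision: str             # the decision the operator recorded


def score_session(result: SessionResult, available: frozenset[str],
                  correct_decision: str) -> dict:
    """Score the elicitation layer and the judgment layer separately."""
    if available:
        coverage = len(result.elicited & available) / len(available)
    else:
        coverage = 1.0  # nothing to surface counts as full coverage (assumption)
    decision_ok = result.decision == correct_decision
    return {
        "elicitation_coverage": coverage,
        "decision_correct": decision_ok,
        # Assumed rule: a right decision only counts as well founded on a complete picture.
        "well_founded": decision_ok and coverage == 1.0,
    }


facts = frozenset({"structure_change", "declined_elsewhere", "irregular_deposits"})
thorough = SessionResult(elicited=facts, decision="refer")
rushed = SessionResult(elicited=frozenset({"structure_change"}), decision="refer")

thorough_score = score_session(thorough, facts, "refer")
rushed_score = score_session(rushed, facts, "refer")
```

Both sessions reach the same decision, yet only the thorough one is marked `well_founded`, which is the distinction the paragraph above describes.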

Worked example

A lender wants to know whether loan officers uncover affordability risks during an application call.
  • Persona: a self-employed applicant, friendly and talkative, whose stated income is accurate but recent
  • Known state: changed business structure eight months ago (direct question); a second loan application declined elsewhere last month (rapport required); monthly figures (volunteered)
  • Hidden state: the bank statements, which the operator can request, show irregular deposits that do not match the stated monthly figure
  • Outputs: proceed, decline, or refer to a senior assessor, plus a written application summary
  • Failure metric: proceeds without asking how long the business has operated in its current form
A skilled officer finds the declined application. A rushed one gets a pleasant call and a clean-looking summary, and the score shows exactly which questions were never asked.
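
The lender example could be written down as plain data, with the failure metric expressed as a check over what the operator asked. The field names, the `business_tenure` topic label, and the `proceeds_without_tenure_question` helper are invented for this sketch and do not reflect an actual configuration format.

```python
# Hypothetical configuration for the lender scenario; all keys are illustrative.
scenario = {
    "persona": {
        "identity": "self-employed applicant",
        "personality": "friendly and talkative",
    },
    "known_state": [
        {"fact": "changed business structure eight months ago", "reveal": "direct_question"},
        {"fact": "second loan application declined elsewhere last month", "reveal": "rapport"},
        {"fact": "monthly figures", "reveal": "volunteered"},
    ],
    "hidden_state": [
        {
            "fact": "bank statements show irregular deposits that do not match the stated monthly figure",
            "surfaced_by": "request_bank_statements",
        },
    ],
    "outputs": {
        "decision": ["proceed", "decline", "refer_to_senior_assessor"],
        "artifacts": ["application_summary"],
    },
}


def proceeds_without_tenure_question(decision: str, topics_asked: set[str]) -> bool:
    """Failure metric: the operator proceeds without asking how long the
    business has operated in its current form."""
    return decision == "proceed" and "business_tenure" not in topics_asked


# A rushed call: pleasant, but the tenure question was never asked.
rushed_failed = proceeds_without_tenure_question("proceed", {"monthly_income"})
# A careful call: the same decision, but the question was asked first.
careful_failed = proceeds_without_tenure_question("proceed", {"monthly_income", "business_tenure"})
```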

Testing a chatbot

If the thing you want to evaluate is itself an AI, the shape does not change. A chatbot evaluation is a conversation scenario with your agent in the operator seat: the agent faces the same persona, the same withheld facts, and the same criteria as your people. That symmetry is the point. “Is the bot ready?” becomes a comparison you can read off a report, not an opinion.
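
The symmetry can be sketched as a single session loop that accepts anything able to produce the next operator message, whether a person or an agent. Everything here is hypothetical: the `Operator` protocol, `run_session`, the scripted stand-ins, and the keyword-matching persona stub exist only to show the shape, and a real persona would be generated rather than stubbed.

```python
from __future__ import annotations

from typing import Callable, Protocol


class Operator(Protocol):
    """Anything that can sit in the operator seat: a person's UI or an AI agent."""
    def reply(self, transcript: list[str]) -> str: ...


class ScriptedOperator:
    """Stand-in operator that plays back fixed lines (for illustration only)."""
    def __init__(self, lines: list[str]) -> None:
        self._lines = list(lines)

    def reply(self, transcript: list[str]) -> str:
        return self._lines.pop(0) if self._lines else "Thanks, that is all."


def stub_persona(message: str) -> str:
    """Toy persona: discloses the restructure only under a direct question."""
    if "how long" in message.lower():
        return "We restructured eight months ago."
    return "Happy to help, what else do you need?"


def run_session(operator: Operator, persona_reply: Callable[[str], str],
                opening: str, max_turns: int = 2) -> list[str]:
    """Run the same loop regardless of who or what the operator is."""
    transcript = [opening]
    for _ in range(max_turns):
        message = operator.reply(transcript)
        transcript.append(message)
        transcript.append(persona_reply(message))
    return transcript


human_stand_in = ScriptedOperator(["How long has the business operated in its current form?"])
agent_stand_in = ScriptedOperator(["What is your monthly income?"])

human_transcript = run_session(human_stand_in, stub_persona, "Hi, I'd like to apply for a loan.")
agent_transcript = run_session(agent_stand_in, stub_persona, "Hi, I'd like to apply for a loan.")
```

Because both transcripts come from the same loop and the same persona, they can be scored against the same criteria and compared side by side.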

Moving up the ladder

A conversation measures one session. Some work only reveals its quality over time: the plan made in week one either holds or unravels by week six. When the question is longitudinal, extend the conversation into a journey simulation.