Quickstart
This walkthrough takes you from an empty organization to your first scored session. Plan on building one small scenario and running it yourself before involving anyone else.
Create a role
A role, like Claims Adjuster or Case Manager, is the container that organizes your scenarios, agents, and benchmarks. Pick the job whose judgment you want to capture and create a role for it. Everything else in this walkthrough happens inside that role. See roles and members.
Build your first scenario
Start small. A decision task is the fastest first scenario: present the operator with a set of inputs, such as a claim file or an application, and ask for one judgment. If your domain is conversation-driven, build a simple conversation scenario instead: a persona holds the facts, and the operator has to ask for them.
Either way, define at least one success metric (what the operator should accomplish) and one failure metric (the error they must avoid). These are what the session gets scored against. See defining good.
Run a session yourself
Before assigning the scenario to anyone, complete it yourself. You will see exactly what an operator sees: the briefing, then the task or the conversation. Running it yourself surfaces problems fast: a briefing that gives the answer away, a persona that volunteers too much, a judgment that cannot be made from the inputs provided.
Review the scored result
When your session ends, it is scored against the criteria you defined. The report shows which criteria were met, which were missed, and why. Check that the score matches your own sense of how you did. If it does not, the criteria need work, not the scoring. Tighten them and run again. See results.
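To make the relationship between criteria and the report concrete, here is a minimal sketch in Python. It is an illustration of the idea only, not the product's scoring engine or API: the `Criterion` class, the `score_session` function, and the example criterion names are all hypothetical. It assumes a success metric counts as met when the behavior occurred, and a failure metric counts as met when the error was avoided.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """Hypothetical model of one metric; not the product's data model."""
    name: str
    kind: str  # "success" (should happen) or "failure" (must not happen)


def score_session(criteria, observed):
    """Return one report row per criterion.

    `observed` is the set of criterion names whose described behavior
    actually occurred during the session (an assumption of this sketch).
    """
    report = []
    for c in criteria:
        if c.kind not in ("success", "failure"):
            raise ValueError(f"unknown criterion kind: {c.kind!r}")
        occurred = c.name in observed
        # A success metric is met when it occurred;
        # a failure metric is met when the error did NOT occur.
        met = occurred if c.kind == "success" else not occurred
        report.append({"criterion": c.name, "kind": c.kind, "met": met})
    return report


# Example criteria for a claims decision task (illustrative names only).
criteria = [
    Criterion("identifies the coverage exclusion", "success"),
    Criterion("approves payment without documentation", "failure"),
]

report = score_session(criteria, observed={"identifies the coverage exclusion"})
for row in report:
    status = "met" if row["met"] else "missed"
    print(f'{row["kind"]:8} {row["criterion"]}: {status}')
```

If a row in a report like this disagrees with your own judgment of the session, that points to a criterion worded too loosely, which is the signal to tighten it and run again.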
Run it with people, artificial intelligence (AI) agents, or both
Once the scenario scores you the way you would score yourself, put real operators in the seat:
Your people
Create a cohort: a batch of scenario assignments sent to your team. Each person’s session is scored against the same criteria.
Your AI
Configure an agent and assign it the same scenario. Its sessions are scored identically, so you can compare it directly against your people.