Decision Tasks
A decision task hands the operator everything up front: the case file, the context, the constraints. The operator reads it and produces a judgment, with the reasoning behind it. There is no dialogue to manage and no information to chase down. What gets measured is the decision itself. This is the simplest scenario shape, and the right place to start if your question is “do our people (or our AI) make the right call when the facts are in front of them?”
When to use a decision task
- Classification problems: approve or deny, eligible or ineligible, urgent or routine
- Triage: which queue, which specialist, which priority
- Escalation calls: handle it, refer it, or stop and flag it
- Document review: is this application complete, is this report consistent, does this claim hold together
What you configure
A decision task uses the standard scenario anatomy with the conversational parts switched off or minimized:

| Component | Role in a decision task |
|---|---|
| Documents | Carry the case material itself: forms, reports, transcripts, records |
| Briefing | Frames the operator’s role and what they are being asked to decide |
| Decision definitions | The choice the operator must make, with its available options |
| Artifact definitions | The written rationale, assessment note, or recommendation that accompanies the decision |
| Criteria | Success metrics, failure metrics, and scope boundaries that define the right call and the calls that must never be made |
Worked example
An income protection insurer wants to know whether claims assessors apply the policy’s work-capacity test consistently.
- Documents: a claim file containing the claimant’s statement, an employer report, and two medical assessments that point in slightly different directions
- Decision: a single decision with three options: continue payments, suspend payments, or request an independent medical examination
- Artifact: a written assessment note justifying the choice against the policy wording
- Success metrics: identifies the conflict between the two medical assessments; applies the work-capacity test from the policy, not a general impression of severity
- Failure metric: suspends payments based on the employer report alone
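The worked example can be sketched as plain data to show how its parts fit together. This is a minimal illustration, not the product's configuration format: the class names, field names, and option identifiers (`DecisionTask`, `DecisionDefinition`, `payment_decision`, `request_ime`, and so on) are all hypothetical. The briefing is left out because the example above does not specify one.

```python
from dataclasses import dataclass, field


@dataclass
class DecisionDefinition:
    """The choice the operator must make, with its available options."""
    name: str
    options: list[str]


@dataclass
class DecisionTask:
    """Hypothetical container for the components the worked example specifies."""
    documents: list[str]
    decision: DecisionDefinition
    artifact: str
    success_metrics: list[str] = field(default_factory=list)
    failure_metrics: list[str] = field(default_factory=list)

    def is_valid_choice(self, choice: str) -> bool:
        # A submitted decision must be one of the defined options.
        return choice in self.decision.options


# All identifiers below are illustrative; the descriptions come from the example.
work_capacity_task = DecisionTask(
    documents=[
        "claimant statement",
        "employer report",
        "medical assessment 1",
        "medical assessment 2",
    ],
    decision=DecisionDefinition(
        name="payment_decision",
        options=["continue_payments", "suspend_payments", "request_ime"],
    ),
    artifact="written assessment note justifying the choice against the policy wording",
    success_metrics=[
        "identifies the conflict between the two medical assessments",
        "applies the work-capacity test from the policy, not a general impression of severity",
    ],
    failure_metrics=[
        "suspends payments based on the employer report alone",
    ],
)
```

Writing the task out this way makes one property of the shape visible: everything the operator needs is fixed before the task starts, and the only free variables are the chosen option and the written rationale.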
Running it
- With your people: assign the task through a cohort. Operators complete it in one sitting; you see decisions, rationales, and scores per operator.
- With an AI agent: run it in an automated benchmark. Because the inputs are fixed, decision tasks are the cheapest shape to run at volume, which makes them well suited to comparing several agents or prompt versions on the same case set.
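
To illustrate why fixed inputs make comparison straightforward, here is a rough sketch of scoring several agents on the same case set. It assumes a simplification the text does not state: that each case has a single expected decision, and that an agent's score is the fraction of cases where its decision matches. Real scoring against success and failure metrics would be richer. The function name, agent names, case IDs, and results are all made up for illustration.

```python
from collections import defaultdict


def score_agents(results, expected):
    """Return the fraction of matching decisions per agent.

    results:  iterable of (agent, case_id, decision) tuples
    expected: dict mapping case_id to the expected decision (an assumption)
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for agent, case_id, decision in results:
        total[agent] += 1
        if decision == expected[case_id]:
            correct[agent] += 1
    return {agent: correct[agent] / total[agent] for agent in total}


# Made-up case set and results, for illustration only.
expected = {"case_1": "request_ime", "case_2": "continue_payments"}
results = [
    ("agent_a", "case_1", "request_ime"),
    ("agent_a", "case_2", "continue_payments"),
    ("agent_b", "case_1", "suspend_payments"),
    ("agent_b", "case_2", "continue_payments"),
]

scores = score_agents(results, expected)
```

Because every agent sees the same cases, the scores are directly comparable without adjusting for differences in what each agent was asked.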