Decision Tasks

A decision task hands the operator everything up front: the case file, the context, the constraints. The operator reads it and produces a judgment, with the reasoning behind it. There is no dialogue to manage and no information to chase down. What gets measured is the decision itself. This is the simplest scenario shape, and the right place to start if your question is “do our people (or our AI) make the right call when the facts are in front of them?”

When to use a decision task

  • Classification problems: approve or deny, eligible or ineligible, urgent or routine
  • Triage: which queue, which specialist, which priority
  • Escalation calls: handle it, refer it, or stop and flag it
  • Document review: is this application complete, is this report consistent, does this claim hold together
If the hard part of the job is the judgment rather than the interaction, a decision task isolates exactly that.

What you configure

A decision task uses the standard scenario anatomy with the conversational parts switched off or minimized:
Component             Role in a decision task
Documents             Carry the case material itself: forms, reports, transcripts, records
Briefing              Frames the operator’s role and what they are being asked to decide
Decision definitions  The choice the operator must make, with its available options
Artifact definitions  The written rationale, assessment note, or recommendation that accompanies the decision
Criteria              Success metrics, failure metrics, and scope boundaries that define the right call and the calls that must never be made
The decision and its rationale are both scored. A correct decision reached by the wrong reasoning is a finding, not a pass: the rationale is where you see whether the judgment will transfer to the next case.
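
The scoring rule can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the product's scoring engine: the names `ScoredResult` and `outcome` are hypothetical, and so is the "fail" label for a wrong decision, which the text above does not name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredResult:
    """Hypothetical record of one operator's scored submission."""
    decision_correct: bool  # did the operator pick the expected option?
    rationale_sound: bool   # did the rationale satisfy the success metrics?


def outcome(result: ScoredResult) -> str:
    """Combine the two scores into one label (labels are illustrative)."""
    if result.decision_correct and result.rationale_sound:
        return "pass"
    if result.decision_correct:
        # Right call, wrong reasoning: flagged rather than passed, because
        # the judgment may not transfer to the next case.
        return "finding"
    return "fail"  # assumption: the source does not name this case


print(outcome(ScoredResult(decision_correct=True, rationale_sound=False)))
```

Keeping the two scores as separate fields, rather than collapsing them into one number, is what makes the "finding" case visible at all.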

Worked example

An income protection insurer wants to know whether claims assessors apply the policy’s work-capacity test consistently.
  • Documents: a claim file containing the claimant’s statement, an employer report, and two medical assessments that point in slightly different directions
  • Decision: a single decision with three options: continue payments, suspend payments, or request an independent medical examination
  • Artifact: a written assessment note justifying the choice against the policy wording
  • Success metrics: identifies the conflict between the two medical assessments; applies the work-capacity test from the policy, not a general impression of severity
  • Failure metric: suspends payments based on the employer report alone
Every assessor sees the identical file. Every AI agent sees the identical file. The scored results show who applied the test, who pattern-matched on severity words, and how the reasoning differed, case by case.
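
As a rough sketch, the worked example could be written down as structured data. The `DecisionTask` class and its field names are hypothetical, chosen to mirror the components listed earlier; they are not the product's configuration format.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DecisionTask:
    """Hypothetical container for the parts of a decision task."""
    documents: list[str]
    options: list[str]
    artifact: str
    success_metrics: list[str] = field(default_factory=list)
    failure_metrics: list[str] = field(default_factory=list)

    def validate(self, decision: str) -> None:
        """Reject any decision that is not one of the defined options."""
        if decision not in self.options:
            raise ValueError(f"unknown option: {decision!r}")


work_capacity_task = DecisionTask(
    documents=[
        "claimant statement",
        "employer report",
        "medical assessment 1",
        "medical assessment 2",
    ],
    options=[
        "continue payments",
        "suspend payments",
        "request independent medical examination",
    ],
    artifact="written assessment note",
    success_metrics=[
        "identifies the conflict between the two medical assessments",
        "applies the work-capacity test from the policy",
    ],
    failure_metrics=["suspends payments based on the employer report alone"],
)

# A submitted decision must be one of the three defined options.
work_capacity_task.validate("suspend payments")
```

Because the definition is fixed data, the same object can be handed unchanged to every assessor and every agent, which is what makes their results comparable.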

Running it

  • With your people: assign the task through a cohort. Operators complete it in one sitting; you see decisions, rationales, and scores per operator.
  • With an AI agent: run it in an automated benchmark. Because the inputs are fixed, decision tasks are the cheapest shape to run at volume, which makes them well suited to comparing several agents or prompt versions on the same case set.
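
The comparison of agents on a fixed case set can be sketched as follows. Everything here is an assumption made for illustration: an "agent" is modeled as a plain function from case text to a chosen option, and the two placeholder agents stand in for real model calls or prompt versions. It does not show the product's benchmark interface.

```python
from __future__ import annotations

from collections import Counter
from typing import Callable

# Hypothetical agent type: takes the case file text, returns the chosen option.
Agent = Callable[[str], str]


def tally_decisions(agents: dict[str, Agent], cases: list[str]) -> dict[str, Counter]:
    """Run every agent on the same cases and count the options each one chose."""
    return {
        name: Counter(agent(case) for case in cases)
        for name, agent in agents.items()
    }


# Placeholder agents standing in for real model calls or prompt versions.
def always_continue(case_file: str) -> str:
    return "continue payments"


def keyword_agent(case_file: str) -> str:
    if "conflict" in case_file:
        return "request independent medical examination"
    return "continue payments"


cases = ["assessments conflict on capacity", "assessments agree"]
results = tally_decisions(
    {"prompt_v1": always_continue, "prompt_v2": keyword_agent}, cases
)
for name, counts in results.items():
    print(name, dict(counts))
```

Since the inputs never change between runs, any difference in the tallies comes from the agents themselves rather than from the cases.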

Moving up the ladder

A decision task measures the judgment but assumes the facts arrive complete. In most real work, they do not: someone has to ask the right questions first. When you want to measure that part too, wrap the same judgment in a conversation scenario, where the persona holds the facts and reveals them only to operators who earn them.