Results and Scoring

Scoring on Tacit follows one rule: the definition of good is written down before anyone runs the scenario, and every session is measured against it. People and AI agents are scored against the same criteria, so their results are directly comparable.

What gets measured

A session is scored against whatever its scenario defines in Defining Good:
Criteria family | The question it answers
----------------|------------------------
Success metrics | Did the operator accomplish what the scenario set out?
Failure metrics | Did they avoid the errors that must never happen?
Rubrics | Where does the performance sit on each dimension, level by level?
Scope boundaries | Did they stay within their authority, escalate what needed escalating, and avoid what is prohibited?
Terminology | Did they use the domain’s language correctly?
Outputs | Do the artifacts and decisions produced hold up against their definitions and exemplars?
Depending on the scenario, scoring covers the conversation, the outputs, or both. A decision task is scored on its decision and rationale; a conversation is also scored on what was elicited and how; a journey is also scored on what held up across sessions.

Reading a report

A scored session shows each criterion with its result and the reasoning behind it, grounded in the transcript and outputs. You can see not only that a failure metric fired but also the moment in the conversation it points to. The score is an entry point into the session, not a verdict that replaces it.
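To make the shape of a report concrete, here is a minimal sketch of how one scored criterion might be represented. It is an illustration only: the class, the field names, the outcome labels, and the sample data are all assumptions, not Tacit's actual schema or export format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CriterionResult:
    """One scored criterion (hypothetical structure, not Tacit's schema)."""
    family: str                            # e.g. "success_metric", "failure_metric"
    name: str                              # criterion name from the scenario
    outcome: str                           # e.g. "met", "not_met", "fired"
    reasoning: str                         # explanation shown with the result
    transcript_turn: Optional[int] = None  # turn the reasoning points to


def fired_failures(results):
    """Return (name, turn) for every failure metric that fired."""
    return [
        (r.name, r.transcript_turn)
        for r in results
        if r.family == "failure_metric" and r.outcome == "fired"
    ]


# Made-up report for illustration.
report = [
    CriterionResult("success_metric", "issue_resolved", "met",
                    "Operator confirmed the fix with the customer.", 14),
    CriterionResult("failure_metric", "refund_promised_without_authority", "fired",
                    "Operator committed to a refund before escalating.", 9),
]

for name, turn in fired_failures(report):
    print(f"{name}: see transcript turn {turn}")
```

The point of the `transcript_turn` field in this sketch is the one the report makes: each result carries a pointer back into the session, so the score leads you to the evidence rather than standing in for it.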

Comparing results

Because every session of a given scenario version is scored against the same criteria, results aggregate cleanly:
  • Across operators - a cohort shows per-operator results on the same scenarios, so patterns separate from individuals
  • Across agents - a benchmark run shows the same breakdown for each agent, prompt version, or model configuration you test
  • People against AI - the same scenario, the same criteria, side by side
Treat a single session as one observation, not a conclusion. The platform makes it cheap to run a scenario several times; the spread across repeated sessions tells you what is signal and what is one good or bad day.
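As a rough illustration of reading spread rather than a single score, the sketch below summarizes repeated sessions by mean and sample standard deviation. The 0 to 4 rubric scale, the scores, and the `summarize` helper are assumptions made up for this example; they do not come from Tacit.

```python
from statistics import mean, stdev


def summarize(scores):
    """Return (mean, sample standard deviation) for repeated-session scores."""
    if len(scores) < 2:
        raise ValueError("need at least two sessions to estimate spread")
    return mean(scores), stdev(scores)


# Made-up rubric scores for one scenario, five sessions per operator.
operator_a = [3, 3, 4, 3, 3]
operator_b = [2, 4, 2, 4, 4]

mean_a, spread_a = summarize(operator_a)
mean_b, spread_b = summarize(operator_b)

print(f"A: mean={mean_a:.2f} spread={spread_a:.2f}")
print(f"B: mean={mean_b:.2f} spread={spread_b:.2f}")
```

Both operators have the same mean here, but operator B's scores vary far more from session to session. Any one session from B could look excellent or poor, which is why a single result should be treated as an observation.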

Improving the criteria

Criteria are not fixed forever. Expert review sessions and their reflections surface gaps: a success metric that no expert actually hits, a failure metric that fires on reasonable behavior, terminology the scenario never defined. Refine the criteria, and the next sessions are scored against the better definition. The criteria are versioned with the scenario, so cohorts keep the definition they started with.
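One way to picture criteria that are versioned with the scenario is a cohort that pins the version current when it started. The sketch below is a hypothetical model: `Scenario`, `CriteriaVersion`, `Cohort`, and `start_cohort` are illustrative names, not part of the platform.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CriteriaVersion:
    """Immutable snapshot of a scenario's criteria (hypothetical)."""
    version: int
    failure_metrics: tuple


@dataclass
class Scenario:
    name: str
    versions: list = field(default_factory=list)

    def publish(self, failure_metrics):
        """Add a new criteria version; earlier versions stay untouched."""
        snapshot = CriteriaVersion(len(self.versions) + 1, tuple(failure_metrics))
        self.versions.append(snapshot)
        return snapshot

    def latest(self):
        return self.versions[-1]


@dataclass
class Cohort:
    name: str
    pinned: CriteriaVersion  # fixed when the cohort starts


def start_cohort(name, scenario):
    """Pin the cohort to whatever criteria version is current right now."""
    return Cohort(name, scenario.latest())


scenario = Scenario("billing-dispute")
scenario.publish(["mentions_refund"])
first = start_cohort("first-cohort", scenario)

# Expert review finds the metric fires on reasonable behavior, so it is tightened.
scenario.publish(["promises_refund_without_authority"])
second = start_cohort("second-cohort", scenario)

print(first.pinned.version, second.pinned.version)
```

In this model, refining the criteria never rewrites history: the first cohort's results stay comparable with each other, and only sessions started afterwards are scored against the new definition.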