Results and Scoring
Scoring on Tacit follows one rule: the definition of good is written down before anyone runs the scenario, and every session is measured against it. People and AI agents are scored against the same criteria, so their results are directly comparable.

What gets measured
A session is scored against whatever its scenario defines in Defining Good:

| Criteria family | The question it answers |
|---|---|
| Success metrics | Did the operator accomplish what the scenario set out? |
| Failure metrics | Did they avoid the errors that must never happen? |
| Rubrics | Where does the performance sit on each dimension, level by level? |
| Scope boundaries | Did they stay within their authority, escalate what needed escalating, and avoid what is prohibited? |
| Terminology | Did they use the domain’s language correctly? |
| Outputs | Do the artifacts and decisions produced hold up against their definitions and exemplars? |

Reading a report
A scored session shows each criterion with its result and the reasoning behind it, grounded in the transcript and outputs. You can see not just that a failure metric fired, but the moment in the conversation it points to. The score is an entry point into the session, not a verdict that replaces it.

Comparing results
Because criteria are fixed per scenario, results aggregate cleanly:

- Across operators - a cohort shows per-operator results on the same scenarios, so patterns separate from individuals
- Across agents - a benchmark run shows the same breakdown for each agent, prompt version, or model configuration you test
- People against AI - the same scenario, the same criteria, side by side

Treat a single session as one observation, not a conclusion. The platform makes it cheap to run a scenario several times; the spread across repeated sessions tells you what is signal and what is one good or bad day.
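
To make the point about repeated sessions concrete, here is a minimal sketch of summarizing several runs of one scenario. It is not Tacit's export format or API: the record layout, the names `sessions` and `summarize`, the operator and scenario labels, and the idea that each session reduces to a single 0-100 score are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (operator or agent, scenario, overall score 0-100).
# A single numeric score per session is an assumption of this sketch.
sessions = [
    ("operator_a", "refund_dispute", 80),
    ("operator_a", "refund_dispute", 60),
    ("operator_a", "refund_dispute", 70),
    ("agent_v2", "refund_dispute", 90),
    ("agent_v2", "refund_dispute", 90),
    ("agent_v2", "refund_dispute", 60),
]


def summarize(records):
    """Group sessions by (operator, scenario) and report run count, mean, and range."""
    grouped = defaultdict(list)
    for operator, scenario, score in records:
        grouped[(operator, scenario)].append(score)
    return {
        key: {
            "runs": len(scores),
            "mean": mean(scores),
            # Range (max minus min) as a simple measure of run-to-run spread.
            "spread": max(scores) - min(scores),
        }
        for key, scores in grouped.items()
    }


for (operator, scenario), stats in summarize(sessions).items():
    print(operator, scenario, stats)
```

In this made-up data the agent has the higher mean but also the wider range, which is the kind of pattern a single session would hide. With more runs, a standard deviation or per-criterion breakdown would say more than a range over one overall number.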