Results and Scoring

Scoring on Tacit follows one rule: the definition of good is written down before anyone runs the scenario, and every session is measured against it. People and AI agents are scored against the same criteria, so their results are directly comparable.

What gets measured

A session is scored against whatever its scenario defines in Defining Good:

Criteria family	The question it answers
Success metrics	Did the operator accomplish what the scenario set out?
Failure metrics	Did they avoid the errors that must never happen?
Rubrics	Where does the performance sit on each dimension, level by level?
Scope boundaries	Did they stay within their authority, escalate what needed escalating, and avoid what is prohibited?
Terminology	Did they use the domain’s language correctly?
Outputs	Do the artifacts and decisions produced hold up against their definitions and exemplars?

Depending on the scenario, scoring covers the conversation, the outputs, or both. A decision task is scored on its decision and rationale. A conversation is also scored on what was elicited and how. A journey is also scored on what held up across sessions.

Reading a report

A scored session shows each criterion with its result and the reasoning behind it, grounded in the transcript and outputs. You can see not just that a failure metric fired, but the moment in the conversation it points to. The score is an entry point into the session, not a verdict that replaces it.

Comparing results

Because every session of a scenario is scored against the same criteria, results aggregate cleanly:

Across operators - a cohort shows per-operator results on the same scenarios, so patterns separate from individuals
Across agents - a benchmark run shows the same breakdown for each agent, prompt version, or model configuration you test
People against AI - the same scenario, the same criteria, side by side
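
As a rough illustration of why shared criteria make these comparisons possible, the sketch below tallies how often each operator met each criterion. The record shape `(operator, criterion, met)`, the operator labels, and the criterion name are all hypothetical; this is not Tacit's export format or API.

```python
from collections import defaultdict


def met_rate(records):
    """Return the fraction of sessions in which each operator met each criterion.

    `records` is an iterable of (operator, criterion, met) tuples.
    This shape is an assumption made for illustration only.
    """
    counts = defaultdict(lambda: [0, 0])  # (operator, criterion) -> [met, total]
    for operator, criterion, met in records:
        tally = counts[(operator, criterion)]
        tally[0] += int(met)
        tally[1] += 1
    return {key: met_count / total for key, (met_count, total) in counts.items()}


# Hypothetical sessions: one person and one agent on the same scenario.
records = [
    ("human:ana", "confirms_identity", True),
    ("human:ana", "confirms_identity", True),
    ("agent:v2", "confirms_identity", True),
    ("agent:v2", "confirms_identity", False),
]

rates = met_rate(records)
for (operator, criterion), rate in sorted(rates.items()):
    print(f"{operator:10s} {criterion:20s} {rate:.0%}")
```

Because both operators are keyed by the same criterion name, the person and the agent land in the same table without any conversion step.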

Treat a single session as one observation, not a conclusion. The platform makes it cheap to run a scenario several times, and the spread across repeated sessions tells you what is signal and what is just one good or bad day.
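
To make the point about repeated sessions concrete, here is a minimal sketch that summarizes several runs of one scenario by their mean and sample standard deviation. It assumes each session can be reduced to a single number between 0 and 1, which this page does not state; the scores themselves are invented for illustration.

```python
from statistics import mean, stdev


def summarize(scores):
    """Return (mean, sample standard deviation) for repeated session scores."""
    if len(scores) < 2:
        # One session is a single observation; spread is undefined.
        raise ValueError("need at least two sessions to estimate spread")
    return mean(scores), stdev(scores)


# Hypothetical scores for two operators on the same scenario.
steady = [0.80, 0.82, 0.78, 0.80]
erratic = [0.95, 0.60, 0.90, 0.75]

steady_mean, steady_spread = summarize(steady)
erratic_mean, erratic_spread = summarize(erratic)

print(f"steady:  mean={steady_mean:.2f} spread={steady_spread:.3f}")
print(f"erratic: mean={erratic_mean:.2f} spread={erratic_spread:.3f}")
```

Both operators average the same score, but the second one's results vary far more from run to run. A single session from either could have suggested the opposite ranking.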

Improving the criteria

Criteria are not fixed forever. Expert review sessions and their reflections surface gaps: a success metric that no expert actually hits, a failure metric that fires on reasonable behavior, terminology the scenario never defined. Refine the criteria, and the next sessions are scored against the better definition. The criteria are versioned with the scenario, so cohorts keep the definition they started with.
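
One way to picture versioned criteria is as immutable snapshots that a cohort pins when it starts. The sketch below is a hypothetical model, not Tacit's data model: the class names `CriteriaVersion`, `Scenario`, and `Cohort`, and the example metrics, are assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CriteriaVersion:
    """An immutable snapshot of a scenario's criteria (hypothetical model)."""
    number: int
    success_metrics: tuple[str, ...]


@dataclass
class Scenario:
    versions: list[CriteriaVersion] = field(default_factory=list)

    def publish(self, success_metrics):
        """Append a new criteria version; earlier versions are left untouched."""
        version = CriteriaVersion(len(self.versions) + 1, tuple(success_metrics))
        self.versions.append(version)
        return version

    @property
    def latest(self):
        return self.versions[-1]


@dataclass(frozen=True)
class Cohort:
    name: str
    criteria: CriteriaVersion  # pinned when the cohort starts


scenario = Scenario()
scenario.publish(["resolves the request"])
spring = Cohort("spring", scenario.latest)

# Expert review surfaces a gap, so the criteria are refined.
scenario.publish(["resolves the request", "confirms next steps"])
autumn = Cohort("autumn", scenario.latest)

print(spring.criteria.number, autumn.criteria.number)
```

The earlier cohort still points at the first version after the refinement, so its results remain comparable with each other, while new cohorts pick up the improved definition.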
