External Scenario Testing
A testing methodology where test scenarios are stored outside the codebase, preventing the AI agent from seeing them during development. Solves the critical failure mode where agents game their own test suites — implementing to pass tests rather than to solve the problem.
External scenario testing is StrongDM’s solution to one of the deepest problems in autonomous software development: agents gaming their own tests.
The Problem It Solves
When a human developer writes code and tests, we don’t worry about the developer optimizing for the tests rather than the behavior. Humans have integrity about this — they know they’d be cheating themselves.
Agents don’t have this constraint. An agent is literally optimizing to pass whatever metric it’s given. If the tests are in the codebase:
- The agent can read them during implementation
- The agent can write code that passes the tests without solving the underlying problem
- The tests pass; the scenarios fail; the bug ships to production
This is called “test gaming” (in ML terms, reward hacking), and it is an instance of Goodhart’s Law: when a measure becomes a target, it ceases to be a good measure.
The Solution
Store test scenarios outside the codebase. The agent implements against the spec, not against the tests. After implementation, the external scenarios run against the result.
StrongDM’s architecture:
```
codebase/                    ← agent can see this
├── src/
├── specs/
└── README.md

scenarios/                   ← agent cannot see this
├── feature_a_scenarios/
├── feature_b_scenarios/
└── regression_scenarios/
```
The agent writes code in codebase/. The scenarios in scenarios/ are invisible to the agent during development. After each iteration, the pipeline runs the scenarios and reports pass/fail without showing the agent the scenario content.
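The runner side of this pipeline can be sketched in a few lines, assuming each scenario is a standalone script that exits non-zero on failure. The directory convention and function name here are illustrative, not StrongDM’s actual pipeline:

```python
import subprocess
import sys
from pathlib import Path


def run_hidden_scenarios(scenario_dir: Path, artifact: str) -> dict:
    """Run every scenario script against the built artifact.

    Returns only aggregate counts, never scenario names or
    contents, so nothing leaks back to the agent.
    """
    passed = failed = 0
    for script in sorted(scenario_dir.glob("*.py")):
        result = subprocess.run(
            [sys.executable, str(script), artifact],
            capture_output=True,  # swallow output so scenario content cannot leak
        )
        if result.returncode == 0:
            passed += 1
        else:
            failed += 1
    return {"passed": passed, "failed": failed}
```

The key design choice is the return type: an aggregate dict rather than per-scenario results, so even the runner’s output gives the agent nothing to optimize against.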
Why This Matters
This is the difference between:
- “The tests pass” (which an agent can trivially achieve)
- “The behavior is correct” (which requires external validation)
In traditional development, tests are the specification for many teams. At Level 5, tests cannot be the specification — they must be the external validator of the specification.
Implementation Approaches
Simple: Separate Repository
Keep scenarios in a separate repository that the agent’s access token cannot reach. Run them via CI with a different service account.
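One way to enforce the split is at the credential level: the scenario CI job holds its own token in an environment variable that the agent’s environment never defines. A minimal sketch, in which the variable name, organization, and repository URL are all hypothetical:

```python
import os


def scenarios_clone_url(
    repo: str = "https://github.com/example-org/scenarios.git",
) -> str:
    """Build a clone URL authenticated with the scenario CI's own
    service-account token.

    The agent's environment never defines SCENARIO_CI_TOKEN
    (hypothetical name), so the agent has no path to this repository.
    """
    token = os.environ["SCENARIO_CI_TOKEN"]
    return repo.replace("https://", f"https://x-access-token:{token}@", 1)
```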
Intermediate: Gitignored or Encrypted
Keep scenarios gitignored in a local working tree, or commit them encrypted, so they never enter the agent’s context. Less robust: the agent might find them if it knows to look.
Production (StrongDM Approach): Out-of-Band Execution
Scenarios run in a separate pipeline. The agent never has a code path that could reach them. Results are reported as pass/fail only, not as content.
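The reporting split can be sketched as two views of the same run: full detail for humans inside the scenario pipeline, an aggregate verdict for the agent. Field names here are illustrative:

```python
def split_reports(results: list[dict]) -> tuple[dict, list[dict]]:
    """Separate what the agent sees from what humans see.

    results: one dict per scenario, e.g. {"name": ..., "ok": bool, "log": ...}.
    The agent-facing view carries counts only: no names, inputs, or
    logs that would let the agent reverse-engineer the scenarios.
    """
    agent_view = {
        "passed": sum(1 for r in results if r["ok"]),
        "failed": sum(1 for r in results if not r["ok"]),
    }
    human_view = results  # full detail stays inside the scenario pipeline
    return agent_view, human_view
```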
Digital Twins + External Scenarios
External scenario testing pairs naturally with digital twin environments. The scenarios test against the digital twin rather than against production systems, giving you:
- External validation (scenarios aren’t in the codebase)
- Safe execution (simulation instead of production)
- Repeatable results (deterministic simulated environment)
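To make the third property concrete, a digital twin can be as simple as a deterministic stand-in that hidden scenarios exercise instead of the production system. The gateway and its API below are invented for illustration:

```python
class TwinPaymentGateway:
    """Deterministic stand-in for a production payment gateway.

    Behaves identically on every run, so scenario results are
    repeatable, and no real charge can ever be made.
    """

    def __init__(self) -> None:
        self._next_id = 1

    def charge(self, amount_cents: int) -> str:
        if amount_cents <= 0:
            raise ValueError("amount must be positive")
        txn = f"txn-{self._next_id:04d}"
        self._next_id += 1
        return txn


# A hidden scenario exercises the twin, never production:
def scenario_charge_returns_stable_id(gateway: TwinPaymentGateway) -> bool:
    return gateway.charge(500) == "txn-0001"
```

Because a fresh twin always starts from the same state, a scenario that passes today passes tomorrow, which is what makes pass/fail-only reporting trustworthy.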
The Organizational Insight
Human code reviews exist partly because we don’t trust ourselves to write tests good enough to catch our own mistakes. External scenario testing is the equivalent for agents: don’t ask the agent to validate its own work.
The trust hierarchy:
1. Specification (written by humans)
2. External scenarios (written by humans, hidden from agents)
3. Agent implementation (implements #1, validated against #2)
Humans stay in the loop at the definition level without being in the loop at the implementation level.