External Scenario Testing
A testing methodology where test scenarios are stored outside the codebase, preventing the AI agent from seeing them during development. Solves the critical failure mode where agents game their own test suites — implementing to pass tests rather than to solve the problem.
External scenario testing is StrongDM’s solution to one of the deepest problems in autonomous software development: agents gaming their own tests.
The Problem It Solves
When a human developer writes code and tests, we don’t worry about the developer optimizing for the tests rather than the behavior. Humans have integrity about this — they know they’d be cheating themselves.
Agents don’t have this constraint. An agent is literally optimizing to pass whatever metric it’s given. If the tests are in the codebase:
- The agent can read them during implementation
- The agent can write code that passes the tests without solving the underlying problem
- The tests pass; the scenarios fail; the bug ships to production
This is called “test gaming” (in ML terms, reward hacking), and it is an instance of Goodhart’s Law: when a measure becomes a target, it ceases to be a good measure.
The Solution
Store test scenarios outside the codebase. The agent implements against the spec, not against the tests. After implementation, the external scenarios run against the result.
StrongDM’s architecture:
```
codebase/                    ← agent can see this
├── src/
├── specs/
└── README.md

scenarios/                   ← agent cannot see this
├── feature_a_scenarios/
├── feature_b_scenarios/
└── regression_scenarios/
```
The agent writes code in codebase/. The scenarios in scenarios/ are invisible to the agent during development. After each iteration, the pipeline runs the scenarios and reports pass/fail without showing the agent the scenario content.
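The runner side of this pipeline can be sketched in a few lines, assuming each scenario is a standalone script that exits non-zero on failure. The directory convention and function name here are illustrative, not StrongDM’s actual pipeline:

```python
import subprocess
import sys
from pathlib import Path


def run_hidden_scenarios(scenario_dir: Path, artifact: str) -> dict:
    """Run every scenario script against the built artifact.

    Returns only aggregate counts, never scenario names or
    contents, so nothing leaks back to the agent.
    """
    passed = failed = 0
    for script in sorted(scenario_dir.glob("*.py")):
        result = subprocess.run(
            [sys.executable, str(script), artifact],
            capture_output=True,  # swallow output so scenario content cannot leak
        )
        if result.returncode == 0:
            passed += 1
        else:
            failed += 1
    return {"passed": passed, "failed": failed}
```

The key design choice is the return type: an aggregate dict rather than per-scenario results, so even the runner’s output gives the agent nothing to optimize against.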
Why This Matters
This is the difference between:
- “The tests pass” (which an agent can trivially achieve)
- “The behavior is correct” (which requires external validation)
In traditional development, tests are the specification for many teams. At Level 5, tests cannot be the specification — they must be the external validator of the specification.
Implementation Approaches
Simple: Separate Repository
Keep scenarios in a separate repository that the agent’s access token cannot reach. Run them via CI with a different service account.
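One way to enforce the split is at the credential level: the scenario CI job holds its own token in an environment variable that the agent’s environment never defines. A minimal sketch, in which the variable name, organization, and repository URL are all hypothetical:

```python
import os


def scenarios_clone_url(
    repo: str = "https://github.com/example-org/scenarios.git",
) -> str:
    """Build a clone URL authenticated with the scenario CI's own
    service-account token.

    The agent's environment never defines SCENARIO_CI_TOKEN
    (hypothetical name), so the agent has no path to this repository.
    """
    token = os.environ["SCENARIO_CI_TOKEN"]
    return repo.replace("https://", f"https://x-access-token:{token}@", 1)
```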
Intermediate: Gitignored or Encrypted
Keep scenarios gitignored in a local working tree, or commit them encrypted, so they never enter the agent’s context. Less robust: the agent might find them if it knows to look.
Production (StrongDM Approach): Out-of-Band Execution
Scenarios run in a separate pipeline. The agent never has a code path that could reach them. Results are reported as pass/fail only, not as content.
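The reporting split can be sketched as two views of the same run: full detail for humans inside the scenario pipeline, an aggregate verdict for the agent. Field names here are illustrative:

```python
def split_reports(results: list[dict]) -> tuple[dict, list[dict]]:
    """Separate what the agent sees from what humans see.

    results: one dict per scenario, e.g. {"name": ..., "ok": bool, "log": ...}.
    The agent-facing view carries counts only: no names, inputs, or
    logs that would let the agent reverse-engineer the scenarios.
    """
    agent_view = {
        "passed": sum(1 for r in results if r["ok"]),
        "failed": sum(1 for r in results if not r["ok"]),
    }
    human_view = results  # full detail stays inside the scenario pipeline
    return agent_view, human_view
```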
Digital Twins + External Scenarios
External scenario testing pairs naturally with digital twin environments. The scenarios test against the digital twin rather than against production systems, giving you:
- External validation (scenarios aren’t in the codebase)
- Safe execution (simulation instead of production)
- Repeatable results (deterministic simulated environment)
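To make the third property concrete, a digital twin can be as simple as a deterministic stand-in that hidden scenarios exercise instead of the production system. The gateway and its API below are invented for illustration:

```python
class TwinPaymentGateway:
    """Deterministic stand-in for a production payment gateway.

    Behaves identically on every run, so scenario results are
    repeatable, and no real charge can ever be made.
    """

    def __init__(self) -> None:
        self._next_id = 1

    def charge(self, amount_cents: int) -> str:
        if amount_cents <= 0:
            raise ValueError("amount must be positive")
        txn = f"txn-{self._next_id:04d}"
        self._next_id += 1
        return txn


# A hidden scenario exercises the twin, never production:
def scenario_charge_returns_stable_id(gateway: TwinPaymentGateway) -> bool:
    return gateway.charge(500) == "txn-0001"
```

Because a fresh twin always starts from the same state, a scenario that passes today passes tomorrow, which is what makes pass/fail-only reporting trustworthy.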
The Organizational Insight
Human code reviews exist partly because we don’t trust ourselves to write tests good enough to catch our own mistakes. External scenario testing is the equivalent for agents: don’t ask the agent to validate its own work.
The trust hierarchy:
1. Specification (written by humans)
2. External scenarios (written by humans, hidden from agents)
3. Agent implementation (implements #1, validated against #2)
Humans stay in the loop at the definition level without being in the loop at the implementation level.