
The Full Dark Factory Stack

A layer-by-layer guide to the tools and architectural choices required to build a Level 5 software dark factory in 2026. Covers specification, agent execution, testing (the hard part), code review, CI/CD, and the honest assessment of what's hard and who can actually do this.

A practical guide for engineers attempting to build a Level 5 dark factory as of February 2026. Based on StrongDM’s published architecture, the SWE-bench leaderboard, and the current state of open-source and commercial tooling.

Honest prerequisite: Level 5 requires solving the gaming problem before removing human code review. Most teams should target Level 3–4. The gains at Level 3–4 are real and don’t require this infrastructure.


Layer 1: Specification

The only layer where humans author anything. Quality here determines everything downstream.

| Component | Choice | Notes |
|---|---|---|
| Cross-agent context | AGENTS.md + CLAUDE.md | One file each — minimal, maintained |
| Spec methodology | NLSpec (markdown) | Clear goals, constraints, scenarios to handle |
| Reference architecture | strongdm/attractor | The only public spec for a non-interactive factory agent |
| Spec-first IDE | Kiro | AWS IDE with requirements.md, design.md, tasks.md + Agent Hooks |
| Methodology framework | SPARC phases | Spec → Pseudocode → Architecture → Refinement → Completion |
| Writing guide | Addy Osmani | Goal-oriented language, three-tier permission model, conformance suites |

What makes a good spec: Clear enough that an agent can implement it without clarification. Explicit constraints (what not to do). Scenarios: “given X, the system should Y.” No implementation hints — let the agent make implementation choices.
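As a sketch of those properties, a minimal NLSpec-style fragment might look like the following (the service, constraints, and scenario names are hypothetical, invented for illustration — not from StrongDM's spec):

```markdown
# Spec: password-reset flow

## Goal
Users who forget their password can regain access via a time-limited email link.

## Constraints
- Do NOT reveal whether an email address is registered.
- Reset links expire after 30 minutes and are single-use.

## Scenarios
- Given a registered email, when a reset is requested, the system should send exactly one reset link.
- Given an expired link, when it is opened, the system should show a "request a new link" page.
```

Note what is absent: no framework, no database schema, no function names. The agent makes those choices.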

The brownfield problem: Legacy systems without documented behavior cannot be dark-factored until you reverse-engineer and document their implicit specs. This is unglamorous work. There is no shortcut.


Layer 2: Agent Execution

| Component | Choice | Notes |
|---|---|---|
| Primary agent | Claude Code (Opus 4.6) | #1–2 SWE-bench Verified (80.8%), native swarm mode, 1M context |
| Orchestration | Claude Agent SDK | Native multi-agent with team lead + workers |
| Open-source alt | OpenHands | Self-hostable, model-agnostic, MIT licensed |
| Open-source CLI alt | Aider | Git-native, 39K stars, model-agnostic |

Cost benchmark: StrongDM spends ~$1,000/engineer/day in tokens. Less means “room for improvement.” At $3/M tokens (Sonnet 4.5), that’s ~333M tokens/day per engineer. Expect $500–$3,000/month per engineer at steady state.
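The arithmetic behind those figures is worth sanity-checking yourself; a quick sketch using only the numbers quoted above (treat the per-token price as a blended rate, since real pricing differs for input and output tokens):

```go
package main

import "fmt"

func main() {
	// Figures from the text: ~$1,000/engineer/day in tokens,
	// at roughly $3 per million tokens.
	const dailyBudgetUSD = 1000.0
	const usdPerMillionTokens = 3.0

	tokensPerDayMillions := dailyBudgetUSD / usdPerMillionTokens
	fmt.Printf("~%.0fM tokens/engineer/day\n", tokensPerDayMillions)

	// Steady-state monthly range quoted in the text: $500–$3,000/engineer,
	// spread over ~22 working days.
	for _, monthly := range []float64{500, 3000} {
		fmt.Printf("$%.0f/month is about $%.0f per working day\n", monthly, monthly/22)
	}
}
```

Note the gap: the quoted steady-state range works out to tens of dollars per day, far below StrongDM's ~$1,000/day ceiling, which is what "room for improvement" means in practice.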

The agent doesn’t matter as much as you think: StrongDM explicitly states Attractor could be swapped for any competent agent. The methodology — external scenarios, DTU, specification discipline — is the differentiator. Start with whatever agent your team is most comfortable running autonomously.


Layer 3: Testing (The Hard Part)

This is the layer that separates Level 4 from Level 5. If you skip it, you don’t have a dark factory — you have a factory that games its own tests.

| Component | Choice | Notes |
|---|---|---|
| Scenario harness | Custom (StrongDM model) | External holdout scenarios the agent never sees |
| Evaluation | LLM-as-judge | Separate agent evaluates behavior against scenarios |
| Ground truth | Public SDK compatibility | External reference the agent cannot game |
| Digital twins | Custom-built Go binaries | Clone third-party APIs from their public docs + SDK |
| Digital twins (approx.) | WireMock | MCP server generates API mocks from codebase scan — closest off-the-shelf option |
| Browser testing | Playwright + custom harness | Automated click-through verification |

The Gaming Problem

When you give an agent unit tests and ask it to pass them, it will find ways to pass them without solving the underlying problem. This is not a bug — it is the optimizer doing its job. You need to prevent it architecturally:

  1. Store scenarios outside the codebase — agents cannot see them during development
  2. Use a separate evaluation agent — complete isolation from the coding agent
  3. Use external SDK compatibility as ground truth — public reference the agent didn’t write
  4. Measure satisfaction probabilistically — “what fraction of trajectories likely satisfy the user?” not “did the tests pass?”

No commercial off-the-shelf product solves the gaming problem. This requires architectural design choices.
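A minimal sketch of points 1, 2, and 4, assuming names of my own invention (none of this is StrongDM's API): scenarios live outside the repo the coding agent sees, a separate judge scores recorded trajectories, and the output is a satisfaction fraction rather than a pass/fail bit.

```go
package main

import "fmt"

// Scenario is stored outside the codebase the coding agent works on.
type Scenario struct {
	Given, Should string
}

// Judge is a separate evaluator; in practice an LLM-as-judge with no
// shared context with the coding agent. It decides whether one recorded
// trajectory satisfies one scenario.
type Judge func(s Scenario, trajectory string) bool

// SatisfactionFraction measures quality probabilistically: the fraction
// of (scenario, trajectory) pairs the judge accepts.
func SatisfactionFraction(j Judge, scenarios []Scenario, trajectories []string) float64 {
	total, ok := 0, 0
	for _, s := range scenarios {
		for _, t := range trajectories {
			total++
			if j(s, t) {
				ok++
			}
		}
	}
	if total == 0 {
		return 0
	}
	return float64(ok) / float64(total)
}

func main() {
	scenarios := []Scenario{{Given: "expired reset link", Should: "offer a new link"}}
	// Stub judge for illustration; a real one calls a separate model.
	judge := func(s Scenario, trajectory string) bool { return trajectory == "ok" }
	fmt.Printf("satisfaction: %.2f\n",
		SatisfactionFraction(judge, scenarios, []string{"ok", "ok", "bad"}))
}
```

The structural point is the type signature: the coding agent never holds a `Scenario` value, so it has nothing to overfit to.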

Building Digital Twins

Jay Taylor’s approach (HackerNews):

  1. Dump the full public API documentation of the target service into the agent harness
  2. Have it build a self-contained imitation API in Go
  3. Use the top publicly available reference SDK client libraries as compatibility targets
  4. Goal: 100% API compatibility
  5. Validate against the live service until no behavioral differences remain

Result: run thousands of scenarios per hour, no rate limits, no costs, test dangerous failure modes safely.


Layer 4: Code Review and Security

At Level 5, no human reviews code. This layer is automated code integrity.

| Component | Choice | Notes |
|---|---|---|
| Automated PR review | CodeRabbit | 13M+ PRs processed; integrates with Claude Code CLI |
| Test generation | Qodo Gen | Agent-generated tests with codebase awareness |
| Security scanning | Semgrep | SAST + SCA + secrets; customizable for org policy |
| Additional security | Aikido Security | AutoTriage reduces noise; bundles multiple scanners |
| Agent policy enforcement | Leash | Runtime eBPF/LSM monitoring; Cedar policies; MCP observer |

The 40% vulnerability problem: More than 40% of AI-generated code contains vulnerabilities. Human review used to catch these. You need automated scanning that doesn’t rely on the agent self-reporting.

CodeRabbit CLI loop: Claude Code writes → CodeRabbit reviews → the agent iterates on the feedback → repeat. That is a basic Level 4–5 review cycle with no human involvement at any step.
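The control flow of that loop can be sketched as below; `runReview` and `runAgent` stand in for shelling out to the real CLIs, and the finding strings are invented for illustration:

```go
package main

import "fmt"

// reviewLoop iterates: review the current diff, feed findings back to the
// coding agent, and stop when the review comes back clean or the round
// budget is exhausted. Both callbacks would invoke real CLIs in practice.
func reviewLoop(
	runReview func() (findings []string),
	runAgent func(findings []string),
	maxRounds int,
) (clean bool, rounds int) {
	for rounds = 0; rounds < maxRounds; rounds++ {
		findings := runReview()
		if len(findings) == 0 {
			return true, rounds
		}
		runAgent(findings) // agent addresses the review feedback
	}
	return false, rounds
}

func main() {
	// Stubs: two rounds of findings, then a clean review.
	queue := [][]string{{"unchecked error"}, {"missing test"}, {}}
	i := 0
	review := func() []string { f := queue[i]; i++; return f }
	agent := func(findings []string) {} // no-op stand-in for the coding agent

	clean, rounds := reviewLoop(review, agent, 5)
	fmt.Println(clean, rounds)
}
```

The round budget matters: without it, an agent that cannot satisfy the reviewer loops forever, which is exactly the failure mode you want surfaced as an escalation.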


Layer 5: CI/CD

| Component | Choice | Notes |
|---|---|---|
| Pipeline | GitHub Actions | Standard. Volume is handled at this point. |
| Volume management | Parallel workers | 59% YoY increase in daily workflow runs (CircleCI 2026) |
| Deployment validation | Re-run scenario harness | Post-deploy holdout scenarios — same harness, new target |
| Self-healing | Emerging | DryRun Security closest; no clear winner yet |
| Production repair | Healer Agent | StrongDM pattern; not available as a product |

Layer 6: Monitoring and Repair

The full factory closes the loop on production issues.

| Component | Choice | Notes |
|---|---|---|
| Context storage | CXDB | Turn DAG + Blob CAS; O(1) conversation branching |
| Production telemetry | Standard observability stack | The agent needs enough signal to diagnose issues |
| Autonomous repair | Healer pattern | Diagnose → generate fix → validate → ship; requires full stack |

Who Can Actually Do This (February 2026)

Small teams (2–5 engineers) can reach Level 5 if they have:

  • Deep comfort with agent behavior and failure modes
  • Well-defined, modular problem domain (not a 15-year-old monolith)
  • Willingness to invest 2–4 months building the testing infrastructure
  • $500–$3,000/month per engineer in token compute budget

Who should target Level 3–4 instead:

  • Most teams. The productivity gains are real (25–30% with workflow redesign)
  • Teams without clear scenario harness investment capacity
  • Brownfield codebases where specs don’t exist yet
  • Teams where the bottleneck is product clarity, not implementation speed

The honest assessment: The testing problem is the hard part, and no commercial product solves it. StrongDM spent months building the DTU and scenario harness. That work is not optional — it’s what separates a working dark factory from a demo.

The good news: the spec, the principles, and the community implementations are public. Attractor’s architecture (DOT-based graph pipeline, goal gates, model stylesheet) is a complete blueprint. You don’t have to invent the methodology from scratch.


Further Reading

A curated list of all primary sources, technical deep-dives, and open-source tools.