The Full Dark Factory Stack
A layer-by-layer guide for engineers attempting to build a Level 5 software dark factory as of February 2026: specification, agent execution, testing (the hard part), code review, CI/CD, monitoring, and an honest assessment of who can actually do this. Based on StrongDM's published architecture, the SWE-bench leaderboard, and the current state of open-source and commercial tooling.
Honest prerequisite: Level 5 requires solving the gaming problem before removing human code review. Most teams should target Level 3–4. The gains at Level 3–4 are real and don’t require this infrastructure.
Layer 1: Specification
The only layer where humans author anything. Quality here determines everything downstream.
| Component | Choice | Notes |
|---|---|---|
| Cross-agent context | AGENTS.md + CLAUDE.md | One file each — minimal, maintained |
| Spec methodology | NLSpec (markdown) | Clear goals, constraints, scenarios to handle |
| Reference architecture | strongdm/attractor | The only public spec for a non-interactive factory agent |
| Spec-first IDE | Kiro | AWS IDE with requirements.md, design.md, tasks.md + Agent Hooks |
| Methodology framework | SPARC phases | Spec → Pseudocode → Architecture → Refinement → Completion |
| Writing guide | Addy Osmani | Goal-oriented language, three-tier permission model, conformance suites |
What makes a good spec: Clear enough that an agent can implement it without clarification. Explicit constraints (what not to do). Scenarios: “given X, the system should Y.” No implementation hints — let the agent make implementation choices.
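A hypothetical NLSpec fragment in this style; the feature, field names, and numbers below are invented for illustration, not taken from any real spec:

```markdown
## Goal
Users can revoke an API key and have it stop working promptly.

## Constraints
- Do not delete audit records for revoked keys.
- Do not introduce a new datastore; use the existing key table.

## Scenarios
- Given a revoked key, a request using it returns 401 within 60 seconds.
- Given an active key, revoking it twice is a no-op and returns success.
```

Note what is absent: no function names, no schema changes, no library choices. The agent owns those decisions.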
The brownfield problem: Legacy systems without documented behavior cannot be dark-factored until you reverse-engineer and document their implicit specs. This is unglamorous work. There is no shortcut.
Layer 2: Agent Execution
| Component | Choice | Notes |
|---|---|---|
| Primary agent | Claude Code (Opus 4.6) | #1–2 SWE-bench Verified (80.8%), native swarm mode, 1M context |
| Orchestration | Claude Agent SDK | Native multi-agent with team lead + workers |
| Open-source alt | OpenHands | Self-hostable, model-agnostic, MIT licensed |
| Open-source CLI alt | Aider | Git-native, 39K stars, model-agnostic |
Cost benchmark: StrongDM spends ~$1,000/engineer/day in tokens, and treats spending less as a sign you are leaving throughput on the table. At $3/M tokens (Sonnet 4.5), that's ~333M tokens/day per engineer. Expect $500–$3,000/month per engineer at steady state.
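The arithmetic behind that benchmark, as a quick sanity check you can adapt to your own model pricing (the dollar figures are the article's; the formula is just unit conversion):

```python
def tokens_per_day(daily_spend_usd: float, price_per_m_usd: float) -> float:
    """Tokens purchasable per day at a flat per-million-token price."""
    return daily_spend_usd / price_per_m_usd * 1_000_000

# $1,000/day at $3 per million tokens
daily = tokens_per_day(1_000, 3.0)
print(f"{daily / 1e6:.0f}M tokens/day")  # 333M tokens/day
```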
The agent doesn't matter as much as you think: StrongDM explicitly states Attractor could be swapped for any competent agent. The methodology is the differentiator: external scenarios, the digital twin universe (DTU), specification discipline. Start with whatever agent your team is most comfortable running autonomously.
Layer 3: Testing (The Hard Part)
This is the layer that separates Level 4 from Level 5. If you skip this layer, you don't have a dark factory; you have a factory that games its own tests.
| Component | Choice | Notes |
|---|---|---|
| Scenario harness | Custom (StrongDM model) | External holdout scenarios the agent never sees |
| Evaluation | LLM-as-judge | Separate agent evaluates behavior against scenarios |
| Ground truth | Public SDK compatibility | External reference the agent cannot game |
| Digital twins | Custom-built Go binaries | Clone third-party APIs from their public docs + SDK |
| Digital twins (approx.) | WireMock | MCP server generates API mocks from codebase scan — closest off-the-shelf option |
| Browser testing | Playwright + custom harness | Automated click-through verification |
The Gaming Problem
When you give an agent unit tests and ask it to pass them, it will find ways to pass them without solving the underlying problem. This is not a bug — it is the optimizer doing its job. You need to prevent it architecturally:
- Store scenarios outside the codebase — agents cannot see them during development
- Use a separate evaluation agent — complete isolation from the coding agent
- Use external SDK compatibility as ground truth — public reference the agent didn’t write
- Measure satisfaction probabilistically — “what fraction of trajectories likely satisfy the user?” not “did the tests pass?”
No commercial off-the-shelf product solves the gaming problem. This requires architectural design choices.
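The architectural shape those four rules imply can be sketched in a few lines. This is a minimal illustration, not StrongDM's harness: the judge below is a stub standing in for a separate LLM-as-judge agent, and the key properties are that scenarios never enter the coding agent's context and that the output is a satisfaction fraction, not a pass/fail bit:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Scenario:
    given: str   # setup the harness replays against the built system
    should: str  # expected behavior, in "given X, should Y" form

def satisfaction_rate(
    scenarios: list[Scenario],              # loaded from OUTSIDE the repo
    run_trajectory: Callable[[Scenario], str],  # exercise the system, capture transcript
    judge: Callable[[Scenario, str], bool],     # isolated evaluator (LLM-as-judge)
    trials: int = 5,                            # repeat: behavior is stochastic
) -> float:
    """Fraction of (scenario, trial) runs the independent judge accepts."""
    passed = total = 0
    for scenario in scenarios:
        for _ in range(trials):
            total += 1
            passed += judge(scenario, run_trajectory(scenario))
    return passed / total
```

The gate on merge is then a threshold on this fraction (e.g. `rate >= 0.95`), measured by code the implementing agent never touched.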
Building Digital Twins
Jay Taylor’s approach (HackerNews):
- Dump the full public API documentation of the target service into the agent harness
- Have it build a self-contained imitation API in Go
- Use the top publicly available reference SDK client libraries as compatibility targets
- Goal: 100% API compatibility
- Validate against the live service until no behavioral differences remain
Result: run thousands of scenarios per hour, no rate limits, no costs, test dangerous failure modes safely.
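The validation idea at the heart of a twin can be shown in miniature. StrongDM's twins are standalone Go binaries covering entire third-party APIs; the sketch below imitates a single endpoint of a hypothetical payments API (all names and response shapes invented) and checks it field-by-field against a response recorded from the live service, ignoring volatile fields like generated IDs:

```python
import json

class PaymentsTwin:
    """In-process imitation of POST /v1/charges for a hypothetical provider."""
    def __init__(self):
        self._seq = 0

    def create_charge(self, amount: int, currency: str) -> dict:
        if amount <= 0:
            # Reproduce the live API's error shape, not just the happy path.
            return {"error": {"type": "invalid_request", "param": "amount"}}
        self._seq += 1
        return {"id": f"ch_{self._seq}", "amount": amount,
                "currency": currency, "status": "succeeded"}

def compatible(twin_resp: dict, recorded: dict, volatile=("id",)) -> bool:
    """Field-by-field match, skipping fields that legitimately differ per call."""
    strip = lambda d: {k: v for k, v in d.items() if k not in volatile}
    return strip(twin_resp) == strip(recorded)

# A response captured from the live service during validation (fabricated here).
recorded = json.loads('{"id": "ch_live_9x", "amount": 500, '
                      '"currency": "usd", "status": "succeeded"}')
twin = PaymentsTwin()
print(compatible(twin.create_charge(500, "usd"), recorded))  # True
```

Scale this comparison across every endpoint and every recorded behavioral quirk, and "no behavioral differences remain" becomes a checkable condition rather than a judgment call.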
Layer 4: Code Review and Security
At Level 5, no human reviews code; this layer automates code integrity instead.
| Component | Choice | Notes |
|---|---|---|
| Automated PR review | CodeRabbit | 13M+ PRs processed; integrates with Claude Code CLI |
| Test generation | Qodo Gen | Agent-generated tests with codebase awareness |
| Security scanning | Semgrep | SAST + SCA + secrets; customizable for org policy |
| Additional security | Aikido Security | AutoTriage reduces noise; bundles multiple scanners |
| Agent policy enforcement | Leash | Runtime eBPF/LSM monitoring; Cedar policies; MCP observer |
The 40% vulnerability problem: More than 40% of AI-generated code contains vulnerabilities. Human review used to catch these. You need automated scanning that doesn’t rely on the agent self-reporting.
CodeRabbit CLI loop: claude code → CodeRabbit review → agent iterates on feedback → repeat — a basic Level 4–5 review cycle without human involvement at any step.
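The shape of that cycle, with the two tool invocations stubbed out. `run_coding_agent` and `run_review` are hypothetical stand-ins for the Claude Code and CodeRabbit CLI calls (real flags and invocation details vary); only the control flow is the point:

```python
from typing import Callable

def review_loop(
    run_coding_agent: Callable[[str], None],  # task or feedback -> new commit
    run_review: Callable[[], list[str]],      # -> findings; empty list = clean
    task: str,
    max_rounds: int = 5,
) -> bool:
    """Iterate coding agent + automated review until the review comes back clean."""
    run_coding_agent(task)
    for _ in range(max_rounds):
        findings = run_review()
        if not findings:
            return True   # clean review: ready to merge with no human in the loop
        run_coding_agent("Address review findings:\n" + "\n".join(findings))
    return False          # escalate: budget exhausted without a clean review
```

The `max_rounds` cap and the `False` escape hatch matter: an unbounded loop between two disagreeing agents is a failure mode you want to surface, not hide.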
Layer 5: CI/CD
| Component | Choice | Notes |
|---|---|---|
| Pipeline | GitHub Actions | Standard; the tooling is commodity. The challenge at this layer is run volume. |
| Volume management | Parallel workers | 59% YoY increase in daily workflow runs (CircleCI 2026) |
| Deployment validation | Re-run scenario harness | Post-deploy holdout scenarios — same harness, new target |
| Self-healing | Emerging | DryRun Security closest; no clear winner yet |
| Production repair | Healer Agent | StrongDM pattern; not available as a product |
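A hypothetical workflow for the deployment-validation row: re-run the holdout scenario harness against the freshly deployed environment. The harness entrypoint, threshold flag, and job names are placeholders, not a real product interface; only the `deployment_status` trigger and `environment_url` field are standard GitHub Actions:

```yaml
name: post-deploy-validation
on:
  deployment_status: {}
jobs:
  holdout-scenarios:
    if: github.event.deployment_status.state == 'success'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run holdout scenarios against the live target
        run: ./scenario-harness --target "$DEPLOY_URL" --min-satisfaction 0.95
        env:
          DEPLOY_URL: ${{ github.event.deployment_status.environment_url }}
```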
Layer 6: Monitoring and Repair
The full factory closes the loop on production issues.
| Component | Choice | Notes |
|---|---|---|
| Context storage | CXDB | Turn DAG + Blob CAS; O(1) conversation branching |
| Production telemetry | Standard observability stack | The agent needs enough signal to diagnose issues |
| Autonomous repair | Healer pattern | Diagnose → generate fix → validate → ship; requires full stack |
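The diagnose → generate fix → validate → ship loop, as a control-flow sketch. Each callable is a stub standing in for an agent call or pipeline step (StrongDM's Healer is not public, so this is an interpretation of the pattern, not its implementation); the essential property is that `validate` re-runs the same holdout harness before anything ships, and that unfixable or undiagnosable alerts escalate to a human:

```python
from typing import Callable, Optional

def heal(
    diagnose: Callable[[dict], Optional[str]],  # alert -> root-cause summary, or None
    generate_fix: Callable[[str], str],         # diagnosis -> candidate patch
    validate: Callable[[str], bool],            # patch -> passed holdout harness?
    ship: Callable[[str], None],                # deploy the validated patch
    alert: dict,
    max_attempts: int = 3,
) -> bool:
    diagnosis = diagnose(alert)
    if diagnosis is None:
        return False                            # can't diagnose: escalate to a human
    for _ in range(max_attempts):
        patch = generate_fix(diagnosis)
        if validate(patch):
            ship(patch)
            return True
    return False                                # fixes kept failing validation: escalate
```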
Who Can Actually Do This (February 2026)
Small teams (2–5 engineers) who can reach Level 5:
- Deep comfort with agent behavior and failure modes
- Well-defined, modular problem domain (not a 15-year-old monolith)
- Willingness to invest 2–4 months building the testing infrastructure
- $500–$3,000/month per engineer in token compute budget
Who should target Level 3–4 instead:
- Most teams. The productivity gains are real (25–30% with workflow redesign)
- Teams without clear scenario harness investment capacity
- Brownfield codebases where specs don’t exist yet
- Teams where the bottleneck is product clarity, not implementation speed
The honest assessment: The testing problem is the hard part, and no commercial product solves it. StrongDM spent months building the DTU and scenario harness. That work is not optional — it’s what separates a working dark factory from a demo.
The good news: the spec, the principles, and the community implementations are public. Attractor’s architecture (DOT-based graph pipeline, goal gates, model stylesheet) is a complete blueprint. You don’t have to invent the methodology from scratch.
Further Reading
See the companion Further Reading page for a curated list of all primary sources, technical deep-dives, and open-source tools.