The Full Dark Factory Stack
A layer-by-layer guide for engineers attempting to build a Level 5 software dark factory as of February 2026: specification, agent execution, testing (the hard part), code review, CI/CD, monitoring, and an honest assessment of who can actually do this. Based on StrongDM's published architecture, the SWE-bench leaderboard, and the current state of open-source and commercial tooling.
Honest prerequisite: Level 5 requires solving the gaming problem before removing human code review. Most teams should target Level 3–4. The gains at Level 3–4 are real and don’t require this infrastructure.
Layer 1: Specification
The only layer where humans author anything. Quality here determines everything downstream.
| Component | Choice | Notes |
|---|---|---|
| Cross-agent context | AGENTS.md + CLAUDE.md | One file each — minimal, maintained |
| Spec methodology | NLSpec (markdown) | Clear goals, constraints, scenarios to handle |
| Reference architecture | strongdm/attractor | The only public spec for a non-interactive factory agent |
| Spec-first IDE | Kiro | AWS IDE with requirements.md, design.md, tasks.md + Agent Hooks |
| Methodology framework | SPARC phases | Spec → Pseudocode → Architecture → Refinement → Completion |
| Writing guide | Addy Osmani | Goal-oriented language, three-tier permission model, conformance suites |
What makes a good spec: Clear enough that an agent can implement it without clarification. Explicit constraints (what not to do). Scenarios: “given X, the system should Y.” No implementation hints — let the agent make implementation choices.
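A hypothetical NLSpec fragment in this style; the feature, field names, and numbers below are invented for illustration, not taken from any real spec:

```markdown
## Goal
Users can revoke an API key and have it stop working promptly.

## Constraints
- Do not delete audit records for revoked keys.
- Do not introduce a new datastore; use the existing key table.

## Scenarios
- Given a revoked key, a request using it returns 401 within 60 seconds.
- Given an active key, revoking it twice is a no-op and returns success.
```

Note what is absent: no function names, no schema changes, no library choices. The agent owns those decisions.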
The brownfield problem: Legacy systems without documented behavior cannot be dark-factored until you reverse-engineer and document their implicit specs. This is unglamorous work. There is no shortcut.
Layer 2: Agent Execution
| Component | Choice | Notes |
|---|---|---|
| Primary agent | Claude Code (Opus 4.6) | #1–2 SWE-bench Verified (80.8%), native swarm mode, 1M context |
| Orchestration | Claude Agent SDK | Native multi-agent with team lead + workers |
| Open-source alt | OpenHands | Self-hostable, model-agnostic, MIT licensed |
| Open-source CLI alt | Aider | Git-native, 39K stars, model-agnostic |
Cost benchmark: StrongDM spends ~$1,000/engineer/day in tokens, and treats spending less as a sign you are leaving throughput on the table. At $3/M tokens (Sonnet 4.5), that's ~333M tokens/day per engineer. Expect $500–$3,000/month per engineer at steady state.
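The arithmetic behind that benchmark, as a quick sanity check you can adapt to your own model pricing (the dollar figures are the article's; the formula is just unit conversion):

```python
def tokens_per_day(daily_spend_usd: float, price_per_m_usd: float) -> float:
    """Tokens purchasable per day at a flat per-million-token price."""
    return daily_spend_usd / price_per_m_usd * 1_000_000

# $1,000/day at $3 per million tokens
daily = tokens_per_day(1_000, 3.0)
print(f"{daily / 1e6:.0f}M tokens/day")  # 333M tokens/day
```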
The agent doesn't matter as much as you think: StrongDM explicitly states Attractor could be swapped for any competent agent. The methodology is the differentiator: external scenarios, the digital twin universe (DTU), specification discipline. Start with whatever agent your team is most comfortable running autonomously.
Layer 3: Testing (The Hard Part)
This is the layer that separates Level 4 from Level 5. If you skip this layer, you don't have a dark factory; you have a factory that games its own tests.
| Component | Choice | Notes |
|---|---|---|
| Scenario harness | Custom (StrongDM model) | External holdout scenarios the agent never sees |
| Evaluation | LLM-as-judge | Separate agent evaluates behavior against scenarios |
| Ground truth | Public SDK compatibility | External reference the agent cannot game |
| Digital twins | Custom-built Go binaries | Clone third-party APIs from their public docs + SDK |
| Digital twins (approx.) | WireMock | MCP server generates API mocks from codebase scan — closest off-the-shelf option |
| Browser testing | Playwright + custom harness | Automated click-through verification |
The Gaming Problem
When you give an agent unit tests and ask it to pass them, it will find ways to pass them without solving the underlying problem. This is not a bug — it is the optimizer doing its job. You need to prevent it architecturally:
- Store scenarios outside the codebase — agents cannot see them during development
- Use a separate evaluation agent — complete isolation from the coding agent
- Use external SDK compatibility as ground truth — public reference the agent didn’t write
- Measure satisfaction probabilistically — “what fraction of trajectories likely satisfy the user?” not “did the tests pass?”
No commercial off-the-shelf product solves the gaming problem. This requires architectural design choices.
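The architectural shape those four rules imply can be sketched in a few lines. This is a minimal illustration, not StrongDM's harness: the judge below is a stub standing in for a separate LLM-as-judge agent, and the key properties are that scenarios never enter the coding agent's context and that the output is a satisfaction fraction, not a pass/fail bit:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Scenario:
    given: str   # setup the harness replays against the built system
    should: str  # expected behavior, in "given X, should Y" form

def satisfaction_rate(
    scenarios: list[Scenario],              # loaded from OUTSIDE the repo
    run_trajectory: Callable[[Scenario], str],  # exercise the system, capture transcript
    judge: Callable[[Scenario, str], bool],     # isolated evaluator (LLM-as-judge)
    trials: int = 5,                            # repeat: behavior is stochastic
) -> float:
    """Fraction of (scenario, trial) runs the independent judge accepts."""
    passed = total = 0
    for scenario in scenarios:
        for _ in range(trials):
            total += 1
            passed += judge(scenario, run_trajectory(scenario))
    return passed / total
```

The gate on merge is then a threshold on this fraction (e.g. `rate >= 0.95`), measured by code the implementing agent never touched.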
Building Digital Twins
Jay Taylor’s approach (HackerNews):
- Dump the full public API documentation of the target service into the agent harness
- Have it build a self-contained imitation API in Go
- Use the top publicly available reference SDK client libraries as compatibility targets
- Goal: 100% API compatibility
- Validate against the live service until no behavioral differences remain
Result: run thousands of scenarios per hour, no rate limits, no costs, test dangerous failure modes safely.
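The validation idea at the heart of a twin can be shown in miniature. StrongDM's twins are standalone Go binaries covering entire third-party APIs; the sketch below imitates a single endpoint of a hypothetical payments API (all names and response shapes invented) and checks it field-by-field against a response recorded from the live service, ignoring volatile fields like generated IDs:

```python
import json

class PaymentsTwin:
    """In-process imitation of POST /v1/charges for a hypothetical provider."""
    def __init__(self):
        self._seq = 0

    def create_charge(self, amount: int, currency: str) -> dict:
        if amount <= 0:
            # Reproduce the live API's error shape, not just the happy path.
            return {"error": {"type": "invalid_request", "param": "amount"}}
        self._seq += 1
        return {"id": f"ch_{self._seq}", "amount": amount,
                "currency": currency, "status": "succeeded"}

def compatible(twin_resp: dict, recorded: dict, volatile=("id",)) -> bool:
    """Field-by-field match, skipping fields that legitimately differ per call."""
    strip = lambda d: {k: v for k, v in d.items() if k not in volatile}
    return strip(twin_resp) == strip(recorded)

# A response captured from the live service during validation (fabricated here).
recorded = json.loads('{"id": "ch_live_9x", "amount": 500, '
                      '"currency": "usd", "status": "succeeded"}')
twin = PaymentsTwin()
print(compatible(twin.create_charge(500, "usd"), recorded))  # True
```

Scale this comparison across every endpoint and every recorded behavioral quirk, and "no behavioral differences remain" becomes a checkable condition rather than a judgment call.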
Layer 4: Code Review and Security
At Level 5, no human reviews code; this layer automates code integrity instead.
| Component | Choice | Notes |
|---|---|---|
| Automated PR review | CodeRabbit | 13M+ PRs processed; integrates with Claude Code CLI |
| Test generation | Qodo Gen | Agent-generated tests with codebase awareness |
| Security scanning | Semgrep | SAST + SCA + secrets; customizable for org policy |
| Additional security | Aikido Security | AutoTriage reduces noise; bundles multiple scanners |
| Agent policy enforcement | Leash | Runtime eBPF/LSM monitoring; Cedar policies; MCP observer |
The 40% vulnerability problem: More than 40% of AI-generated code contains vulnerabilities. Human review used to catch these. You need automated scanning that doesn’t rely on the agent self-reporting.
CodeRabbit CLI loop: claude code → CodeRabbit review → agent iterates on feedback → repeat — a basic Level 4–5 review cycle without human involvement at any step.
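The shape of that cycle, with the two tool invocations stubbed out. `run_coding_agent` and `run_review` are hypothetical stand-ins for the Claude Code and CodeRabbit CLI calls (real flags and invocation details vary); only the control flow is the point:

```python
from typing import Callable

def review_loop(
    run_coding_agent: Callable[[str], None],  # task or feedback -> new commit
    run_review: Callable[[], list[str]],      # -> findings; empty list = clean
    task: str,
    max_rounds: int = 5,
) -> bool:
    """Iterate coding agent + automated review until the review comes back clean."""
    run_coding_agent(task)
    for _ in range(max_rounds):
        findings = run_review()
        if not findings:
            return True   # clean review: ready to merge with no human in the loop
        run_coding_agent("Address review findings:\n" + "\n".join(findings))
    return False          # escalate: budget exhausted without a clean review
```

The `max_rounds` cap and the `False` escape hatch matter: an unbounded loop between two disagreeing agents is a failure mode you want to surface, not hide.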
Layer 5: CI/CD
| Component | Choice | Notes |
|---|---|---|
| Pipeline | GitHub Actions | Standard; the tooling is commodity. The challenge at this layer is run volume. |
| Volume management | Parallel workers | 59% YoY increase in daily workflow runs (CircleCI 2026) |
| Deployment validation | Re-run scenario harness | Post-deploy holdout scenarios — same harness, new target |
| Self-healing | Emerging | DryRun Security closest; no clear winner yet |
| Production repair | Healer Agent | StrongDM pattern; not available as a product |
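A hypothetical workflow for the deployment-validation row: re-run the holdout scenario harness against the freshly deployed environment. The harness entrypoint, threshold flag, and job names are placeholders, not a real product interface; only the `deployment_status` trigger and `environment_url` field are standard GitHub Actions:

```yaml
name: post-deploy-validation
on:
  deployment_status: {}
jobs:
  holdout-scenarios:
    if: github.event.deployment_status.state == 'success'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run holdout scenarios against the live target
        run: ./scenario-harness --target "$DEPLOY_URL" --min-satisfaction 0.95
        env:
          DEPLOY_URL: ${{ github.event.deployment_status.environment_url }}
```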
Layer 6: Monitoring and Repair
The full factory closes the loop on production issues.
| Component | Choice | Notes |
|---|---|---|
| Context storage | CXDB | Turn DAG + Blob CAS; O(1) conversation branching |
| Production telemetry | Standard observability stack | The agent needs enough signal to diagnose issues |
| Autonomous repair | Healer pattern | Diagnose → generate fix → validate → ship; requires full stack |
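The diagnose → generate fix → validate → ship loop, as a control-flow sketch. Each callable is a stub standing in for an agent call or pipeline step (StrongDM's Healer is not public, so this is an interpretation of the pattern, not its implementation); the essential property is that `validate` re-runs the same holdout harness before anything ships, and that unfixable or undiagnosable alerts escalate to a human:

```python
from typing import Callable, Optional

def heal(
    diagnose: Callable[[dict], Optional[str]],  # alert -> root-cause summary, or None
    generate_fix: Callable[[str], str],         # diagnosis -> candidate patch
    validate: Callable[[str], bool],            # patch -> passed holdout harness?
    ship: Callable[[str], None],                # deploy the validated patch
    alert: dict,
    max_attempts: int = 3,
) -> bool:
    diagnosis = diagnose(alert)
    if diagnosis is None:
        return False                            # can't diagnose: escalate to a human
    for _ in range(max_attempts):
        patch = generate_fix(diagnosis)
        if validate(patch):
            ship(patch)
            return True
    return False                                # fixes kept failing validation: escalate
```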
Who Can Actually Do This (February 2026)
Small teams (2–5 engineers) who can reach Level 5:
- Deep comfort with agent behavior and failure modes
- Well-defined, modular problem domain (not a 15-year-old monolith)
- Willingness to invest 2–4 months building the testing infrastructure
- $500–$3,000/month per engineer in token compute budget
Who should target Level 3–4 instead:
- Most teams. The productivity gains are real (25–30% with workflow redesign)
- Teams without clear scenario harness investment capacity
- Brownfield codebases where specs don’t exist yet
- Teams where the bottleneck is product clarity, not implementation speed
The honest assessment: The testing problem is the hard part, and no commercial product solves it. StrongDM spent months building the DTU and scenario harness. That work is not optional — it’s what separates a working dark factory from a demo.
The good news: the spec, the principles, and the community implementations are public. Attractor’s architecture (DOT-based graph pipeline, goal gates, model stylesheet) is a complete blueprint. You don’t have to invent the methodology from scratch.
Further Reading
See the companion Further Reading page for a curated list of all primary sources, technical deep-dives, and open-source tools.