A goal-prompt writer plus a four-agent planner / maker / prover / checker loop. It turns a task into an autonomous run that grades its own work honestly - because the model that wrote the code never scores it.
One model writing and grading its own output grades generously. The harness splits the work across agents with separate context and separate tools, so verification is never done by the author.
An agent that just wrote 300 lines will call them correct. Its "PASS" carries no information - it saw its own reasoning and is anchored to it.
A mechanical gate (grep / parse / exit-code - no LLM) runs first. Then a fresh-context Checker that never saw the maker's work scores against a rubric. It opens with "I did not write this."
Every task is routed on two independent axes. Picking wrong wastes turns or under-delivers.
Where it runs. In-session harness (<1hr, needs decisions) · gnhf autonomous (>1hr, unattended) · parallel gnhf + treehouse (many streams). Always registered in tasks-axi first.
What shape the work is. First-match-wins over four modes - stop at the first yes.
Bounded, one pass, correctness obvious on read.
Measurable bar one pass won't reliably hit. Iterate to it.
Recurring, or watching state you don't control.
50+ items, or output needs adversarial verification.
Each agent has its own context and a deliberately narrow toolset. The Planner never makes; the Checker never runs code.
Two complementary checks before the Checker scores. Prover proves it works; red-team proves it doesn't break.
Actually exercises the feature - CLI command, curl, or browser - and returns works / broken with pasted output. No screenshot, no verdict.
Four attack roles in parallel - hostile user, careless user, performance, security - each hunts only its own angle from entry to failure. A barrier merges them worst-first, deduped across roles. Critical/high holes get fixed before scoring.
Up to 3 cycles. The spec is fixed - only the output changes. Done = gate green and Checker mean ≥ 3.5 / 5.0.
For >1hr unattended work, a single detached call launches an overnight run that survives the session closing - model pinned to Opus by config.
/goal rejects any condition ≥ 4000 chars. A script counts exactly what /goal counts (UTF-16 length after stripping one newline) and exits non-zero at ≥ 3990 - a failed gate is a failed command, not a judgment call.
The repo copy is the source of truth. Every changed skill, agent, and workflow syncs repo → global (~/.claude) so a run behaves identically everywhere - reconciled as a superset, never blind-overwritten.