SKILL.md


When NOT to Use

Quick single-file fixes (use standard claude -p)

Tasks with tight budget constraints (<$10)

Simple refactoring (use the de-sloppify pattern instead)

Tasks that are already well-specified with tests (use a TDD workflow)

Architecture

                    ┌─────────────┐
                    │   PLANNER   │
                    │  (Opus 4.6) │
                    └──────┬──────┘
                           │ Product Spec
                           │ (features, sprints, design direction)
                           ▼
              ┌────────────────────────┐
              │                        │
              │   GENERATOR-EVALUATOR  │
              │      FEEDBACK LOOP     │
              │                        │
              │  ┌──────────┐          │
              │  │GENERATOR │--build-->│──┐
              │  │(Opus 4.6)│          │  │
              │  └────▲─────┘          │  │
              │       │                │  │ live app
              │    feedback            │  │
              │       │                │  │
              │  ┌────┴─────┐          │  │
              │  │EVALUATOR │<-test----│──┘
              │  │(Opus 4.6)│          │
              │  │+Playwright│         │
              │  └──────────┘          │
              │                        │
              │   5-15 iterations      │
              └────────────────────────┘

The Three Agents

1. Planner Agent

Role: Product manager — expands a brief prompt into a full product specification.

Key behaviors:

Takes a one-line prompt and produces a 16-feature, multi-sprint specification

Defines user stories, technical requirements, and visual design direction

Is deliberately ambitious — conservative planning leads to underwhelming results

Produces evaluation criteria that the Evaluator will use later

Model: Opus 4.6 (needs deep reasoning for spec expansion)

2. Generator Agent

Role: Developer — implements features according to the spec.

Key behaviors:

Works in structured sprints (or continuous mode with newer models)

Negotiates a "sprint contract" with the Evaluator before writing code

Uses full-stack tooling: React, FastAPI/Express, databases, CSS

Manages git for version control between iterations

Reads Evaluator feedback and incorporates it in the next iteration

Model: Opus 4.6 (needs strong coding capability)

3. Evaluator Agent

Role: QA engineer — tests the live running application, not just code.

Key behaviors:

Uses Playwright MCP to interact with the live application

Clicks through features, fills forms, tests API endpoints

Scores against four criteria (configurable):

Design Quality — Does it feel like a coherent whole?

Originality — Custom decisions vs. template/AI patterns?

Craft — Typography, spacing, animations, micro-interactions?

Functionality — Do all features actually work?

Returns structured feedback with scores and specific issues

Is engineered to be ruthlessly strict — never praises mediocre work

Model: Opus 4.6 (needs strong judgment + tool use)
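The "structured feedback with scores and specific issues" could take a shape like the one below. This is only a sketch: the `Feedback` class, its field names, and the markdown layout are hypothetical and not the harness's actual schema. The only detail taken from this document is the `feedback-NNN.md` file naming.

```python
from dataclasses import dataclass, field


@dataclass
class Feedback:
    """One evaluator report (hypothetical structure, for illustration only)."""
    iteration: int
    scores: dict[str, float]                      # criterion name -> score on the 1-10 scale
    issues: list[str] = field(default_factory=list)

    def filename(self) -> str:
        # Zero-padded to three digits, matching the feedback-001.md naming.
        return f"feedback-{self.iteration:03d}.md"

    def to_markdown(self) -> str:
        lines = [f"# Feedback: iteration {self.iteration}", "", "## Scores"]
        lines += [f"- {name}: {score:g}/10" for name, score in self.scores.items()]
        lines += ["", "## Issues"]
        # Fall back to a placeholder bullet when the evaluator reported nothing.
        lines += [f"- {issue}" for issue in self.issues] or ["- (none reported)"]
        return "\n".join(lines) + "\n"
```

Writing each report to its own numbered file keeps the Generator's input explicit: it reads the latest file at the start of an iteration instead of relying on inline conversation history.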

Evaluation Criteria

The default four criteria, each scored 1-10:

## Evaluation Rubric

### Design Quality (weight: 0.3)

- 1-3: Generic, template-like, "AI slop" aesthetics

- 4-6: Competent but unremarkable, follows conventions

- 7-8: Distinctive, cohesive visual identity

- 9-10: Could pass for a professional designer's work

### Originality (weight: 0.2)

- 1-3: Default colors, stock layouts, no personality

- 4-6: Some custom choices, mostly standard patterns

- 7-8: Clear creative vision, unique approach

- 9-10: Surprising, delightful, genuinely novel

### Craft (weight: 0.3)

- 1-3: Broken layouts, missing states, no animations

- 4-6: Works but feels rough, inconsistent spacing

- 7-8: Polished, smooth transitions, responsive

- 9-10: Pixel-perfect, delightful micro-interactions

### Functionality (weight: 0.2)

- 1-3: Core features broken or missing

- 4-6: Happy path works, edge cases fail

- 7-8: All features work, good error handling

- 9-10: Bulletproof, handles every edge case

Scoring

Weighted score = sum of (criterion_score * weight)

Pass threshold = 7.0 (configurable)

Max iterations = 15 (configurable; 5-15 is typically sufficient)
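The scoring rule can be sketched in a few lines. The weights come from the rubric above and the criterion keys follow the default `GAN_EVAL_CRITERIA` names. The function names, and the assumption that a score exactly equal to the threshold passes, are illustrative rather than taken from the harness.

```python
# Default weights from the rubric; they sum to 1.0.
DEFAULT_WEIGHTS = {
    "design": 0.3,
    "originality": 0.2,
    "craft": 0.3,
    "functionality": 0.2,
}


def weighted_score(scores, weights=DEFAULT_WEIGHTS):
    """Return sum(criterion_score * weight) over every weighted criterion."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(scores[name] * weight for name, weight in weights.items())


def passes(scores, threshold=7.0):
    """True when the weighted score meets the threshold (assumed inclusive)."""
    # Small tolerance so float rounding does not fail an exact 7.0.
    return weighted_score(scores) >= threshold - 1e-9


example = {"design": 8, "originality": 6, "craft": 7, "functionality": 7}
print(round(weighted_score(example), 2), passes(example))  # 2.4 + 1.2 + 2.1 + 1.4 = 7.1, passes
```

Because Design Quality and Craft carry 0.3 each, an app scoring 8 on Functionality but 6 on Design Quality and Originality and 7 on Craft lands at 6.7 and does not pass.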

Usage

Via Command

# Full three-agent harness

/project:gan-build "Build a project management app with Kanban boards, team collaboration, and dark mode"

# With custom config

/project:gan-build "Build a recipe sharing platform" --max-iterations 10 --pass-threshold 7.5

# Frontend design mode (generator + evaluator only, no planner)

/project:gan-design "Create a landing page for a crypto portfolio tracker"

Via Shell Script

# Basic usage

./scripts/gan-harness.sh "Build a music streaming dashboard"

# With options

GAN_MAX_ITERATIONS=10 \
GAN_PASS_THRESHOLD=7.5 \
GAN_EVAL_CRITERIA="functionality,performance,security" \
./scripts/gan-harness.sh "Build a REST API for task management"

Via Claude Code (Manual)

# Step 1: Plan

claude -p --model opus "You are a Product Planner. Read PLANNER_PROMPT.md. Expand this brief into a full product spec: 'Build a Kanban board app'. Write spec to spec.md"

# Step 2: Generate (iteration 1)

claude -p --model opus "You are a Generator. Read spec.md. Implement Sprint 1. Start the dev server on port 3000."

# Step 3: Evaluate (iteration 1)

claude -p --model opus --allowedTools "Read,Bash,mcp__playwright__*" "You are an Evaluator. Read EVALUATOR_PROMPT.md. Test the live app at http://localhost:3000. Score against the rubric. Write feedback to feedback-001.md"

# Step 4: Generate (iteration 2 — reads feedback)

claude -p --model opus "You are a Generator. Read spec.md and feedback-001.md. Address all issues. Improve the scores."

# Repeat steps 3-4 until pass threshold met
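The repeat-until-pass control flow of steps 2-4 might be driven by a loop like this sketch. `run_loop` and its two callables are hypothetical: in practice `generate` and `evaluate` would wrap the `claude -p` invocations shown above, but here they are injected so the control flow can be read and tested on its own.

```python
from typing import Callable, Optional


def run_loop(
    generate: Callable[[int, Optional[str]], None],
    evaluate: Callable[[int], tuple[float, str]],
    pass_threshold: float = 7.0,
    max_iterations: int = 15,
) -> tuple[bool, int, float]:
    """Alternate generate/evaluate until the score passes or iterations run out.

    generate(iteration, feedback_path) builds or revises the app.
    evaluate(iteration) returns (weighted_score, feedback_path).
    Returns (passed, iterations_used, last_score).
    """
    feedback_path: Optional[str] = None  # no feedback exists before iteration 1
    score = 0.0
    for iteration in range(1, max_iterations + 1):
        generate(iteration, feedback_path)           # steps 2 and 4
        score, feedback_path = evaluate(iteration)   # step 3
        if score >= pass_threshold:
            return True, iteration, score
    return False, max_iterations, score


# Usage with stand-in agents that replay a scripted score sequence.
scripted_scores = iter([5.5, 6.4, 7.2])
result = run_loop(
    generate=lambda i, fb: None,
    evaluate=lambda i: (next(scripted_scores), f"feedback-{i:03d}.md"),
)
print(result)  # (True, 3, 7.2)
```

Keeping the evaluator's output as a file path, not an inline string, mirrors the manual workflow: each Generator run is told which feedback file to read.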

Evolution Across Model Capabilities

The harness should simplify as models improve. The three stages below follow the evolution Anthropic describes:

Stage 1 — Weaker Models (Sonnet-class)

Full sprint decomposition required

Context resets between sprints (avoid context anxiety)

2-agent minimum: Initializer + Coding Agent

Heavy scaffolding compensates for model limitations

Stage 2 — Capable Models (Opus 4.5-class)

Full 3-agent harness: Planner + Generator + Evaluator

Sprint contracts before each implementation phase

10-sprint decomposition for complex apps

Context resets still useful but less critical

Stage 3 — Frontier Models (Opus 4.6-class)

Simplified harness: single planning pass, continuous generation

Evaluation reduced to a single pass at the end (the model needs less mid-run correction)

No sprint structure needed

Automatic compaction handles context growth

Key principle: Every harness component encodes an assumption about what the model can't do alone. When models improve, re-test those assumptions. Strip away what's no longer needed.

Configuration

Environment Variables

| Variable | Default | Description |
|---|---|---|
| GAN_MAX_ITERATIONS | 15 | Maximum generator-evaluator cycles |
| GAN_PASS_THRESHOLD | 7.0 | Weighted score to pass (1-10) |
| GAN_PLANNER_MODEL | opus | Model for planning agent |
| GAN_GENERATOR_MODEL | opus | Model for generator agent |
| GAN_EVALUATOR_MODEL | opus | Model for evaluator agent |
| GAN_EVAL_CRITERIA | design,originality,craft,functionality | Comma-separated criteria |
| GAN_DEV_SERVER_PORT | 3000 | Port for the live app |
| GAN_DEV_SERVER_CMD | npm run dev | Command to start dev server |
| GAN_PROJECT_DIR | . | Project working directory |
| GAN_SKIP_PLANNER | false | Skip planner, use spec directly |
| GAN_EVAL_MODE | playwright | playwright, screenshot, or code-only |
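A wrapper written in Python could read these variables along the lines of the sketch below. `GanConfig` and `load_config` are hypothetical names. The sketch covers only a subset of the variables, and it assumes `GAN_SKIP_PLANNER` is enabled by the literal string `true`. Variable names and defaults come from the table.

```python
import os
from dataclasses import dataclass

VALID_EVAL_MODES = ("playwright", "screenshot", "code-only")


@dataclass(frozen=True)
class GanConfig:
    max_iterations: int
    pass_threshold: float
    eval_criteria: tuple
    dev_server_port: int
    skip_planner: bool
    eval_mode: str


def load_config(env=None) -> GanConfig:
    """Read GAN_* variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    mode = env.get("GAN_EVAL_MODE", "playwright")
    if mode not in VALID_EVAL_MODES:
        raise ValueError(f"GAN_EVAL_MODE must be one of {VALID_EVAL_MODES}, got {mode!r}")
    criteria = env.get("GAN_EVAL_CRITERIA", "design,originality,craft,functionality")
    return GanConfig(
        max_iterations=int(env.get("GAN_MAX_ITERATIONS", "15")),
        pass_threshold=float(env.get("GAN_PASS_THRESHOLD", "7.0")),
        # Split the comma-separated list and drop empty entries.
        eval_criteria=tuple(c.strip() for c in criteria.split(",") if c.strip()),
        dev_server_port=int(env.get("GAN_DEV_SERVER_PORT", "3000")),
        # Assumption: only the string "true" (any case) enables the flag.
        skip_planner=env.get("GAN_SKIP_PLANNER", "false").strip().lower() == "true",
        eval_mode=mode,
    )
```

Passing a plain dict as `env` makes the parser easy to exercise without touching the real process environment, for example `load_config({"GAN_MAX_ITERATIONS": "10"})`.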

Evaluation Modes

| Mode | Tools | Best For |
|---|---|---|
| playwright | Browser MCP + live interaction | Full-stack apps with UI |
| screenshot | Screenshot + visual analysis | Static sites, design-only |
| code-only | Tests + linting + build | APIs, libraries, CLI tools |

Anti-Patterns

Evaluator too lenient — If the evaluator passes everything on iteration 1, your rubric is too generous. Tighten scoring criteria and add explicit penalties for common AI patterns.

Generator ignoring feedback — Ensure feedback is passed as a file, not inline. The generator should read feedback-NNN.md at the start of each iteration.

Infinite loops — Always set GAN_MAX_ITERATIONS. If the generator can't improve past a score plateau after 3 iterations, stop and flag for human review.

Evaluator testing superficially — The evaluator must use Playwright to interact with the live app, not just screenshot it. Click buttons, fill forms, test error states.

Evaluator praising its own fixes — Never let the evaluator suggest fixes and then evaluate those fixes. The evaluator only critiques; the generator fixes.

Context exhaustion — For long sessions, use Claude Agent SDK's automatic compaction or reset context between major phases.
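The plateau rule from the "Infinite loops" item can be made concrete with a small check run after each evaluation. This is a sketch under assumptions: the document does not define what counts as a plateau, so `min_gain` (the smallest improvement that counts as progress) and the function name are illustrative. Only the window of 3 iterations comes from the text.

```python
def has_plateaued(scores, patience=3, min_gain=0.1):
    """True when none of the last `patience` scores beat the earlier best by min_gain.

    scores is the list of weighted scores so far, oldest first.
    """
    if len(scores) <= patience:
        return False  # not enough history to judge
    best_before = max(scores[:-patience])
    return max(scores[-patience:]) < best_before + min_gain


print(has_plateaued([5.0, 6.2, 6.2, 6.1, 6.2]))  # True: stuck at 6.2 for three iterations
print(has_plateaued([5.0, 5.8, 6.4, 6.9]))       # False: still improving
```

When the check returns true, the loop should stop and hand the latest feedback file to a human, even if `GAN_MAX_ITERATIONS` has not been reached.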

Results: What to Expect

Based on Anthropic's published results:

| Metric | Solo Agent | GAN Harness | Improvement |
|---|---|---|---|
| Time | 20 min | 4-6 hours | 12-18x longer |
| Cost | | $125-200 | 14-22x more |
| Quality | Barely functional | Production-ready | Phase change |
| Core features | Broken | All working | N/A |
| Design | Generic AI slop | Distinctive, polished | N/A |

The tradeoff is clear: roughly 12-22x more time and cost for a qualitative leap in output quality. Reserve the harness for projects where that quality justifies the spend.

References

Anthropic: Harness Design for Long-Running Apps — Original paper by Prithvi Rajasekaran

Epsilla: The GAN-Style Agent Loop — Architecture deconstruction

Martin Fowler: Harness Engineering — Broader industry context

OpenAI: Harness Engineering — OpenAI's parallel work
