gan-style-harness

GAN-inspired Generator-Evaluator agent harness for building high-quality applications autonomously. Based on Anthropic's March 2026 harness design paper.

INSTALLATION
npx skills add https://github.com/affaan-m/everything-claude-code --skill gan-style-harness
Run in your project or agent environment. Adjust flags if your CLI version differs.

SKILL.md

When NOT to Use

  • Quick single-file fixes (use standard claude -p)
  • Tasks with tight budget constraints (<$10)
  • Simple refactoring (use de-sloppify pattern instead)
  • Tasks that are already well-specified with tests (use TDD workflow)

Architecture

                    ┌─────────────┐
                    │   PLANNER   │
                    │  (Opus 4.6) │
                    └──────┬──────┘
                           │ Product Spec
                           │ (features, sprints, design direction)
                           ▼
              ┌────────────────────────┐
              │                        │
              │   GENERATOR-EVALUATOR  │
              │      FEEDBACK LOOP     │
              │                        │
              │  ┌──────────┐          │
              │  │GENERATOR │--build-->│──┐
              │  │(Opus 4.6)│          │  │
              │  └────▲─────┘          │  │
              │       │                │  │ live app
              │    feedback            │  │
              │       │                │  │
              │  ┌────┴─────┐          │  │
              │  │EVALUATOR │<-test----│──┘
              │  │(Opus 4.6)│          │
              │  │+Playwright│         │
              │  └──────────┘          │
              │                        │
              │   5-15 iterations      │
              └────────────────────────┘

The Three Agents

1. Planner Agent

Role: Product manager — expands a brief prompt into a full product specification.

Key behaviors:

  • Takes a one-line prompt and produces a 16-feature, multi-sprint specification
  • Defines user stories, technical requirements, and visual design direction
  • Is deliberately ambitious — conservative planning leads to underwhelming results
  • Produces evaluation criteria that the Evaluator will use later

Model: Opus 4.6 (needs deep reasoning for spec expansion)

2. Generator Agent

Role: Developer — implements features according to the spec.

Key behaviors:

  • Works in structured sprints (or continuous mode with newer models)
  • Negotiates a "sprint contract" with the Evaluator before writing code
  • Uses full-stack tooling: React, FastAPI/Express, databases, CSS
  • Manages git for version control between iterations
  • Reads Evaluator feedback and incorporates it in the next iteration

Model: Opus 4.6 (needs strong coding capability)

3. Evaluator Agent

Role: QA engineer — tests the live running application, not just code.

Key behaviors:

  • Uses Playwright MCP to interact with the live application
  • Clicks through features, fills forms, tests API endpoints
  • Scores against four criteria (configurable):
      ◦ Design Quality: Does it feel like a coherent whole?
      ◦ Originality: Custom decisions vs. template/AI patterns?
      ◦ Craft: Typography, spacing, animations, micro-interactions?
      ◦ Functionality: Do all features actually work?
  • Returns structured feedback with scores and specific issues
  • Is engineered to be ruthlessly strict — never praises mediocre work

Model: Opus 4.6 (needs strong judgment + tool use)
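
The "structured feedback with scores and specific issues" could take roughly the following shape. This is a minimal sketch, not the skill's actual format: the `Feedback` class, its field names, and the Markdown layout are all assumptions made for illustration. Only the zero-padded `feedback-NNN.md` naming comes from this document.

```python
from dataclasses import dataclass, field


@dataclass
class Feedback:
    """One evaluator report. Every field name here is an assumption."""
    iteration: int
    scores: dict                                  # criterion name -> score on the 1-10 scale
    issues: list = field(default_factory=list)    # specific, actionable problems

    def filename(self) -> str:
        # Zero-padded to match the feedback-NNN.md naming used in this document.
        return f"feedback-{self.iteration:03d}.md"

    def to_markdown(self) -> str:
        lines = [f"# Evaluator feedback, iteration {self.iteration}", "", "## Scores"]
        lines += [f"- {name}: {score}/10" for name, score in self.scores.items()]
        lines += ["", "## Issues"]
        # Fall back to an explicit marker so an empty report is never ambiguous.
        lines += [f"- {issue}" for issue in self.issues] or ["- none recorded"]
        return "\n".join(lines) + "\n"


report = Feedback(
    iteration=1,
    scores={"design": 5, "originality": 4, "craft": 5, "functionality": 7},
    issues=["Dark mode toggle has no effect", "Card spacing is inconsistent"],
)
print(report.filename())
print(report.to_markdown())
```

Keeping scores and issues in separate sections lets the Generator address the issue list directly while the harness reads only the scores.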

Evaluation Criteria

The default four criteria, each scored 1-10:

## Evaluation Rubric

### Design Quality (weight: 0.3)
- 1-3: Generic, template-like, "AI slop" aesthetics
- 4-6: Competent but unremarkable, follows conventions
- 7-8: Distinctive, cohesive visual identity
- 9-10: Could pass for a professional designer's work

### Originality (weight: 0.2)
- 1-3: Default colors, stock layouts, no personality
- 4-6: Some custom choices, mostly standard patterns
- 7-8: Clear creative vision, unique approach
- 9-10: Surprising, delightful, genuinely novel

### Craft (weight: 0.3)
- 1-3: Broken layouts, missing states, no animations
- 4-6: Works but feels rough, inconsistent spacing
- 7-8: Polished, smooth transitions, responsive
- 9-10: Pixel-perfect, delightful micro-interactions

### Functionality (weight: 0.2)
- 1-3: Core features broken or missing
- 4-6: Happy path works, edge cases fail
- 7-8: All features work, good error handling
- 9-10: Bulletproof, handles every edge case

Scoring

  • Weighted score = sum of (criterion_score * weight)
  • Pass threshold = 7.0 (configurable)
  • Max iterations = 15 (configurable; 5-15 is typically sufficient)
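
As a worked example of the scoring rule, the sketch below applies the rubric weights to one set of scores and compares the result with the pass threshold. The function names are hypothetical, and the dictionary keys reuse the criterion names from the default `GAN_EVAL_CRITERIA` value. The rounding before comparison is an added safeguard against floating-point drift, not something the source specifies.

```python
# Default weights copied from the rubric; they sum to 1.0.
DEFAULT_WEIGHTS = {
    "design": 0.3,
    "originality": 0.2,
    "craft": 0.3,
    "functionality": 0.2,
}


def weighted_score(scores, weights=DEFAULT_WEIGHTS):
    """Return sum(criterion_score * weight) over all weighted criteria."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(scores[name] * weight for name, weight in weights.items())


def passes(scores, threshold=7.0, weights=DEFAULT_WEIGHTS):
    """True when the weighted score meets the pass threshold."""
    # Round first so a mathematically exact 7.0 is not lost to float error.
    return round(weighted_score(scores, weights), 6) >= threshold


example = {"design": 8, "originality": 6, "craft": 7, "functionality": 9}
# 0.3*8 + 0.2*6 + 0.3*7 + 0.2*9 = 2.4 + 1.2 + 2.1 + 1.8 = 7.5
print(round(weighted_score(example), 2))
print(passes(example))
```

Because Design Quality and Craft together carry 60% of the weight, an app that works perfectly but looks generic can still fall short of 7.0.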

Usage

Via Command

# Full three-agent harness
/project:gan-build "Build a project management app with Kanban boards, team collaboration, and dark mode"

# With custom config
/project:gan-build "Build a recipe sharing platform" --max-iterations 10 --pass-threshold 7.5

# Frontend design mode (generator + evaluator only, no planner)
/project:gan-design "Create a landing page for a crypto portfolio tracker"

Via Shell Script

# Basic usage
./scripts/gan-harness.sh "Build a music streaming dashboard"

# With options
GAN_MAX_ITERATIONS=10 \
GAN_PASS_THRESHOLD=7.5 \
GAN_EVAL_CRITERIA="functionality,performance,security" \
./scripts/gan-harness.sh "Build a REST API for task management"

Via Claude Code (Manual)

# Step 1: Plan
claude -p --model opus "You are a Product Planner. Read PLANNER_PROMPT.md. Expand this brief into a full product spec: 'Build a Kanban board app'. Write spec to spec.md"

# Step 2: Generate (iteration 1)
claude -p --model opus "You are a Generator. Read spec.md. Implement Sprint 1. Start the dev server on port 3000."

# Step 3: Evaluate (iteration 1)
claude -p --model opus --allowedTools "Read,Bash,mcp__playwright__*" "You are an Evaluator. Read EVALUATOR_PROMPT.md. Test the live app at http://localhost:3000. Score against the rubric. Write feedback to feedback-001.md"

# Step 4: Generate (iteration 2, reads feedback-001.md)
claude -p --model opus "You are a Generator. Read spec.md and feedback-001.md. Address all issues. Improve the scores."

# Repeat steps 3-4 until pass threshold met
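
The "repeat steps 3-4" control flow can be sketched as a small loop. This is an illustration under stated assumptions, not the contents of `gan-harness.sh`: `run_loop`, `generate`, and `evaluate` are hypothetical names, and in a real harness each callable would wrap one of the `claude -p` invocations above. Stubs stand in for them here so the sketch runs without any external tool.

```python
from typing import Callable, Optional


def run_loop(
    generate: Callable[[int, Optional[str]], None],
    evaluate: Callable[[int], float],
    max_iterations: int = 15,
    pass_threshold: float = 7.0,
):
    """Alternate generate and evaluate until the score passes or the budget runs out.

    Returns (passed, score_history). Both callables are hypothetical stand-ins
    for the Generator and Evaluator agent calls.
    """
    history = []
    feedback_path = None  # iteration 1 has no feedback to read
    for iteration in range(1, max_iterations + 1):
        generate(iteration, feedback_path)
        score = evaluate(iteration)
        history.append(score)
        if score >= pass_threshold:
            return True, history
        # The next generator run reads this iteration's feedback file.
        feedback_path = f"feedback-{iteration:03d}.md"
    return False, history


# Stubbed run: the evaluator "returns" three pre-set weighted scores.
fake_scores = iter([5.2, 6.4, 7.1])
passed, history = run_loop(lambda i, fb: None, lambda i: next(fake_scores))
print(passed, history)  # True [5.2, 6.4, 7.1]
```

Passing the feedback as a file path rather than inline text mirrors the manual steps, where iteration 2 is told to read feedback-001.md.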

Evolution Across Model Capabilities

The harness should get simpler as models improve. The stages below follow the progression Anthropic describes:

Stage 1 — Weaker Models (Sonnet-class)

  • Full sprint decomposition required
  • Context resets between sprints (avoid context anxiety)
  • 2-agent minimum: Initializer + Coding Agent
  • Heavy scaffolding compensates for model limitations

Stage 2 — Capable Models (Opus 4.5-class)

  • Full 3-agent harness: Planner + Generator + Evaluator
  • Sprint contracts before each implementation phase
  • 10-sprint decomposition for complex apps
  • Context resets still useful but less critical

Stage 3 — Frontier Models (Opus 4.6-class)

  • Simplified harness: single planning pass, continuous generation
  • Evaluation reduced to a single end pass (the model needs less iterative correction)
  • No sprint structure needed
  • Automatic compaction handles context growth

Key principle: Every harness component encodes an assumption about what the model can't do alone. When models improve, re-test those assumptions. Strip away what's no longer needed.

Configuration

Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| GAN_MAX_ITERATIONS | 15 | Maximum generator-evaluator cycles |
| GAN_PASS_THRESHOLD | 7.0 | Weighted score to pass (1-10) |
| GAN_PLANNER_MODEL | opus | Model for planning agent |
| GAN_GENERATOR_MODEL | opus | Model for generator agent |
| GAN_EVALUATOR_MODEL | opus | Model for evaluator agent |
| GAN_EVAL_CRITERIA | design,originality,craft,functionality | Comma-separated criteria |
| GAN_DEV_SERVER_PORT | 3000 | Port for the live app |
| GAN_DEV_SERVER_CMD | npm run dev | Command to start dev server |
| GAN_PROJECT_DIR | . | Project working directory |
| GAN_SKIP_PLANNER | false | Skip planner, use spec directly |
| GAN_EVAL_MODE | playwright | playwright, screenshot, or code-only |
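
For a wrapper written outside the shell script, the variables could be read with their documented defaults along these lines. This is a hedged sketch: `GanConfig` and `load_config` are hypothetical names, and the parsing rules (case-insensitive `true`, whitespace-trimmed criteria, rejecting unknown modes) are assumptions about reasonable behavior rather than a description of what `gan-harness.sh` does.

```python
import os
from dataclasses import dataclass

VALID_EVAL_MODES = ("playwright", "screenshot", "code-only")


@dataclass(frozen=True)
class GanConfig:
    max_iterations: int
    pass_threshold: float
    planner_model: str
    generator_model: str
    evaluator_model: str
    eval_criteria: tuple
    dev_server_port: int
    dev_server_cmd: str
    project_dir: str
    skip_planner: bool
    eval_mode: str


def load_config(env=None):
    """Build a GanConfig from GAN_* variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    mode = env.get("GAN_EVAL_MODE", "playwright")
    if mode not in VALID_EVAL_MODES:
        raise ValueError(f"GAN_EVAL_MODE must be one of {VALID_EVAL_MODES}, got {mode!r}")
    criteria = env.get("GAN_EVAL_CRITERIA", "design,originality,craft,functionality")
    return GanConfig(
        max_iterations=int(env.get("GAN_MAX_ITERATIONS", "15")),
        pass_threshold=float(env.get("GAN_PASS_THRESHOLD", "7.0")),
        planner_model=env.get("GAN_PLANNER_MODEL", "opus"),
        generator_model=env.get("GAN_GENERATOR_MODEL", "opus"),
        evaluator_model=env.get("GAN_EVALUATOR_MODEL", "opus"),
        # Assumption: surrounding whitespace and empty entries are ignored.
        eval_criteria=tuple(c.strip() for c in criteria.split(",") if c.strip()),
        dev_server_port=int(env.get("GAN_DEV_SERVER_PORT", "3000")),
        dev_server_cmd=env.get("GAN_DEV_SERVER_CMD", "npm run dev"),
        project_dir=env.get("GAN_PROJECT_DIR", "."),
        # Assumption: only the literal string "true" (any case) enables the flag.
        skip_planner=env.get("GAN_SKIP_PLANNER", "false").strip().lower() == "true",
        eval_mode=mode,
    )


print(load_config({"GAN_MAX_ITERATIONS": "10", "GAN_PASS_THRESHOLD": "7.5"}))
```

Accepting a plain dictionary in place of `os.environ` keeps the loader easy to test without touching the real environment.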

Evaluation Modes

| Mode | Tools | Best For |
| --- | --- | --- |
| playwright | Browser MCP + live interaction | Full-stack apps with UI |
| screenshot | Screenshot + visual analysis | Static sites, design-only |
| code-only | Tests + linting + build | APIs, libraries, CLI tools |

Anti-Patterns

  • Evaluator too lenient: If the evaluator passes everything on iteration 1, your rubric is too generous. Tighten scoring criteria and add explicit penalties for common AI patterns.
  • Generator ignoring feedback: Ensure feedback is passed as a file, not inline. The generator should read feedback-NNN.md at the start of each iteration.
  • Infinite loops: Always set GAN_MAX_ITERATIONS. If the generator can't improve past a score plateau after 3 iterations, stop and flag for human review.
  • Evaluator testing superficially: The evaluator must use Playwright to interact with the live app, not just screenshot it. Click buttons, fill forms, test error states.
  • Evaluator praising its own fixes: Never let the evaluator suggest fixes and then evaluate those fixes. The evaluator only critiques; the generator fixes.
  • Context exhaustion: For long sessions, use Claude Agent SDK's automatic compaction or reset context between major phases.
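
The plateau rule from the infinite-loops item can be checked mechanically. The sketch below is one possible reading of it, with assumptions called out: `has_plateaued` is a hypothetical helper, `window=3` follows the three-iteration rule, and `min_gain=0.1` is an arbitrary tolerance chosen for illustration.

```python
def has_plateaued(history, window=3, min_gain=0.1):
    """True when none of the last `window` scores beats the earlier best by `min_gain`.

    `history` is the list of weighted scores, oldest first.
    """
    if len(history) <= window:
        return False  # not enough earlier data to compare against
    best_before = max(history[:-window])
    return max(history[-window:]) < best_before + min_gain


print(has_plateaued([5.0, 6.0, 6.05, 6.0, 6.02]))  # True: stuck near 6.0
print(has_plateaued([5.0, 5.5, 6.0, 6.5]))         # False: still improving
```

A harness could call this after every evaluation and hand off to human review when it returns `True`, well before GAN_MAX_ITERATIONS is reached.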

Results: What to Expect

Based on Anthropic's published results:

| Metric | Solo Agent | GAN Harness | Improvement |
| --- | --- | --- | --- |
| Time | 20 min | 4-6 hours | 12-18x longer |
| Cost | $9 | $125-200 | 14-22x more |
| Quality | Barely functional | Production-ready | Phase change |
| Core features | Broken | All working | N/A |
| Design | Generic AI slop | Distinctive, polished | N/A |

The tradeoff is clear: roughly 12-22x more time and cost for a qualitative leap in output quality. Use the harness for projects where that quality justifies the spend.
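
The time and cost multipliers can be reproduced from the raw figures in the table. The short check below uses only those figures; the variable names are illustrative.

```python
solo_minutes, solo_cost = 20, 9
harness_minutes = (4 * 60, 6 * 60)   # 4-6 hours, in minutes
harness_cost = (125, 200)            # dollars

time_ratio = tuple(m / solo_minutes for m in harness_minutes)
cost_ratio = tuple(round(c / solo_cost, 1) for c in harness_cost)

print(time_ratio)  # (12.0, 18.0)
print(cost_ratio)  # (13.9, 22.2)
```

The cost range rounds to the 14-22x shown in the table, and the time range matches 12-18x exactly.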
