SKILL.md

Eval Harness Skill

A formal evaluation framework for Claude Code sessions, implementing eval-driven development (EDD) principles.

When to Activate

Setting up eval-driven development (EDD) for AI-assisted workflows

Defining pass/fail criteria for Claude Code task completion

Measuring agent reliability with pass@k metrics

Creating regression test suites for prompt or agent changes

Benchmarking agent performance across model versions

Philosophy

Eval-Driven Development treats evals as the "unit tests of AI development":

Define expected behavior BEFORE implementation

Run evals continuously during development

Track regressions with each change

Use pass@k metrics for reliability measurement

Eval Types

Capability Evals

Test if Claude can do something it couldn't before:

[CAPABILITY EVAL: feature-name]

Task: Description of what Claude should accomplish

Success Criteria:

  - [ ] Criterion 1

  - [ ] Criterion 2

  - [ ] Criterion 3

Expected Output: Description of expected result

Regression Evals

Ensure changes don't break existing functionality:

[REGRESSION EVAL: feature-name]

Baseline: SHA or checkpoint name

Tests:

  - existing-test-1: PASS/FAIL

  - existing-test-2: PASS/FAIL

  - existing-test-3: PASS/FAIL

Result: X/Y passed (previously Y/Y)
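The baseline comparison can be sketched as a diff between two result maps. This is a minimal illustration, not part of the skill itself: the function name `find_regressions` and the dictionary format (test name mapped to "PASS" or "FAIL") are assumptions.

```python
def find_regressions(baseline: dict, current: dict) -> list:
    """Return tests that passed at the baseline but do not pass now."""
    regressions = []
    for name, old_status in baseline.items():
        # A test missing from the current run is counted as a regression too.
        if old_status == "PASS" and current.get(name) != "PASS":
            regressions.append(name)
    return sorted(regressions)


# Hypothetical results, using the test names from the template above.
baseline = {"existing-test-1": "PASS", "existing-test-2": "PASS", "existing-test-3": "PASS"}
current = {"existing-test-1": "PASS", "existing-test-2": "FAIL", "existing-test-3": "PASS"}

passed = sum(1 for status in current.values() if status == "PASS")
print(f"Result: {passed}/{len(current)} passed (previously {len(baseline)}/{len(baseline)})")
print("Regressions:", find_regressions(baseline, current))
```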

Grader Types

1. Code-Based Grader

Deterministic checks using code:

# Check if file contains expected pattern

grep -q "export function handleAuth" src/auth.ts && echo "PASS" || echo "FAIL"

# Check if tests pass

npm test -- --testPathPattern="auth" && echo "PASS" || echo "FAIL"

# Check if build succeeds

npm run build && echo "PASS" || echo "FAIL"
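When graders need to be collected programmatically, the same exit-status check can be wrapped in a small helper. The sketch below is one possible shape, not the skill's own tooling: `grade_command` and its timeout default are hypothetical, and the demo runs the current Python interpreter so that it stays self-contained.

```python
import subprocess
import sys


def grade_command(command: list, timeout_s: float = 300.0) -> str:
    """Run a command and map its exit status to PASS or FAIL."""
    try:
        completed = subprocess.run(command, capture_output=True, timeout=timeout_s)
    except (OSError, subprocess.TimeoutExpired):
        # A missing executable or a hung command is treated as a failure.
        return "FAIL"
    return "PASS" if completed.returncode == 0 else "FAIL"


# Demo commands that exit with status 0 and 1 respectively.
print(grade_command([sys.executable, "-c", "raise SystemExit(0)"]))
print(grade_command([sys.executable, "-c", "raise SystemExit(1)"]))
```

In a project like the one above, the call might look like `grade_command(["npm", "run", "build"])`.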

2. Model-Based Grader

Use Claude to evaluate open-ended outputs:

[MODEL GRADER PROMPT]

Evaluate the following code change:

1. Does it solve the stated problem?

2. Is it well-structured?

3. Are edge cases handled?

4. Is error handling appropriate?

Score: 1-5 (1=poor, 5=excellent)

Reasoning: [explanation]
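A model grader's reply still has to be turned into a verdict. The sketch below parses a reply that follows the Score/Reasoning layout of the prompt above; the function name `parse_grader_reply` and the passing threshold of 4 are assumptions, and no model call is made here.

```python
import re
from typing import Optional, Tuple


def parse_grader_reply(reply: str) -> Tuple[Optional[int], str]:
    """Extract the 1-5 score and the reasoning text from a grader reply."""
    score_match = re.search(r"^Score:\s*([1-5])\b", reply, flags=re.MULTILINE)
    reasoning_match = re.search(
        r"^Reasoning:\s*(.+)", reply, flags=re.MULTILINE | re.DOTALL
    )
    score = int(score_match.group(1)) if score_match else None
    reasoning = reasoning_match.group(1).strip() if reasoning_match else ""
    return score, reasoning


reply = "Score: 4\nReasoning: Solves the problem, but one edge case is unhandled."
score, reasoning = parse_grader_reply(reply)
# Hypothetical threshold: treat 4 or higher as a pass, and a missing score as a fail.
verdict = "PASS" if score is not None and score >= 4 else "FAIL"
print(verdict, "-", reasoning)
```

Treating an unparseable reply as a failure keeps a malformed grader response from silently passing an eval.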

3. Human Grader

Flag for manual review:

[HUMAN REVIEW REQUIRED]

Change: Description of what changed

Reason: Why human review is needed

Risk Level: LOW/MEDIUM/HIGH

Metrics

pass@k

"At least one success in k attempts"

pass@1: First-attempt success rate

pass@3: Success within 3 attempts

Typical target: pass@3 > 90%

pass^k

"All k trials succeed"

Higher bar for reliability

pass^3: 3 consecutive successes

Use for critical paths
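Both metrics can be estimated from n recorded trials of which c succeeded. The skill does not prescribe an estimator, so the sketch below assumes the common subset-counting one: pass@k is one minus the chance that a random k-subset of the trials contains no success, and pass^k is the chance that a random k-subset contains only successes. The function names are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k attempts succeeds) from c successes in n trials."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) counts the k-subsets made up entirely of failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate P(all k attempts succeed) from c successes in n trials."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(c, k) counts the k-subsets made up entirely of successes.
    return comb(c, k) / comb(n, k)


# Example: 7 successes in 10 trials.
print(f"pass@1 = {pass_at_k(10, 7, 1):.3f}")   # 0.700
print(f"pass@3 = {pass_at_k(10, 7, 3):.3f}")   # 0.992
print(f"pass^3 = {pass_hat_k(10, 7, 3):.3f}")  # 0.292
```

The same 70% per-trial success rate looks strong under pass@3 and weak under pass^3, which is why pass^k is the stricter bar for critical paths.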

Eval Workflow

1. Define (Before Coding)

## EVAL DEFINITION: feature-xyz

### Capability Evals

1. Can create new user account

2. Can validate email format

3. Can hash password securely

### Regression Evals

1. Existing login still works

2. Session management unchanged

3. Logout flow intact

### Success Metrics

- pass@3 > 90% for capability evals

- pass^3 = 100% for regression evals

2. Implement

Write code to pass the defined evals.

3. Evaluate

# Run capability evals

[Run each capability eval, record PASS/FAIL]

# Run regression evals

npm test -- --testPathPattern="existing"

# Generate report

4. Report

EVAL REPORT: feature-xyz

========================

Capability Evals:

  create-user:     PASS (attempt 1)

  validate-email:  PASS (attempt 2)

  hash-password:   PASS (attempt 1)

  Overall:         3/3 passed

Regression Evals:

  login-flow:      PASS

  session-mgmt:    PASS

  logout-flow:     PASS

  Overall:         3/3 passed

Metrics:

  pass@1: 67% (2/3)

  pass@3: 100% (3/3)

Status: READY FOR REVIEW
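The Metrics lines in this report follow from the attempt on which each capability eval first passed. A minimal sketch of that arithmetic, assuming a hypothetical mapping from eval name to first passing attempt (None would mean the eval never passed):

```python
# Attempt number on which each eval first passed, taken from the report above.
first_pass_attempt = {"create-user": 1, "validate-email": 2, "hash-password": 1}


def solved_within(attempts: dict, k: int) -> int:
    """Count evals whose first passing attempt is at most k."""
    return sum(1 for a in attempts.values() if a is not None and a <= k)


total = len(first_pass_attempt)
for k in (1, 3):
    solved = solved_within(first_pass_attempt, k)
    print(f"pass@{k}: {round(100 * solved / total)}% ({solved}/{total})")
# pass@1: 67% (2/3)
# pass@3: 100% (3/3)
```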

Integration Patterns

Pre-Implementation

/eval define feature-name

Creates eval definition file at .claude/evals/feature-name.md

During Implementation

/eval check feature-name

Runs current evals and reports status

Post-Implementation

/eval report feature-name

Generates full eval report

Eval Storage

Store evals in project:

.claude/

  evals/

    feature-xyz.md      # Eval definition

    feature-xyz.log     # Eval run history

    baseline.json       # Regression baselines

Best Practices

Define evals BEFORE coding - Forces clear thinking about success criteria

Run evals frequently - Catch regressions early

Track pass@k over time - Monitor reliability trends

Use code graders when possible - Deterministic > probabilistic

Human review for security - Never fully automate security checks

Keep evals fast - Slow evals don't get run

Version evals with code - Evals are first-class artifacts

Example: Adding Authentication

## EVAL: add-authentication

### Phase 1: Define (10 min)

Capability Evals:

- [ ] User can register with email/password

- [ ] User can login with valid credentials

- [ ] Invalid credentials rejected with proper error

- [ ] Sessions persist across page reloads

- [ ] Logout clears session

Regression Evals:

- [ ] Public routes still accessible

- [ ] API responses unchanged

- [ ] Database schema compatible

### Phase 2: Implement (varies)

[Write code]

### Phase 3: Evaluate

Run: /eval check add-authentication

### Phase 4: Report

EVAL REPORT: add-authentication

==============================

Capability: 5/5 passed (pass@3: 100%)

Regression: 3/3 passed (pass^3: 100%)

Status: SHIP IT

Product Evals (v1.8)

Use product evals when behavior quality cannot be captured by unit tests alone.

Grader Types

Code grader (deterministic assertions)

Rule grader (regex/schema constraints)

Model grader (LLM-as-judge rubric)

Human grader (manual adjudication for ambiguous outputs)
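A rule grader sits between the code grader and the model grader: it checks the shape of an output rather than its behavior. The sketch below combines a schema-style check with a regex; the field names `email` and `status`, the pattern, and the function name `rule_grade` are illustrative assumptions rather than a format the skill defines.

```python
import json
import re

# Deliberately loose pattern: something@something.something with no spaces.
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def rule_grade(output: str) -> str:
    """PASS if output is a JSON object with a well-formed email and an integer status."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return "FAIL"
    if not isinstance(data, dict):
        return "FAIL"
    email = data.get("email")
    status = data.get("status")
    email_ok = isinstance(email, str) and EMAIL_PATTERN.fullmatch(email) is not None
    # bool is a subclass of int in Python, so exclude it explicitly.
    status_ok = isinstance(status, int) and not isinstance(status, bool)
    return "PASS" if email_ok and status_ok else "FAIL"


print(rule_grade('{"email": "user@example.com", "status": 201}'))
print(rule_grade('{"email": "not-an-email", "status": 201}'))
```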

pass@k Guidance

pass@1: direct reliability

pass@3: practical reliability under controlled retries

pass^3: stability test (all 3 runs must pass)

Recommended thresholds:

Capability evals: pass@3 >= 0.90

Regression evals: pass^3 = 1.00 for release-critical paths
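These thresholds can be enforced as a simple release gate. The sketch below assumes the two metrics have already been computed as fractions between 0 and 1; the function name `release_gate` is hypothetical.

```python
CAPABILITY_PASS_AT_3_MIN = 0.90
REGRESSION_PASS_HAT_3_MIN = 1.00


def release_gate(capability_pass_at_3: float, regression_pass_hat_3: float) -> bool:
    """Return True only if both recommended thresholds are met."""
    capability_ok = capability_pass_at_3 >= CAPABILITY_PASS_AT_3_MIN
    # A rate cannot exceed 1.0, so >= here means every regression run passed.
    regression_ok = regression_pass_hat_3 >= REGRESSION_PASS_HAT_3_MIN
    return capability_ok and regression_ok


print(release_gate(0.93, 1.0))  # True
print(release_gate(0.93, 0.9))  # False: a regression run failed
```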

Eval Anti-Patterns

Overfitting prompts to known eval examples

Measuring only happy-path outputs

Ignoring cost and latency drift while chasing pass rates

Allowing flaky graders in release gates
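One way to catch the last anti-pattern is to grade the same output several times and check that the verdict never changes. The sketch below is a hypothetical helper, not part of the skill; the unstable grader only simulates nondeterminism by alternating its answers.

```python
from itertools import cycle
from typing import Callable


def is_flaky(grader: Callable[[str], str], sample: str, runs: int = 5) -> bool:
    """Return True if repeated grading of the same sample yields different verdicts."""
    verdicts = {grader(sample) for _ in range(runs)}
    return len(verdicts) > 1


def stable_grader(output: str) -> str:
    # Deterministic rule: the same input always gives the same verdict.
    return "PASS" if "handleAuth" in output else "FAIL"


_verdict_cycle = cycle(["PASS", "FAIL"])


def unstable_grader(output: str) -> str:
    # Simulated nondeterminism: alternates verdicts on every call.
    return next(_verdict_cycle)


sample = "export function handleAuth() {}"
print(is_flaky(stable_grader, sample))    # False
print(is_flaky(unstable_grader, sample))  # True
```

A grader that fails this check is better kept out of release gates until its verdicts are stable.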

Minimal Eval Artifact Layout

.claude/evals/<feature>.md: definition

.claude/evals/<feature>.log: run history

docs/releases/<version>/eval-summary.md: release snapshot
