agent-eval

Head-to-head comparison of coding agents (Claude Code, Aider, Codex, etc.) on custom tasks with pass rate, cost, time, and consistency metrics

INSTALLATION

npx skills add https://github.com/affaan-m/everything-claude-code --skill agent-eval

Run in your project or agent environment. Adjust flags if your CLI version differs.

SKILL.md

Agent Eval Skill

Name: agent-eval
Author: affaan-m

A lightweight CLI tool for comparing coding agents head-to-head on reproducible tasks. Most "which coding agent is best?" comparisons run on vibes; this tool makes them systematic and repeatable.

When to Activate

Comparing coding agents (Claude Code, Aider, Codex, etc.) on your own codebase

Measuring agent performance before adopting a new tool or model

Running regression checks when an agent updates its model or tooling

Producing data-backed agent selection decisions for a team

Installation

Note: Install agent-eval from its repository after reviewing the source.

Core Concepts

YAML Task Definitions

Define tasks declaratively. Each task specifies what to do, which files to touch, and how to judge success:

name: add-retry-logic
description: Add exponential backoff retry to the HTTP client
repo: ./my-project
files:
  - src/http_client.py
prompt: |
  Add retry logic with exponential backoff to all HTTP requests.
  Max 3 retries. Initial delay 1s, max delay 30s.
judge:
  - type: pytest
    command: pytest tests/test_http_client.py -v
  - type: grep
    pattern: "exponential_backoff|retry"
    files: src/http_client.py
commit: "abc1234"  # pin to specific commit for reproducibility
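To make the shape of a task definition concrete, here is a minimal Python sketch that validates a task after it has been parsed into a dictionary (for example by a YAML library). The `Task` dataclass, `task_from_dict`, and the split between required and optional keys are assumptions for illustration; the source does not say which fields agent-eval itself requires.

```python
from dataclasses import dataclass, field, fields

# Assumption: these keys are treated as mandatory in this sketch only.
REQUIRED_KEYS = ("name", "repo", "prompt", "judge")


@dataclass
class Task:
    """Hypothetical in-memory form of one task definition."""
    name: str
    repo: str
    prompt: str
    judge: list
    description: str = ""
    files: list = field(default_factory=list)
    commit: str = "HEAD"  # pinning a real commit is what makes runs reproducible


def task_from_dict(raw: dict) -> Task:
    """Validate a parsed task definition and turn it into a Task."""
    missing = [key for key in REQUIRED_KEYS if key not in raw]
    if missing:
        raise ValueError(f"task is missing required keys: {missing}")
    known = {f.name for f in fields(Task)}
    unknown = sorted(set(raw) - known)
    if unknown:
        raise ValueError(f"task has unknown keys: {unknown}")
    for entry in raw["judge"]:
        if "type" not in entry:
            raise ValueError("every judge entry needs a 'type'")
    return Task(**raw)
```

Rejecting unknown keys early catches typos such as `judges:` before any agent time or API spend is wasted on a malformed task.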

Git Worktree Isolation

Each agent run gets its own git worktree, so no Docker is required. The worktree isolates runs from one another: agents cannot interfere with each other or corrupt the base repo, and every run starts from the same pinned commit.
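The isolation described above can be sketched with plain `git worktree` commands driven from Python. The helper names are hypothetical and this is not necessarily how agent-eval does it internally; the git subcommands and flags themselves (`worktree add --detach`, `worktree remove --force`) are standard.

```python
import subprocess


def worktree_add_cmd(repo, dest, commit):
    """Command that checks out `commit` into a new detached worktree at `dest`."""
    return ["git", "-C", repo, "worktree", "add", "--detach", dest, commit]


def worktree_remove_cmd(repo, dest):
    """Command that deletes the worktree, even if the agent left it dirty."""
    return ["git", "-C", repo, "worktree", "remove", "--force", dest]


def create_worktree(repo, dest, commit):
    """Create the worktree; raises CalledProcessError if git fails."""
    # Use an absolute `dest`: with -C, a relative path is resolved inside `repo`.
    subprocess.run(worktree_add_cmd(repo, dest, commit), check=True)
    return dest


def remove_worktree(repo, dest):
    """Clean up after a run so the base repo does not accumulate worktrees."""
    subprocess.run(worktree_remove_cmd(repo, dest), check=True)
```

A detached checkout avoids creating a branch per run, so repeated trials do not clutter the base repository's branch list.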

Metrics Collected

| Metric      | What It Measures                                   |
| ----------- | -------------------------------------------------- |
| Pass rate   | Did the agent produce code that passes the judge?  |
| Cost        | API spend per task (when available)                |
| Time        | Wall-clock seconds to completion                   |
| Consistency | Pass rate across repeated runs (e.g., 3/3 = 100%)  |
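As a rough illustration of how these four metrics could be derived from repeated runs, the sketch below aggregates a list of per-run records. `RunResult` and `summarize` are hypothetical names, and averaging cost and time across runs is an assumption; the source does not state how agent-eval aggregates them.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class RunResult:
    """Hypothetical record of a single agent run."""
    passed: bool
    seconds: float
    cost_usd: Optional[float] = None  # None when the agent reports no cost


def summarize(runs):
    """Aggregate repeated runs of one agent on one task."""
    if not runs:
        raise ValueError("need at least one run to summarize")
    passes = sum(1 for r in runs if r.passed)
    costs = [r.cost_usd for r in runs if r.cost_usd is not None]
    return {
        "pass_rate": f"{passes}/{len(runs)}",
        "consistency_pct": round(100 * passes / len(runs)),
        "mean_cost_usd": mean(costs) if costs else None,
        "mean_seconds": mean(r.seconds for r in runs),
    }
```

Cost is averaged only over runs that reported it, which matches the "when available" caveat in the table.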

Workflow

1. Define Tasks

Create a tasks/ directory with YAML files, one per task:

mkdir tasks

# Write task definitions (see template above)

2. Run Agents

Execute agents against your tasks:

agent-eval run --task tasks/add-retry-logic.yaml --agent claude-code --agent aider --runs 3

Each run:

Creates a fresh git worktree from the specified commit

Hands the prompt to the agent

Runs the judge criteria

Records pass/fail, cost, and time
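The per-run steps above can be sketched as a single function. This is an illustrative outline, not agent-eval's actual code: `run_once` is a hypothetical name, `agent` is assumed to be a callable that edits files in `workdir` and returns its cost in USD (or `None`), and each judge is assumed to be a callable returning `True` or `False`. The worktree is assumed to exist already.

```python
import time


def run_once(prompt, workdir, agent, judges):
    """Run one agent on one task inside an already-created worktree."""
    start = time.perf_counter()
    cost_usd = agent(prompt, workdir)          # hand the prompt to the agent
    seconds = time.perf_counter() - start      # wall-clock time for the agent only
    passed = all(judge(workdir) for judge in judges)  # every judge must pass
    return {"passed": passed, "cost_usd": cost_usd, "seconds": seconds}
```

Timing stops before the judges run, so a slow test suite does not count against the agent.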

3. Compare Results

Generate a comparison report:

agent-eval report --format table

Task: add-retry-logic (3 runs each)
┌──────────────┬───────────┬────────┬────────┬─────────────┐
│ Agent        │ Pass Rate │ Cost   │ Time   │ Consistency │
├──────────────┼───────────┼────────┼────────┼─────────────┤
│ claude-code  │ 3/3       │ $0.12  │ 45s    │ 100%        │
│ aider        │ 2/3       │ $0.08  │ 38s    │  67%        │
└──────────────┴───────────┴────────┴────────┴─────────────┘

Judge Types

Code-Based (deterministic)

judge:
  - type: pytest
    command: pytest tests/ -v
  - type: command
    command: npm run build
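A deterministic judge of this kind most plausibly reduces to "run the command in the worktree and check the exit code". The sketch below assumes exactly that; `command_judge` and its timeout default are hypothetical, and agent-eval may treat output or timeouts differently.

```python
import shlex
import subprocess


def command_judge(command, cwd, timeout=600):
    """Return True if `command` exits with status 0 when run inside `cwd`."""
    try:
        proc = subprocess.run(
            shlex.split(command),   # no shell, so the string is not re-interpreted
            cwd=cwd,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except (subprocess.TimeoutExpired, FileNotFoundError):
        # A hung or missing command counts as a failed judgment in this sketch.
        return False
    return proc.returncode == 0
```

For example, `command_judge("pytest tests/ -v", cwd=worktree_path)` would pass only when the whole test suite succeeds, where `worktree_path` is wherever the run's worktree was created.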

Pattern-Based

judge:
  - type: grep
    pattern: "class.*Retry"
    files: src/**/*.py
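A pattern judge can be approximated with a regular-expression search over the files matched by the glob. The sketch below is an assumption about the semantics (pass if any line in any matched file matches); `grep_judge` is a hypothetical name and the real tool may differ, for instance by requiring a match in every file.

```python
import glob
import os
import re


def grep_judge(pattern, files, cwd):
    """Return True if any line of any file matching `files` matches `pattern`."""
    regex = re.compile(pattern)
    # recursive=True lets "**" match nested directories, as in src/**/*.py
    for path in glob.glob(os.path.join(cwd, files), recursive=True):
        if not os.path.isfile(path):
            continue
        with open(path, encoding="utf-8", errors="ignore") as handle:
            if any(regex.search(line) for line in handle):
                return True
    return False
```

A pattern match only shows that matching text exists, not that the code works, so it is best paired with a test-based judge.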

Model-Based (LLM-as-judge)

judge:
  - type: llm
    prompt: |
      Does this implementation correctly handle exponential backoff?
      Check for: max retries, increasing delays, jitter.
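One way to turn a free-form model reply into a pass/fail result is to ask for a fixed verdict token and parse only that. The sketch below is purely illustrative: `llm_judge` is a hypothetical name, `ask_model` stands in for whatever client sends a prompt and returns text, and the PASS/FAIL convention is an assumption, not something the source specifies.

```python
def llm_judge(criteria, diff, ask_model):
    """Ask a model to grade `diff` against `criteria`; return True on PASS."""
    prompt = (
        f"{criteria.strip()}\n\n"
        "Answer with PASS or FAIL on the first line, then a short reason.\n\n"
        f"Diff under review:\n{diff}"
    )
    reply = ask_model(prompt).strip()
    # Only the first line is trusted; anything unparseable counts as a failure.
    first_line = reply.splitlines()[0].strip().upper() if reply else ""
    return first_line.startswith("PASS")
```

Treating an empty or malformed reply as a failure keeps a flaky judge from inflating an agent's pass rate.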

Best Practices

Start with 3-5 tasks that represent your real workload, not toy examples

Run at least 3 trials per agent to capture variance — agents are non-deterministic

Pin the commit in your task YAML so results are reproducible across days/weeks

Include at least one deterministic judge (tests, build) per task — LLM judges add noise

Track cost alongside pass rate: an agent with a 95% pass rate at 10x the cost may not be the right choice

Version your task definitions — they are test fixtures, treat them as code
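The cost-versus-pass-rate trade-off can be made concrete with expected cost per successful task. Assuming failed runs are simply retried and runs are independent, the expected number of attempts per success is 1 / pass rate, so the expected spend is cost per run divided by pass rate. The function name and the numbers in the comments are hypothetical.

```python
def cost_per_success(cost_per_run, pass_rate):
    """Expected spend to get one passing result, assuming independent retries."""
    if not 0 < pass_rate <= 1:
        raise ValueError("pass_rate must be in (0, 1]")
    return cost_per_run / pass_rate


# Hypothetical agents: A passes 95% of the time at $1.00 per run,
# B passes 80% of the time at $0.10 per run.
agent_a = cost_per_success(1.00, 0.95)  # roughly $1.05 per success
agent_b = cost_per_success(0.10, 0.80)  # roughly $0.125 per success
```

Under these made-up numbers the cheaper agent costs far less per success despite its lower pass rate, although retries also cost wall-clock time that this figure ignores.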
