agentic-eval

Iterative evaluation and refinement patterns for improving AI agent outputs through self-critique loops.

- Provides three core patterns: basic reflection (self-critique loops), evaluator-optimizer (separated generation and evaluation), and code-specific test-driven refinement.
- Supports multiple evaluation strategies, including outcome-based assessment, LLM-as-judge comparison, and rubric-based scoring with weighted dimensions.
- Includes practical Python implementations with structured JSON output parsing, iteration limits, and convergence detection to prevent infinite loops.
- Best suited for quality-critical tasks like code generation, reports, and analysis, where clear evaluation criteria and success metrics exist.

INSTALLATION

npx skills add https://github.com/github/awesome-copilot --skill agentic-eval

Run in your project or agent environment. Adjust flags if your CLI version differs.

SKILL.md

Agentic Evaluation Patterns

Name: agentic-eval
Author: github

Patterns for self-improvement through iterative evaluation and refinement.

Overview

Evaluation patterns enable agents to assess and improve their own outputs, moving beyond single-shot generation to iterative refinement loops.

Generate → Evaluate → Critique → Refine → Output
    ↑                              │
    └──────────────────────────────┘

When to Use

Quality-critical generation: Code, reports, analysis requiring high accuracy

Tasks with clear evaluation criteria: Defined success metrics exist

Content requiring specific standards: Style guides, compliance, formatting

Pattern 1: Basic Reflection

Agent evaluates and improves its own output through self-critique.

import json

# `llm` is assumed to be a helper that sends a prompt to a model and returns its text reply.
def reflect_and_refine(task: str, criteria: list[str], max_iterations: int = 3) -> str:
    """Generate with reflection loop."""
    output = llm(f"Complete this task:\n{task}")
    for _ in range(max_iterations):
        # Self-critique: request one verdict per criterion in a fixed JSON shape
        critique = llm(f"""
        Evaluate this output against criteria: {criteria}
        Output: {output}
        Return JSON mapping each criterion to
        {{"status": "PASS" or "FAIL", "feedback": "..."}}.
        """)
        critique_data = json.loads(critique)
        all_pass = all(c["status"] == "PASS" for c in critique_data.values())
        if all_pass:
            return output
        # Refine based on the feedback for failed criteria only
        failed = {k: v["feedback"] for k, v in critique_data.items() if v["status"] == "FAIL"}
        output = llm(f"Improve to address: {failed}\nOriginal: {output}")
    # Iteration budget exhausted: return the latest revision
    return output

Key insight: Use structured JSON output for reliable parsing of critique results.
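Model replies do not always arrive as bare JSON; they may be wrapped in prose or a code fence, or use a different casing for the status. The sketch below shows one way to parse a critique defensively. The name `parse_critique` and the choice to raise `ValueError` are illustrative assumptions, not part of the skill; the expected shape mirrors what Pattern 1 reads (`status` and `feedback` per criterion).

```python
import json
import re


def parse_critique(raw: str) -> dict[str, dict[str, str]]:
    """Parse a critique reply into {criterion: {"status": ..., "feedback": ...}}.

    Hypothetical helper: raises ValueError when the reply cannot be used.
    """
    # Grab everything from the first "{" to the last "}" so that surrounding
    # prose or code-fence markers are ignored.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in critique")
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError as exc:
        raise ValueError(f"critique is not valid JSON: {exc}") from exc

    cleaned: dict[str, dict[str, str]] = {}
    for criterion, verdict in data.items():
        if not isinstance(verdict, dict):
            raise ValueError(f"verdict for {criterion!r} is not an object")
        # Normalize casing so "pass" and "PASS" are treated the same.
        status = str(verdict.get("status", "")).upper()
        if status not in ("PASS", "FAIL"):
            raise ValueError(f"criterion {criterion!r} has invalid status {status!r}")
        cleaned[criterion] = {
            "status": status,
            "feedback": str(verdict.get("feedback", "")),
        }
    return cleaned
```

A caller could catch `ValueError` and either re-ask the model for valid JSON or count the attempt as a failed iteration.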

Pattern 2: Evaluator-Optimizer

Separate generation and evaluation into distinct components for clearer responsibilities.

import json

# `llm` is assumed to be a helper that sends a prompt to a model and returns its text reply.
class EvaluatorOptimizer:
    def __init__(self, score_threshold: float = 0.8):
        self.score_threshold = score_threshold

    def generate(self, task: str) -> str:
        return llm(f"Complete: {task}")

    def evaluate(self, output: str, task: str) -> dict:
        # Doubled braces are literal braces inside the f-string
        return json.loads(llm(f"""
        Evaluate output for task: {task}
        Output: {output}
        Return JSON: {{"overall_score": 0-1, "dimensions": {{"accuracy": ..., "clarity": ...}}}}
        """))

    def optimize(self, output: str, feedback: dict) -> str:
        return llm(f"Improve based on feedback: {feedback}\nOutput: {output}")

    def run(self, task: str, max_iterations: int = 3) -> str:
        output = self.generate(task)
        for _ in range(max_iterations):
            evaluation = self.evaluate(output, task)
            # Stop as soon as the output is "good enough"
            if evaluation["overall_score"] >= self.score_threshold:
                break
            output = self.optimize(output, evaluation)
        return output

Pattern 3: Code-Specific Reflection

Test-driven refinement loop for code generation.

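# `llm` and `run_tests` are assumed helpers. `run_tests(code, tests)` is expected to
# return a dict with a boolean "success" and, on failure, an "error" message.
class CodeReflector:
    def reflect_and_fix(self, spec: str, max_iterations: int = 3) -> str:
        code = llm(f"Write Python code for: {spec}")
        # Tests are generated once and then held fixed while the code is revised
        tests = llm(f"Generate pytest tests for: {spec}\nCode: {code}")
        for _ in range(max_iterations):
            result = run_tests(code, tests)
            if result["success"]:
                return code
            # Feed the failure message back so the next revision targets it
            code = llm(f"Fix error: {result['error']}\nCode: {code}")
        return code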

Evaluation Strategies

Outcome-Based

Evaluate whether output achieves the expected result.

def evaluate_outcome(task: str, output: str, expected: str) -> str:
    return llm(f"Does output achieve expected outcome? Task: {task}, Expected: {expected}, Output: {output}")

LLM-as-Judge

Use LLM to compare and rank outputs.

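def llm_judge(output_a: str, output_b: str, criteria: str) -> str:
    # Both candidates must appear in the prompt, or the judge has nothing to compare
    return llm(f"Compare outputs A and B for {criteria}. Which is better and why?\nOutput A: {output_a}\nOutput B: {output_b}")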
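A free-text verdict is hard to act on in code. As a sketch, the judge can be asked for a small JSON object and the reply validated before use. Here `judge_pair`, the `winner`/`reason` keys, and the fallback to a tie on unusable replies are all illustrative assumptions; `llm` is passed in as a callable so the function can be exercised without a real model.

```python
import json
from typing import Callable


def judge_pair(
    llm: Callable[[str], str],
    output_a: str,
    output_b: str,
    criteria: str,
) -> dict:
    """Ask a judge model to pick A, B, or tie; fall back to tie on bad replies."""
    prompt = (
        f"Compare outputs A and B for {criteria}.\n"
        f"Output A:\n{output_a}\n\n"
        f"Output B:\n{output_b}\n\n"
        'Return JSON: {"winner": "A" | "B" | "tie", "reason": "..."}'
    )
    try:
        verdict = json.loads(llm(prompt))
    except json.JSONDecodeError:
        return {"winner": "tie", "reason": "judge reply was not valid JSON"}
    # Reject replies that parse but do not match the requested shape.
    if not isinstance(verdict, dict) or verdict.get("winner") not in ("A", "B", "tie"):
        return {"winner": "tie", "reason": "judge reply had an unexpected shape"}
    return {"winner": verdict["winner"], "reason": str(verdict.get("reason", ""))}
```

Treating an unparseable reply as a tie keeps the pipeline moving; raising an error instead may be preferable when silent ties would skew a ranking.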

Rubric-Based

Score outputs against weighted dimensions.

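RUBRIC = {
    "accuracy": {"weight": 0.4},
    "clarity": {"weight": 0.3},
    "completeness": {"weight": 0.3}
}

def evaluate_with_rubric(output: str, rubric: dict) -> float:
    # Ask for a JSON object mapping each dimension to its 1-5 rating
    scores = json.loads(llm(f"Rate 1-5 for each dimension: {list(rubric.keys())}\nReturn JSON mapping each dimension to its rating.\nOutput: {output}"))
    # Weighted mean of the ratings, divided by the maximum rating (5) so the result is at most 1
    return sum(scores[d] * rubric[d]["weight"] for d in rubric) / 5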
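To make the arithmetic concrete, here is a small worked example with the model call factored out. `weighted_score` and the sample ratings are hypothetical; the weights are the `RUBRIC` values from above. The checks on weights and missing dimensions are an added assumption, not something the original function does.

```python
import math

RUBRIC = {
    "accuracy": {"weight": 0.4},
    "clarity": {"weight": 0.3},
    "completeness": {"weight": 0.3},
}


def weighted_score(scores: dict, rubric: dict, max_rating: float = 5.0) -> float:
    """Combine per-dimension ratings into one score normalized by max_rating."""
    total_weight = sum(dim["weight"] for dim in rubric.values())
    if not math.isclose(total_weight, 1.0):
        raise ValueError(f"rubric weights sum to {total_weight}, expected 1.0")
    missing = [d for d in rubric if d not in scores]
    if missing:
        raise ValueError(f"ratings missing for dimensions: {missing}")
    return sum(scores[d] * rubric[d]["weight"] for d in rubric) / max_rating


# Sample ratings on the 1-5 scale (made up for illustration).
ratings = {"accuracy": 4, "clarity": 5, "completeness": 3}
score = weighted_score(ratings, RUBRIC)
# (4 * 0.4 + 5 * 0.3 + 3 * 0.3) / 5 = (1.6 + 1.5 + 0.9) / 5 = 4.0 / 5 = 0.8,
# up to floating-point rounding.
```

Because the lowest rating is 1, the normalized score ranges from 0.2 to 1.0 rather than 0 to 1, which matters when choosing a threshold.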

Best Practices

| Practice | Rationale |
|----------|-----------|
| Clear criteria | Define specific, measurable evaluation criteria upfront |
| Iteration limits | Set max iterations (3-5) to prevent infinite loops |
| Convergence check | Stop if output score isn't improving between iterations |
| Log history | Keep full trajectory for debugging and analysis |
| Structured output | Use JSON for reliable parsing of evaluation results |
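The iteration-limit, convergence-check, and log-history practices can be combined in one loop. The sketch below is one way to do that, with the generate, evaluate, and optimize steps passed in as callables. The function name, the `min_improvement` parameter, and the history record layout are illustrative assumptions.

```python
from typing import Callable


def refine_with_history(
    generate: Callable[[], str],
    evaluate: Callable[[str], float],
    optimize: Callable[[str, float], str],
    score_threshold: float = 0.8,
    max_iterations: int = 3,
    min_improvement: float = 0.01,
) -> tuple[str, list[dict]]:
    """Refine until good enough, out of budget, or no longer improving.

    Returns the best-scoring output and the full iteration history.
    """
    output = generate()
    history: list[dict] = []
    best_output, best_score = output, float("-inf")

    for iteration in range(max_iterations):
        score = evaluate(output)
        # Log every attempt, including ones that end up discarded.
        history.append({"iteration": iteration, "score": score, "output": output})

        improved = score - best_score >= min_improvement
        if score > best_score:
            best_output, best_score = output, score

        # Stop on success, or when this round did not beat the best so far.
        if score >= score_threshold or not improved:
            break
        output = optimize(output, score)

    return best_output, history
```

Returning the best-scoring output rather than the latest one guards against a refinement step that makes things worse.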

Quick Start Checklist

## Evaluation Implementation Checklist

### Setup

- [ ] Define evaluation criteria/rubric

- [ ] Set score threshold for "good enough"

- [ ] Configure max iterations (default: 3)

### Implementation

- [ ] Implement generate() function

- [ ] Implement evaluate() function with structured output

- [ ] Implement optimize() function

- [ ] Wire up the refinement loop

### Safety

- [ ] Add convergence detection

- [ ] Log all iterations for debugging

- [ ] Handle evaluation parse failures gracefully