Name: google-agents-cli-eval
Author: google

SKILL.md

$28

Evaluation is iterative. When a score is below threshold, diagnose the cause, fix it, rerun — don't just report the failure.

How to iterate

Start small: Begin with 1-2 eval cases, not the full suite

Run eval: agents-cli eval run

Read the scores — identify what failed and why

Fix the code — adjust prompts, tool logic, instructions, or the evalset

Rerun eval — verify the fix worked

Repeat steps 3-5 until the case passes

Only then add more eval cases and expand coverage

Expect 5-10+ iterations. This is normal — each iteration makes the agent better.

Task tracking: When doing 5+ eval-fix iterations, use a task list to track which cases you've fixed, which are still failing, and what you've tried. This prevents re-attempting the same fix or losing track of regression across iterations.

Shortcuts That Waste Time

Recognize these rationalizations and push back — they always cost more time than they save:

Shortcut

Why it fails

"I'll tune the eval thresholds down to make it pass"

Lowering thresholds hides real failures. If the agent can't meet the bar, fix the agent — don't move the bar.

"This eval case is flaky, I'll skip it"

Flaky evals reveal non-determinism in your agent. Fix with temperature=0, rubric-based metrics, or more specific instructions — don't delete the signal.

"I just need to fix the evalset, not the agent"

If you're always adjusting expected outputs, your agent has a behavior problem. Fix the instructions or tool logic first.

What to fix when scores fail

Failure

What to change

tool_trajectory_avg_score low

Fix agent instructions (tool ordering), update evalset tool_uses, or switch to IN_ORDER/ANY_ORDER match type

response_match_score low

Adjust agent instruction wording, or relax the expected response

final_response_match_v2 low

Refine agent instructions, or adjust expected response — this is semantic, not lexical

rubric_based score low

Refine agent instructions to address the specific rubric that failed

hallucinations_v1 low

Tighten agent instructions to stay grounded in tool output

Agent calls wrong tools

Fix tool descriptions, agent instructions, or tool_config

Agent calls extra tools

Use IN_ORDER/ANY_ORDER match type, add strict stop instructions, or switch to rubric_based_tool_use_quality_v1

Choosing the Right Criteria

Goal

Recommended Metric

Regression testing / CI/CD (fast, deterministic)

tool_trajectory_avg_score + response_match_score

Semantic response correctness (flexible phrasing OK)

final_response_match_v2

Response quality without reference answer

rubric_based_final_response_quality_v1

Validate tool usage reasoning

rubric_based_tool_use_quality_v1

Detect hallucinated claims

hallucinations_v1

Safety compliance

safety_v1

Dynamic multi-turn conversations

User simulation + hallucinations_v1 / safety_v1 (see references/user-simulation.md)

Multimodal input (image, audio, file)

tool_trajectory_avg_score + custom metric for response quality (see references/multimodal-eval.md)

For the complete metrics reference with config examples, match types, and custom metrics, see references/criteria-guide.md.

Running Evaluations

# Scaffolded projects — agents-cli:

agents-cli eval run --evalset tests/eval/evalsets/my_evalset.json

# With explicit config file:

agents-cli eval run --evalset tests/eval/evalsets/my_evalset.json --config tests/eval/eval_config.json

# Run all evalsets in tests/eval/evalsets/:

agents-cli eval run --all

**agents-cli eval run options:** --evalset PATH, --config PATH, --all

Compare two result files:

agents-cli eval compare baseline.json candidate.json

Configuration Schema ( eval_config.json )

Both camelCase and snake_case field names are accepted (Pydantic aliases). The examples below use snake_case, matching the official ADK docs.

Full example

{

  "criteria": {

    "tool_trajectory_avg_score": {

      "threshold": 1.0,

      "match_type": "IN_ORDER"

    },

    "final_response_match_v2": {

      "threshold": 0.8,

      "judge_model_options": {

        "judge_model": "gemini-flash-latest",

        "num_samples": 5

      }

    },

    "rubric_based_final_response_quality_v1": {

      "threshold": 0.8,

      "rubrics": [

        {

          "rubric_id": "professionalism",

          "rubric_content": { "text_property": "The response must be professional and helpful." }

        },

        {

          "rubric_id": "safety",

          "rubric_content": { "text_property": "The agent must NEVER book without asking for confirmation." }

        }

      ]

    }

  }

}

Simple threshold shorthand is also valid: "response_match_score": 0.8

For custom metrics, judge_model_options details, and user_simulator_config, see references/criteria-guide.md.

EvalSet Schema ( evalset.json )

{

  "eval_set_id": "my_eval_set",

  "name": "My Eval Set",

  "description": "Tests core capabilities",

  "eval_cases": [

    {

      "eval_id": "search_test",

      "conversation": [

        {

          "invocation_id": "inv_1",

          "user_content": { "parts": [{ "text": "Find a flight to NYC" }] },

          "final_response": {

            "role": "model",

            "parts": [{ "text": "I found a flight for $500. Want to book?" }]

          },

          "intermediate_data": {

            "tool_uses": [

              { "name": "search_flights", "args": { "destination": "NYC" } }

            ],

            "intermediate_responses": [

              ["sub_agent_name", [{ "text": "Found 3 flights to NYC." }]]

            ]

          }

        }

      ],

      "session_input": { "app_name": "my_app", "user_id": "user_1", "state": {} }

    }

  ]

}

Key fields:

intermediate_data.tool_uses — expected tool call trajectory (chronological order)

intermediate_data.intermediate_responses — expected sub-agent responses (for multi-agent systems)

session_input.state — initial session state (overrides Python-level initialization)

conversation_scenario — alternative to conversation for user simulation (see references/user-simulation.md)

Common Gotchas

The Proactivity Trajectory Gap

LLMs often perform extra actions not asked for (e.g., google_search after save_preferences). This causes tool_trajectory_avg_score failures with EXACT match. Solutions:

**Use IN_ORDER or ANY_ORDER match type** — tolerates extra tool calls between expected ones

Include ALL tools the agent might call in your expected trajectory

Use rubric_based_tool_use_quality_v1 instead of trajectory matching

Add strict stop instructions: "Stop after calling save_preferences. Do NOT search."

Multi-turn conversations require tool_uses for ALL turns

The tool_trajectory_avg_score evaluates each invocation. If you don't specify expected tool calls for intermediate turns, the evaluation will fail even if the agent called the right tools.

{

  "conversation": [

    {

      "invocation_id": "inv_1",

      "user_content": { "parts": [{"text": "Find me a flight from NYC to London"}] },

      "intermediate_data": {

        "tool_uses": [

          { "name": "search_flights", "args": {"origin": "NYC", "destination": "LON"} }

        ]

      }

    },

    {

      "invocation_id": "inv_2",

      "user_content": { "parts": [{"text": "Book the first option"}] },

      "final_response": { "role": "model", "parts": [{"text": "Booking confirmed!"}] },

      "intermediate_data": {

        "tool_uses": [

          { "name": "book_flight", "args": {"flight_id": "1"} }

        ]

      }

    }

  ]

}

App name must match directory name

The App object's name parameter MUST match the directory containing your agent:

# CORRECT - matches the "app" directory

app = App(root_agent=root_agent, name="app")

# WRONG - causes "Session not found" errors

app = App(root_agent=root_agent, name="flight_booking_assistant")

The before_agent_callback Pattern (State Initialization)

Always use a callback to initialize session state variables used in your instruction template. This prevents KeyError crashes on the first turn:

async def initialize_state(callback_context: CallbackContext) -> None:

    state = callback_context.state

    if "user_preferences" not in state:

        state["user_preferences"] = {}

root_agent = Agent(

    name="my_agent",

    before_agent_callback=initialize_state,

    instruction="Based on preferences: {user_preferences}...",

)

Eval-State Overrides (Type Mismatch Danger)

Be careful with session_input.state in your evalset. It overrides Python-level initialization:

WRONG — initializes feedback_history as a string, breaks .append():

"state": { "feedback_history": "" }

CORRECT — matches the Python type (list):

"state": { "feedback_history": [] }

Model thinking mode may bypass tools

Models with "thinking" enabled may skip tool calls. Use tool_config with mode="ANY" to force tool usage, or switch to a non-thinking model for predictable tool calling.

Common Eval Failure Causes

Symptom

Cause

Fix

Missing tool_uses in intermediate turns

Trajectory expects match per invocation

Add expected tool calls to all turns

Agent mentions data not in tool output

Hallucination

Tighten agent instructions; add hallucinations_v1 metric

"Session not found" error

App name mismatch

Ensure App name matches directory name

Score fluctuates between runs

Non-deterministic model

Set temperature=0 or use rubric-based eval

tool_trajectory_avg_score always 0

Agent uses google_search (model-internal)

Remove trajectory metric; see references/builtin-tools-eval.md

Trajectory fails but tools are correct

Extra tools called

Switch to IN_ORDER/ANY_ORDER match type

LLM judge ignores image/audio in eval

get_text_from_content() skips non-text parts

Use custom metric with vision-capable judge (see references/multimodal-eval.md)

Deep Dive: ADK Docs

For the official evaluation documentation, fetch these pages:

Evaluation overview: https://adk.dev/evaluate/index.md

Criteria reference: https://adk.dev/evaluate/criteria/index.md

User simulation: https://adk.dev/evaluate/user-sim/index.md

Debugging Example

User says: "tool_trajectory_avg_score is 0, what's wrong?"

Check if agent uses google_search — if so, see references/builtin-tools-eval.md

Check if using EXACT match and agent calls extra tools — try IN_ORDER

Compare expected tool_uses in evalset with actual agent behavior

Fix mismatch (update evalset or agent instructions)

Proving Your Work

Don't assert that eval passes — show the evidence. Concrete output prevents false confidence and catches issues early.

After running eval: Paste the scores table output so the user can see exactly what passed and failed.

After fixing a failure: Show before/after scores for the specific case you fixed, and confirm no other cases regressed.

Before declaring "eval passes": Confirm ALL cases pass, not just the one you were working on. Run agents-cli eval run (or agents-cli eval run --all) one final time.

Before moving to deploy: Show the final agents-cli eval run output with all cases above threshold. This is the gate — no exceptions.

Related Skills

/google-agents-cli-workflow — Development workflow and the spec-driven build-evaluate-deploy lifecycle

/google-agents-cli-adk-code — ADK Python API quick reference for writing agent code

/google-agents-cli-scaffold — Project creation and enhancement with agents-cli scaffold create / scaffold enhance

/google-agents-cli-deploy — Deployment targets, CI/CD pipelines, and production workflows

/google-agents-cli-observability — Cloud Trace, logging, and monitoring for debugging agent behavior

google-agents-cli-eval