adk-eval-guide

Comprehensive evaluation methodology guide for ADK agents covering metrics, schemas, and iteration workflows. Provides eight evaluation criteria (tool trajectory, response matching, rubric-based scoring, hallucination detection, safety) with configurable thresholds and judge model options Includes evalset schema documentation with multi-turn conversation support, tool use trajectory specification, and session state initialization patterns Outlines the eval-fix loop: start small, run evaluation, diagnose failures, fix code or evalset, iterate until threshold met Documents common failure causes (trajectory gaps, state type mismatches, app name mismatches, model thinking mode conflicts) with specific remediation steps References four supplementary guides covering detailed metrics, user simulation, built-in tools behavior, and multimodal evaluation patterns

INSTALLATION
npx skills add https://github.com/google/adk-docs --skill adk-eval-guide
Run in your project or agent environment. Adjust flags if your CLI version differs.

SKILL.md

$2c

Evaluation is iterative. When a score is below threshold, diagnose the cause, fix it, rerun — don't just report the failure.

How to iterate

  • Start small: Begin with 1-2 eval cases, not the full suite
  • Run eval: make eval (or adk eval if no Makefile)
  • Read the scores — identify what failed and why
  • Fix the code — adjust prompts, tool logic, instructions, or the evalset
  • Rerun eval — verify the fix worked
  • Repeat steps 3-5 until the case passes
  • Only then add more eval cases and expand coverage

Expect 5-10+ iterations. This is normal — each iteration makes the agent better.

What to fix when scores fail

Failure

What to change

tool_trajectory_avg_score low

Fix agent instructions (tool ordering), update evalset tool_uses, or switch to IN_ORDER/ANY_ORDER match type

response_match_score low

Adjust agent instruction wording, or relax the expected response

final_response_match_v2 low

Refine agent instructions, or adjust expected response — this is semantic, not lexical

rubric_based score low

Refine agent instructions to address the specific rubric that failed

hallucinations_v1 low

Tighten agent instructions to stay grounded in tool output

Agent calls wrong tools

Fix tool descriptions, agent instructions, or tool_config

Agent calls extra tools

Use IN_ORDER/ANY_ORDER match type, add strict stop instructions, or switch to rubric_based_tool_use_quality_v1

Choosing the Right Criteria

Goal

Recommended Metric

Regression testing / CI/CD (fast, deterministic)

tool_trajectory_avg_score + response_match_score

Semantic response correctness (flexible phrasing OK)

final_response_match_v2

Response quality without reference answer

rubric_based_final_response_quality_v1

Validate tool usage reasoning

rubric_based_tool_use_quality_v1

Detect hallucinated claims

hallucinations_v1

Safety compliance

safety_v1

Dynamic multi-turn conversations

User simulation + hallucinations_v1 / safety_v1 (see references/user-simulation.md)

Multimodal input (image, audio, file)

tool_trajectory_avg_score + custom metric for response quality (see references/multimodal-eval.md)

For the complete metrics reference with config examples, match types, and custom metrics, see references/criteria-guide.md.

Running Evaluations

# Scaffolded projects:

make eval EVALSET=tests/eval/evalsets/my_evalset.json

# Or directly via ADK CLI:

adk eval ./app <path_to_evalset.json> --config_file_path=<path_to_config.json> --print_detailed_results

# Run specific eval cases from a set:

adk eval ./app my_evalset.json:eval_1,eval_2

# With GCS storage:

adk eval ./app my_evalset.json --eval_storage_uri gs://my-bucket/evals

CLI options: --config_file_path, --print_detailed_results, --eval_storage_uri, --log_level

Eval set management:

adk eval_set create <agent_path> <eval_set_id>

adk eval_set add_eval_case <agent_path> <eval_set_id> --scenarios_file <path> --session_input_file <path>

Configuration Schema ( eval_config.json )

Both camelCase and snake_case field names are accepted (Pydantic aliases). The examples below use snake_case, matching the official ADK docs.

Full example

{

  "criteria": {

    "tool_trajectory_avg_score": {

      "threshold": 1.0,

      "match_type": "IN_ORDER"

    },

    "final_response_match_v2": {

      "threshold": 0.8,

      "judge_model_options": {

        "judge_model": "gemini-2.5-flash",

        "num_samples": 5

      }

    },

    "rubric_based_final_response_quality_v1": {

      "threshold": 0.8,

      "rubrics": [

        {

          "rubric_id": "professionalism",

          "rubric_content": { "text_property": "The response must be professional and helpful." }

        },

        {

          "rubric_id": "safety",

          "rubric_content": { "text_property": "The agent must NEVER book without asking for confirmation." }

        }

      ]

    }

  }

}

Simple threshold shorthand is also valid: "response_match_score": 0.8

For custom metrics, judge_model_options details, and user_simulator_config, see references/criteria-guide.md.

EvalSet Schema ( evalset.json )

{

  "eval_set_id": "my_eval_set",

  "name": "My Eval Set",

  "description": "Tests core capabilities",

  "eval_cases": [

    {

      "eval_id": "search_test",

      "conversation": [

        {

          "invocation_id": "inv_1",

          "user_content": { "parts": [{ "text": "Find a flight to NYC" }] },

          "final_response": {

            "role": "model",

            "parts": [{ "text": "I found a flight for $500. Want to book?" }]

          },

          "intermediate_data": {

            "tool_uses": [

              { "name": "search_flights", "args": { "destination": "NYC" } }

            ],

            "intermediate_responses": [

              ["sub_agent_name", [{ "text": "Found 3 flights to NYC." }]]

            ]

          }

        }

      ],

      "session_input": { "app_name": "my_app", "user_id": "user_1", "state": {} }

    }

  ]

}

Key fields:

  • intermediate_data.tool_uses — expected tool call trajectory (chronological order)
  • intermediate_data.intermediate_responses — expected sub-agent responses (for multi-agent systems)
  • session_input.state — initial session state (overrides Python-level initialization)
  • conversation_scenario — alternative to conversation for user simulation (see references/user-simulation.md)

Common Gotchas

The Proactivity Trajectory Gap

LLMs often perform extra actions not asked for (e.g., google_search after save_preferences). This causes tool_trajectory_avg_score failures with EXACT match. Solutions:

  • **Use IN_ORDER or ANY_ORDER match type** — tolerates extra tool calls between expected ones
  • Include ALL tools the agent might call in your expected trajectory
  • Use rubric_based_tool_use_quality_v1 instead of trajectory matching
  • Add strict stop instructions: "Stop after calling save_preferences. Do NOT search."

Multi-turn conversations require tool_uses for ALL turns

The tool_trajectory_avg_score evaluates each invocation. If you don't specify expected tool calls for intermediate turns, the evaluation will fail even if the agent called the right tools.

{

  "conversation": [

    {

      "invocation_id": "inv_1",

      "user_content": { "parts": [{"text": "Find me a flight from NYC to London"}] },

      "intermediate_data": {

        "tool_uses": [

          { "name": "search_flights", "args": {"origin": "NYC", "destination": "LON"} }

        ]

      }

    },

    {

      "invocation_id": "inv_2",

      "user_content": { "parts": [{"text": "Book the first option"}] },

      "final_response": { "role": "model", "parts": [{"text": "Booking confirmed!"}] },

      "intermediate_data": {

        "tool_uses": [

          { "name": "book_flight", "args": {"flight_id": "1"} }

        ]

      }

    }

  ]

}

App name must match directory name

The App object's name parameter MUST match the directory containing your agent:

# CORRECT - matches the "app" directory

app = App(root_agent=root_agent, name="app")

# WRONG - causes "Session not found" errors

app = App(root_agent=root_agent, name="flight_booking_assistant")

The before_agent_callback Pattern (State Initialization)

Always use a callback to initialize session state variables used in your instruction template. This prevents KeyError crashes on the first turn:

async def initialize_state(callback_context: CallbackContext) -> None:

    state = callback_context.state

    if "user_preferences" not in state:

        state["user_preferences"] = {}

root_agent = Agent(

    name="my_agent",

    before_agent_callback=initialize_state,

    instruction="Based on preferences: {user_preferences}...",

)

Eval-State Overrides (Type Mismatch Danger)

Be careful with session_input.state in your evalset. It overrides Python-level initialization:

// WRONG — initializes feedback_history as a string, breaks .append()

"state": { "feedback_history": "" }

// CORRECT — matches the Python type (list)

"state": { "feedback_history": [] }

// NOTE: Remove these // comments before using — JSON does not support comments.

Model thinking mode may bypass tools

Models with "thinking" enabled may skip tool calls. Use tool_config with mode="ANY" to force tool usage, or switch to a non-thinking model for predictable tool calling.

Common Eval Failure Causes

Symptom

Cause

Fix

Missing tool_uses in intermediate turns

Trajectory expects match per invocation

Add expected tool calls to all turns

Agent mentions data not in tool output

Hallucination

Tighten agent instructions; add hallucinations_v1 metric

"Session not found" error

App name mismatch

Ensure App name matches directory name

Score fluctuates between runs

Non-deterministic model

Set temperature=0 or use rubric-based eval

tool_trajectory_avg_score always 0

Agent uses google_search (model-internal)

Remove trajectory metric; see references/builtin-tools-eval.md

Trajectory fails but tools are correct

Extra tools called

Switch to IN_ORDER/ANY_ORDER match type

LLM judge ignores image/audio in eval

get_text_from_content() skips non-text parts

Use custom metric with vision-capable judge (see references/multimodal-eval.md)

Deep Dive: ADK Docs

For the official evaluation documentation, fetch these pages:

  • Evaluation overview: https://adk.dev/evaluate/index.md
  • Criteria reference: https://adk.dev/evaluate/criteria/index.md
  • User simulation: https://adk.dev/evaluate/user-sim/index.md

Debugging Example

User says: "tool_trajectory_avg_score is 0, what's wrong?"

  • Check if agent uses google_search — if so, see references/builtin-tools-eval.md
  • Check if using EXACT match and agent calls extra tools — try IN_ORDER
  • Compare expected tool_uses in evalset with actual agent behavior
  • Fix mismatch (update evalset or agent instructions)
BrowserAct

Let your agent run on any real-world website

Bypass CAPTCHA & anti-bot for free. Start local, scale to cloud.

Explore BrowserAct Skills →

Stop writing automation&scrapers

Install the CLI. Run your first Skill in 30 seconds. Scale when you're ready.

Start free
free · no credit card