phoenix-evals

Build and run evaluators for AI/LLM applications using Phoenix.

INSTALLATION
npx skills add https://github.com/arize-ai/phoenix --skill phoenix-evals
Run in your project or agent environment. Adjust flags if your CLI version differs.

SKILL.md

Phoenix Evals

Build evaluators for AI/LLM applications. Code first, LLM for nuance, validate against humans.
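As a rough illustration of "code first", the sketch below is a deterministic pass/fail check written in plain Python. It does not use the Phoenix evaluator API; `EvalResult` and `valid_json_with_keys` are hypothetical names introduced only for this example, and the JSON-with-required-keys rule is an assumed stand-in for whatever failure you actually observed.

```python
import json
from dataclasses import dataclass


@dataclass
class EvalResult:
    """Hypothetical result container: a binary label, a 0/1 score, and a reason."""
    label: str        # "pass" or "fail"
    score: int        # 1 for pass, 0 for fail
    explanation: str  # short human-readable reason


def valid_json_with_keys(output: str, required_keys: set[str]) -> EvalResult:
    """Deterministic check: output must be a JSON object containing every required key."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError as exc:
        return EvalResult("fail", 0, f"not valid JSON: {exc.msg}")
    if not isinstance(parsed, dict):
        return EvalResult("fail", 0, "top-level JSON value is not an object")
    missing = required_keys - set(parsed)
    if missing:
        return EvalResult("fail", 0, f"missing keys: {sorted(missing)}")
    return EvalResult("pass", 1, "all required keys present")


if __name__ == "__main__":
    # Assumed example output from an LLM application.
    print(valid_json_with_keys('{"answer": "42"}', {"answer", "sources"}))
```

A check like this is cheap and repeatable, so it is worth exhausting before reaching for an LLM judge on the cases that need nuance.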

Quick Reference

| Task | Files |
| --- | --- |
| Setup | setup-python, setup-typescript |
| Decide what to evaluate | evaluators-overview |
| Choose a judge model | fundamentals-model-selection |
| Use pre-built evaluators | evaluators-pre-built |
| Build code evaluator | evaluators-code-python, evaluators-code-typescript |
| Build LLM evaluator | evaluators-llm-python, evaluators-llm-typescript, evaluators-custom-templates |
| Batch evaluate DataFrame | evaluate-dataframe-python |
| Run experiment | experiments-running-python, experiments-running-typescript |
| Create dataset | experiments-datasets-python, experiments-datasets-typescript |
| Generate synthetic data | experiments-synthetic-python, experiments-synthetic-typescript |
| Validate evaluator accuracy | validation, validation-evaluators-python, validation-evaluators-typescript |
| Sample traces for review | observe-sampling-python, observe-sampling-typescript |
| Analyze errors | error-analysis, error-analysis-multi-turn, axial-coding |
| RAG evals | evaluators-rag |
| Avoid common mistakes | common-mistakes-python, fundamentals-anti-patterns |
| Production | production-overview, production-guardrails, production-continuous |

Workflows

Starting Fresh:

observe-tracing-setup → error-analysis → axial-coding → evaluators-overview

Building Evaluator:

fundamentals → common-mistakes-python → evaluators-{code|llm}-{python|typescript} → validation-evaluators-{python|typescript}

RAG Systems:

evaluators-rag → evaluators-code-* (retrieval) → evaluators-llm-* (faithfulness)

Production:

production-overview → production-guardrails → production-continuous
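For the RAG workflow, the retrieval step is the part that usually lends itself to a code evaluator. The sketch below is plain Python rather than the Phoenix API: `precision_at_k` and `hit_at_k` are hypothetical helper names, and it assumes you already have ground-truth relevant document IDs for each query.

```python
def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant (divides by k)."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = retrieved_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / k


def hit_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> bool:
    """Binary pass/fail: did at least one relevant document appear in the top k?"""
    if k <= 0:
        raise ValueError("k must be positive")
    return any(doc_id in relevant_ids for doc_id in retrieved_ids[:k])


if __name__ == "__main__":
    # Assumed example: document IDs are illustrative only.
    retrieved = ["doc-7", "doc-2", "doc-9", "doc-4"]
    relevant = {"doc-7", "doc-4"}
    print(precision_at_k(retrieved, relevant, k=2))
    print(hit_at_k(retrieved, relevant, k=2))
```

Faithfulness, by contrast, depends on whether the answer is supported by the retrieved text, which is why the workflow hands that step to an LLM evaluator.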

Reference Categories

| Prefix | Description |
| --- | --- |
| fundamentals-* | Types, scores, anti-patterns |
| observe-* | Tracing, sampling |
| error-analysis-* | Finding failures |
| axial-coding-* | Categorizing failures |
| evaluators-* | Code, LLM, RAG evaluators |
| experiments-* | Datasets, running experiments |
| validation-* | Validating evaluator accuracy against human labels |
| production-* | CI/CD, monitoring |

Key Principles

| Principle | Action |
| --- | --- |
| Error analysis first | Can't automate what you haven't observed |
| Custom > generic | Build from your failures |
| Code first | Deterministic before LLM |
| Validate judges | >80% TPR/TNR |
| Binary > Likert | Pass/fail, not 1-5 |
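The "Validate judges" target can be checked with a few lines of plain Python once you have human pass/fail labels alongside the judge's verdicts. This is a minimal sketch, not the Phoenix validation API: `tpr_tnr` and `judge_meets_threshold` are hypothetical names, `True` is assumed to mean "pass" (the positive class), and the threshold is read as strictly greater than 0.8 for both rates.

```python
def tpr_tnr(human: list[bool], judge: list[bool]) -> tuple[float, float]:
    """Return (true positive rate, true negative rate) of judge verdicts vs human labels."""
    if len(human) != len(judge):
        raise ValueError("human and judge must have the same length")
    tp = sum(1 for h, j in zip(human, judge) if h and j)
    fn = sum(1 for h, j in zip(human, judge) if h and not j)
    tn = sum(1 for h, j in zip(human, judge) if not h and not j)
    fp = sum(1 for h, j in zip(human, judge) if not h and j)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative human label")
    return tp / (tp + fn), tn / (tn + fp)


def judge_meets_threshold(human: list[bool], judge: list[bool], threshold: float = 0.8) -> bool:
    """True only if both TPR and TNR are strictly above the threshold."""
    tpr, tnr = tpr_tnr(human, judge)
    return tpr > threshold and tnr > threshold


if __name__ == "__main__":
    # Assumed toy labels: 5 human "pass" and 5 human "fail" examples.
    human_labels = [True] * 5 + [False] * 5
    judge_labels = [True, True, True, True, False] + [False] * 5
    print(tpr_tnr(human_labels, judge_labels))
    print(judge_meets_threshold(human_labels, judge_labels))
```

Reporting the two rates separately matters when passes and fails are imbalanced, since a judge that always says "pass" can still score high on raw agreement. Binary labels are also what make this calculation possible without choosing a cutoff on a 1-5 scale.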
