langsmith-observability

LLM observability platform for tracing, evaluation, and monitoring. Use when debugging LLM applications, evaluating model outputs against datasets, monitoring…

INSTALLATION
npx skills add https://github.com/davila7/claude-code-templates --skill langsmith-observability
Run in your project or agent environment. Adjust flags if your CLI version differs.

SKILL.md

LangSmith - LLM Observability Platform

Development platform for debugging, evaluating, and monitoring language models and AI applications.

When to use LangSmith

Use LangSmith when:

  • Debugging LLM application issues (prompts, chains, agents)
  • Evaluating model outputs systematically against datasets
  • Monitoring production LLM systems
  • Building regression testing for AI features
  • Analyzing latency, token usage, and costs
  • Collaborating on prompt engineering

Key features:

  • Tracing: Capture inputs, outputs, latency for all LLM calls
  • Evaluation: Systematic testing with built-in and custom evaluators
  • Datasets: Create test sets from production traces or manually
  • Monitoring: Track metrics, errors, and costs in production
  • Integrations: Works with OpenAI, Anthropic, LangChain, LlamaIndex

Use alternatives instead:

  • Weights & Biases: Deep learning experiment tracking, model training
  • MLflow: General ML lifecycle, model registry focus
  • Arize/WhyLabs: ML monitoring, data drift detection

Quick start

Installation

pip install langsmith

# Set environment variables
export LANGSMITH_API_KEY="your-api-key"
export LANGSMITH_TRACING=true

Basic tracing with @traceable

from langsmith import traceable
from openai import OpenAI

client = OpenAI()

@traceable
def generate_response(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

# Automatically traced to LangSmith
result = generate_response("What is machine learning?")

OpenAI wrapper (automatic tracing)

from langsmith.wrappers import wrap_openai
from openai import OpenAI

# Wrap client for automatic tracing
client = wrap_openai(OpenAI())

# All calls automatically traced
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}]
)

Core concepts

Runs and traces

A run is a single execution unit (LLM call, chain, tool). Runs form hierarchical traces showing the full execution flow.

from langsmith import traceable

@traceable(run_type="chain")
def process_query(query: str) -> str:
    # Parent run
    context = retrieve_context(query)  # Child run
    response = generate_answer(query, context)  # Child run
    return response

@traceable(run_type="retriever")
def retrieve_context(query: str) -> list:
    return vector_store.search(query)

@traceable(run_type="llm")
def generate_answer(query: str, context: list) -> str:
    return llm.invoke(f"Context: {context}\n\nQuestion: {query}")

Projects

Projects organize related runs. Set via environment or code:

import os
from langsmith import traceable

os.environ["LANGSMITH_PROJECT"] = "my-project"

# Or per-function
@traceable(project_name="my-project")
def my_function():
    pass

Client API

from langsmith import Client

client = Client()

# List runs
runs = list(client.list_runs(
    project_name="my-project",
    filter='eq(status, "success")',
    limit=100
))

# Get run details
run = client.read_run(run_id="...")

# Create feedback
client.create_feedback(
    run_id="...",
    key="correctness",
    score=0.9,
    comment="Good answer"
)
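Once runs have been listed, they can be summarized locally. The sketch below computes simple latency statistics, assuming each run object exposes `start_time` and `end_time` as datetimes (with `end_time` unset while a run is still in progress). The `SimpleNamespace` objects are stand-ins so the example runs without an API key; `latency_seconds` and `summarize_latency` are hypothetical helper names.

```python
from datetime import datetime, timedelta
from statistics import mean
from types import SimpleNamespace

def latency_seconds(run):
    """Return the wall-clock duration of a run, or None if it has not finished."""
    if run.start_time is None or run.end_time is None:
        return None
    return (run.end_time - run.start_time).total_seconds()

def summarize_latency(runs):
    """Aggregate latencies of finished runs into a small summary dict."""
    latencies = [s for s in map(latency_seconds, runs) if s is not None]
    if not latencies:
        return {"count": 0, "mean": None, "max": None}
    return {"count": len(latencies), "mean": mean(latencies), "max": max(latencies)}

# Stand-in runs; in practice these would come from client.list_runs(...)
start = datetime(2024, 1, 1, 12, 0, 0)
fake_runs = [
    SimpleNamespace(start_time=start, end_time=start + timedelta(seconds=1)),
    SimpleNamespace(start_time=start, end_time=start + timedelta(seconds=3)),
    SimpleNamespace(start_time=start, end_time=None),  # still running, skipped
]
summary = summarize_latency(fake_runs)
print(summary)  # {'count': 2, 'mean': 2.0, 'max': 3.0}
```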

Datasets and evaluation

Create dataset

from langsmith import Client

client = Client()

# Create dataset
dataset = client.create_dataset("qa-test-set", description="QA evaluation")

# Add examples
client.create_examples(
    inputs=[
        {"question": "What is Python?"},
        {"question": "What is ML?"}
    ],
    outputs=[
        {"answer": "A programming language"},
        {"answer": "Machine learning"}
    ],
    dataset_id=dataset.id
)
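The `inputs` and `outputs` lists above are parallel: the first output belongs to the first input, and so on. When test cases live as pairs (for example, loaded from a CSV file), a small helper can build both lists so they cannot drift out of alignment. This is a sketch; `to_example_lists` is a hypothetical name, and the `question`/`answer` keys simply mirror the example above.

```python
def to_example_lists(pairs):
    """Split (question, answer) pairs into parallel inputs/outputs lists."""
    inputs = [{"question": question} for question, _ in pairs]
    outputs = [{"answer": answer} for _, answer in pairs]
    return inputs, outputs

qa_pairs = [
    ("What is Python?", "A programming language"),
    ("What is ML?", "Machine learning"),
]
inputs, outputs = to_example_lists(qa_pairs)
print(inputs[0], outputs[0])
```

The two lists can then be passed as the `inputs=` and `outputs=` arguments shown above.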

Run evaluation

from langsmith import evaluate

def my_model(inputs: dict) -> dict:
    # Your model logic
    return {"answer": generate_answer(inputs["question"])}

def correctness_evaluator(run, example):
    prediction = run.outputs["answer"]
    reference = example.outputs["answer"]
    score = 1.0 if reference.lower() in prediction.lower() else 0.0
    return {"key": "correctness", "score": score}

results = evaluate(
    my_model,
    data="qa-test-set",
    evaluators=[correctness_evaluator],
    experiment_prefix="v1"
)

# Average the per-example "correctness" scores (row layout as returned by
# recent SDK versions; adjust if your version differs)
scores = [
    r.score
    for row in results
    for r in row["evaluation_results"]["results"]
    if r.key == "correctness" and r.score is not None
]
print(f"Average score: {sum(scores) / len(scores)}")
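Because an evaluator of this shape only reads `run.outputs` and `example.outputs`, it can be unit-tested offline before any experiment is launched. The sketch below uses `SimpleNamespace` objects as stand-ins for the real run and example objects (an assumption for illustration), and adds a guard for missing outputs that the version above does not have.

```python
from types import SimpleNamespace

def correctness_evaluator(run, example):
    # Tolerate runs that failed and therefore have no outputs.
    prediction = (run.outputs or {}).get("answer", "")
    reference = (example.outputs or {}).get("answer", "")
    matched = bool(reference) and reference.lower() in prediction.lower()
    return {"key": "correctness", "score": 1.0 if matched else 0.0}

# Stand-ins for the objects the evaluation harness would pass in.
fake_example = SimpleNamespace(outputs={"answer": "A programming language"})
good_run = SimpleNamespace(outputs={"answer": "Python is a programming language."})
failed_run = SimpleNamespace(outputs=None)

print(correctness_evaluator(good_run, fake_example))
# {'key': 'correctness', 'score': 1.0}
print(correctness_evaluator(failed_run, fake_example))
# {'key': 'correctness', 'score': 0.0}
```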

Built-in evaluators

from langsmith.evaluation import evaluate, LangChainStringEvaluator

# Use LangChain evaluators
results = evaluate(
    my_model,
    data="qa-test-set",
    evaluators=[
        LangChainStringEvaluator("qa"),
        LangChainStringEvaluator("cot_qa")
    ]
)

Advanced tracing

Tracing context

from langsmith import tracing_context

with tracing_context(
    project_name="experiment-1",
    tags=["production", "v2"],
    metadata={"version": "2.0"}
):
    # All traceable calls inherit context
    result = my_function()

Manual runs

from langsmith import trace

with trace(
    name="custom_operation",
    run_type="tool",
    inputs={"query": "test"}
) as run:
    result = do_something()
    run.end(outputs={"result": result})

Process inputs/outputs

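from langsmith import traceable

def sanitize_inputs(inputs: dict) -> dict:
    # Return a redacted copy rather than mutating the dict that was passed in
    if "password" in inputs:
        return {**inputs, "password": "***"}
    return inputs

@traceable(process_inputs=sanitize_inputs)
def login(username: str, password: str):
    return authenticate(username, password)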
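Sensitive values are often nested inside dictionaries or lists rather than passed as top-level arguments. The following sketch shows one way to redact them recursively without mutating the original data. `SENSITIVE_KEYS` and `redact` are hypothetical names, and the key list is only an example to adapt.

```python
SENSITIVE_KEYS = {"password", "api_key", "token"}  # hypothetical; adjust as needed

def redact(value):
    """Return a copy of `value` with sensitive dict entries masked."""
    if isinstance(value, dict):
        return {
            key: "***"
            if isinstance(key, str) and key.lower() in SENSITIVE_KEYS
            else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value  # scalars are returned unchanged

original = {
    "username": "ada",
    "password": "hunter2",
    "profile": {"token": "abc", "tags": ["x"]},
}
print(redact(original))
# {'username': 'ada', 'password': '***', 'profile': {'token': '***', 'tags': ['x']}}
```

A function like this takes a dict and returns a dict, so it should be usable as a `process_inputs` callable in the same way as `sanitize_inputs` above.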

Sampling

import os

os.environ["LANGSMITH_TRACING_SAMPLING_RATE"] = "0.1"  # 10% sampling
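To get a feel for what a 10% rate means for trace volume, the simulation below models sampling as an independent random decision made once per trace. That model is an assumption for illustration, not a description of the SDK's internals, and `should_trace` is a hypothetical name.

```python
import random

def should_trace(rate: float, rng: random.Random) -> bool:
    """Toy sampling decision: keep a trace with probability `rate`."""
    return rng.random() < rate

rng = random.Random(0)  # seeded so the simulation is repeatable
total_requests = 10_000
kept = sum(should_trace(0.1, rng) for _ in range(total_requests))
print(f"kept {kept} of {total_requests} traces ({kept / total_requests:.1%})")
```

The kept count lands near, but not exactly on, 10% of requests, which is worth remembering when comparing trace counts against request logs.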

LangChain integration

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

# Tracing enabled automatically with LANGSMITH_TRACING=true
llm = ChatOpenAI(model="gpt-4o")

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant."),
    ("user", "{input}")
])

chain = prompt | llm

# All chain runs traced automatically
response = chain.invoke({"input": "Hello!"})

Production monitoring

Hub prompts

from langsmith import Client

client = Client()

# Pull prompt from hub
prompt = client.pull_prompt("my-org/qa-prompt")

# Use in application
result = prompt.invoke({"question": "What is AI?"})

Async client

import asyncio
from langsmith import AsyncClient

async def main():
    client = AsyncClient()
    runs = []
    async for run in client.list_runs(project_name="my-project"):
        runs.append(run)
    return runs

# Run the coroutine from synchronous code
runs = asyncio.run(main())

Feedback collection

from typing import Optional
from langsmith import Client

client = Client()

# Collect user feedback
def record_feedback(run_id: str, user_rating: int, comment: Optional[str] = None):
    client.create_feedback(
        run_id=run_id,
        key="user_rating",
        score=user_rating / 5.0,  # Scale to 0-1 (a 1-5 rating maps to 0.2-1.0)
        comment=comment
    )

# In your application
record_feedback(run_id="...", user_rating=4, comment="Helpful response")
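Dividing by 5.0 as above maps a 1-star rating to 0.2 rather than 0. If the lowest rating should map to 0 and the highest to 1, a min-max rescaling can be used instead. This is a sketch; `rating_to_score` is a hypothetical helper, and the 1 to 5 scale is assumed from the example.

```python
def rating_to_score(rating: int, low: int = 1, high: int = 5) -> float:
    """Rescale a rating on [low, high] to a score on [0, 1]."""
    if not low <= rating <= high:
        raise ValueError(f"rating must be between {low} and {high}, got {rating}")
    return (rating - low) / (high - low)

print(rating_to_score(1))  # 0.0
print(rating_to_score(4))  # 0.75
print(rating_to_score(5))  # 1.0
```

Whichever mapping is chosen, it should stay fixed over time so that feedback averages remain comparable.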

Testing integration

Pytest integration

from langsmith import test

@test
def test_qa_accuracy():
    result = my_qa_function("What is Python?")
    assert "programming" in result.lower()

Evaluation in CI/CD

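from langsmith import evaluate

def run_evaluation():
    results = evaluate(
        my_model,
        data="regression-test-set",
        evaluators=[accuracy_evaluator]
    )

    # Average the per-example "accuracy" scores (row layout as returned by
    # recent SDK versions; adjust if your version differs)
    scores = [
        r.score
        for row in results
        for r in row["evaluation_results"]["results"]
        if r.key == "accuracy" and r.score is not None
    ]
    accuracy = sum(scores) / len(scores)

    # Fail CI if accuracy drops
    assert accuracy >= 0.9, f"Accuracy {accuracy} below threshold"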
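The threshold check itself can be factored into a small helper that operates on a plain list of scores, which keeps the gating logic testable without calling LangSmith. This is a sketch; `check_threshold` is a hypothetical name, and how the per-example scores are extracted from an experiment depends on the SDK version in use.

```python
def check_threshold(scores, threshold=0.9):
    """Return the mean score, raising AssertionError if it is below threshold."""
    if not scores:
        raise ValueError("no scores to evaluate")
    mean_score = sum(scores) / len(scores)
    if mean_score < threshold:
        raise AssertionError(f"Mean score {mean_score} below threshold {threshold}")
    return mean_score

# Three correct answers and one half-credit answer average to 0.875.
example_scores = [1.0, 1.0, 0.5, 1.0]
print(check_threshold(example_scores, threshold=0.8))  # 0.875
```

Raising on an empty list is deliberate: an evaluation that produced no scores should fail the pipeline rather than pass silently.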

Best practices

  • Structured naming - Use consistent project/run naming conventions
  • Add metadata - Include version, environment, user info
  • Sample in production - Use sampling rate to control volume
  • Create datasets - Build test sets from interesting production cases
  • Automate evaluation - Run evaluations in CI/CD pipelines
  • Monitor costs - Track token usage and latency trends

Common issues

Traces not appearing:

import os

# Ensure tracing is enabled
os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_API_KEY"] = "your-key"

# Verify connection
from langsmith import Client

client = Client()
# list_projects() is lazy, so consume it to actually send a request
print(list(client.list_projects()))  # Should work
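Before contacting the API at all, it can help to confirm that the two environment variables used throughout this guide are present. The sketch below checks a mapping such as `os.environ` and reports what is missing; `missing_tracing_settings` is a hypothetical helper, and it only checks the variable names shown in this document.

```python
import os

def missing_tracing_settings(env):
    """Return a list of human-readable problems with the tracing settings."""
    problems = []
    if not env.get("LANGSMITH_API_KEY"):
        problems.append("LANGSMITH_API_KEY is not set")
    if env.get("LANGSMITH_TRACING", "").lower() != "true":
        problems.append("LANGSMITH_TRACING is not set to 'true'")
    return problems

# Check the real environment and print anything that needs fixing.
for problem in missing_tracing_settings(os.environ):
    print("warning:", problem)
```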

High latency from tracing:

import os
from langsmith import Client

# Enable background batching (default)
client = Client(auto_batch_tracing=True)

# Or use sampling
os.environ["LANGSMITH_TRACING_SAMPLING_RATE"] = "0.1"

Large payloads:

# Hide sensitive/large fields

from langsmith import traceable

# Hide sensitive/large fields
@traceable(
    process_inputs=lambda x: {k: v for k, v in x.items() if k != "large_field"}
)
def my_function(data):
    pass

References

  • Advanced Usage - Custom evaluators, distributed tracing, hub prompts

Resources

  • Version: 0.2.0+
  • License: MIT