SKILL.md

LangSmith - LLM Observability Platform

LangSmith is a development platform for debugging, evaluating, and monitoring language models and AI applications.

When to use LangSmith

Use LangSmith when:

Debugging LLM application issues (prompts, chains, agents)

Evaluating model outputs systematically against datasets

Monitoring production LLM systems

Building regression testing for AI features

Analyzing latency, token usage, and costs

Collaborating on prompt engineering

Key features:

Tracing: Capture inputs, outputs, latency for all LLM calls

Evaluation: Systematic testing with built-in and custom evaluators

Datasets: Create test sets from production traces or manually

Monitoring: Track metrics, errors, and costs in production

Integrations: Works with OpenAI, Anthropic, LangChain, LlamaIndex

Use alternatives instead:

Weights & Biases: Deep learning experiment tracking, model training

MLflow: General ML lifecycle, model registry focus

Arize/WhyLabs: ML monitoring, data drift detection

Quick start

Installation

pip install langsmith

# Set environment variables

export LANGSMITH_API_KEY="your-api-key"

export LANGSMITH_TRACING=true

Basic tracing with @traceable

from langsmith import traceable

from openai import OpenAI

client = OpenAI()

@traceable

def generate_response(prompt: str) -> str:

    response = client.chat.completions.create(

        model="gpt-4o",

        messages=[{"role": "user", "content": prompt}]

    )

    return response.choices[0].message.content

# Automatically traced to LangSmith

result = generate_response("What is machine learning?")

OpenAI wrapper (automatic tracing)

from langsmith.wrappers import wrap_openai

from openai import OpenAI

# Wrap client for automatic tracing

client = wrap_openai(OpenAI())

# All calls automatically traced

response = client.chat.completions.create(

    model="gpt-4o",

    messages=[{"role": "user", "content": "Hello!"}]

)

Core concepts

Runs and traces

A run is a single unit of execution, such as an LLM call, a chain, or a tool invocation. Runs nest: a parent run and its child runs together form a trace that shows the full execution flow of one request.

from langsmith import traceable

@traceable(run_type="chain")

def process_query(query: str) -> str:

    # Parent run

    context = retrieve_context(query)  # Child run

    response = generate_answer(query, context)  # Child run

    return response

@traceable(run_type="retriever")

def retrieve_context(query: str) -> list:

    return vector_store.search(query)

@traceable(run_type="llm")

def generate_answer(query: str, context: list) -> str:

    return llm.invoke(f"Context: {context}\n\nQuestion: {query}")

Projects

Projects organize related runs. Set via environment or code:

import os

from langsmith import traceable

os.environ["LANGSMITH_PROJECT"] = "my-project"

# Or per-function

@traceable(project_name="my-project")

def my_function():

    pass

Client API

from langsmith import Client

client = Client()

# List runs

runs = list(client.list_runs(

    project_name="my-project",

    filter='eq(status, "success")',

    limit=100

))

# Get run details

run = client.read_run(run_id="...")

# Create feedback

client.create_feedback(

    run_id="...",

    key="correctness",

    score=0.9,

    comment="Good answer"

)
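
Once runs have been listed, latency and token usage can be summarized locally. The sketch below is a minimal illustration that uses `SimpleNamespace` stand-ins instead of real run objects so it works offline; it assumes each run exposes `start_time`, `end_time`, and `total_tokens` attributes, and the helper name `summarize_runs` is hypothetical.

```python
from datetime import datetime, timedelta
from types import SimpleNamespace

def summarize_runs(runs) -> dict:
    """Compute average latency (seconds) and total tokens over finished runs."""
    finished = [r for r in runs if r.end_time is not None]
    latencies = [(r.end_time - r.start_time).total_seconds() for r in finished]
    return {
        "count": len(finished),
        "avg_latency_s": sum(latencies) / len(latencies) if latencies else 0.0,
        "total_tokens": sum(r.total_tokens or 0 for r in finished),
    }

# Stand-ins for run objects (illustrative data only)
start = datetime(2024, 1, 1, 12, 0, 0)
fake_runs = [
    SimpleNamespace(start_time=start, end_time=start + timedelta(seconds=1), total_tokens=100),
    SimpleNamespace(start_time=start, end_time=start + timedelta(seconds=3), total_tokens=300),
    SimpleNamespace(start_time=start, end_time=None, total_tokens=None),  # still running
]
summary = summarize_runs(fake_runs)
print(summary)
```

In real use, the list returned by `client.list_runs(...)` would take the place of `fake_runs`, provided the attribute assumption holds for your SDK version.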

Datasets and evaluation

Create dataset

from langsmith import Client

client = Client()

# Create dataset

dataset = client.create_dataset("qa-test-set", description="QA evaluation")

# Add examples

client.create_examples(

    inputs=[

        {"question": "What is Python?"},

        {"question": "What is ML?"}

    ],

    outputs=[

        {"answer": "A programming language"},

        {"answer": "Machine learning"}

    ],

    dataset_id=dataset.id

)
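
Test cases often start as question/answer pairs rather than the two parallel lists passed to `create_examples` above. A minimal sketch of that conversion, using only the standard library; the helper name `to_example_lists` is an illustrative assumption and not part of the LangSmith SDK.

```python
from typing import Dict, List, Tuple

def to_example_lists(
    pairs: List[Tuple[str, str]],
) -> Tuple[List[Dict[str, str]], List[Dict[str, str]]]:
    """Split (question, answer) pairs into parallel inputs/outputs lists."""
    inputs = [{"question": question} for question, _ in pairs]
    outputs = [{"answer": answer} for _, answer in pairs]
    return inputs, outputs

qa_pairs = [
    ("What is Python?", "A programming language"),
    ("What is ML?", "Machine learning"),
]
inputs, outputs = to_example_lists(qa_pairs)
# inputs and outputs line up index by index, ready for create_examples
```

Keeping the pairs together until the last moment makes it harder for inputs and outputs to drift out of order.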

Run evaluation

from langsmith import evaluate

def my_model(inputs: dict) -> dict:

    # Your model logic

    return {"answer": generate_answer(inputs["question"])}

def correctness_evaluator(run, example):

    prediction = run.outputs["answer"]

    reference = example.outputs["answer"]

    score = 1.0 if reference.lower() in prediction.lower() else 0.0

    return {"key": "correctness", "score": score}

results = evaluate(

    my_model,

    data="qa-test-set",

    evaluators=[correctness_evaluator],

    experiment_prefix="v1"

)

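# Average the per-example scores (requires pandas; to_pandas() names feedback columns "feedback.<key>")

df = results.to_pandas()

print(f"Average score: {df['feedback.correctness'].mean()}")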
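
Because an evaluator is a plain function of `(run, example)`, it can be sanity-checked offline before any API calls are made. A minimal sketch, using `SimpleNamespace` objects as stand-ins for the real run and example objects; the assumption is that the evaluator reads only their `.outputs` attribute.

```python
from types import SimpleNamespace

def correctness_evaluator(run, example):
    """Score 1.0 when the reference answer appears in the prediction."""
    prediction = run.outputs["answer"]
    reference = example.outputs["answer"]
    score = 1.0 if reference.lower() in prediction.lower() else 0.0
    return {"key": "correctness", "score": score}

# Stand-ins for LangSmith's run and example objects (illustrative only)
fake_example = SimpleNamespace(outputs={"answer": "Machine learning"})
good_run = SimpleNamespace(outputs={"answer": "ML stands for machine learning."})
bad_run = SimpleNamespace(outputs={"answer": "I don't know."})

hit = correctness_evaluator(good_run, fake_example)
miss = correctness_evaluator(bad_run, fake_example)
print(hit, miss)
```

A check like this catches key typos and case-sensitivity mistakes before they show up as misleading scores in an experiment.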

Built-in evaluators

from langsmith import evaluate

from langsmith.evaluation import LangChainStringEvaluator

# Use LangChain evaluators

results = evaluate(

    my_model,

    data="qa-test-set",

    evaluators=[

        LangChainStringEvaluator("qa"),

        LangChainStringEvaluator("cot_qa")

    ]

)

Advanced tracing

Tracing context

from langsmith import tracing_context

with tracing_context(

    project_name="experiment-1",

    tags=["production", "v2"],

    metadata={"version": "2.0"}

):

    # All traceable calls inherit context

    result = my_function()

Manual runs

from langsmith import trace

with trace(

    name="custom_operation",

    run_type="tool",

    inputs={"query": "test"}

) as run:

    result = do_something()

    run.end(outputs={"result": result})

Process inputs/outputs

from langsmith import traceable

def sanitize_inputs(inputs: dict) -> dict:

    if "password" in inputs:

        inputs["password"] = "***"

    return inputs

@traceable(process_inputs=sanitize_inputs)

def login(username: str, password: str):

    return authenticate(username, password)
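
The sanitizer above edits the dictionary it receives. A variant that returns a masked copy avoids any question of whether the caller's data is touched. This is a sketch only, and the set of key names in `SENSITIVE_KEYS` is a hypothetical choice.

```python
SENSITIVE_KEYS = {"password", "api_key", "token"}  # hypothetical key names

def sanitize_inputs(inputs: dict) -> dict:
    """Return a copy of inputs with sensitive values masked."""
    return {
        key: "***" if key in SENSITIVE_KEYS else value
        for key, value in inputs.items()
    }

original = {"username": "alice", "password": "hunter2"}
masked = sanitize_inputs(original)
# original is left unchanged; masked is what would be logged
```

The function keeps the same `dict -> dict` shape, so it can be passed to `process_inputs` in the same way as the version above.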

Sampling

import os

os.environ["LANGSMITH_TRACING_SAMPLING_RATE"] = "0.1"  # 10% sampling
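
A sampling rate of 0.1 means roughly one call in ten is recorded. The simulation below illustrates the expected volume; it is a sketch of the idea under the assumption of independent random draws, not a description of how the SDK implements sampling.

```python
import random

def should_trace(rate: float, rng: random.Random) -> bool:
    """Keep a call with probability `rate`."""
    return rng.random() < rate

rng = random.Random(0)  # fixed seed so the simulation is repeatable
total_calls = 10_000
traced = sum(should_trace(0.1, rng) for _ in range(total_calls))
print(f"Traced {traced} of {total_calls} calls")
```

The count lands near 1,000 rather than exactly on it, which is worth remembering when comparing trace counts against request logs.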

LangChain integration

from langchain_openai import ChatOpenAI

from langchain_core.prompts import ChatPromptTemplate

# Tracing enabled automatically with LANGSMITH_TRACING=true

llm = ChatOpenAI(model="gpt-4o")

prompt = ChatPromptTemplate.from_messages([

    ("system", "You are a helpful assistant."),

    ("user", "{input}")

])

chain = prompt | llm

# All chain runs traced automatically

response = chain.invoke({"input": "Hello!"})

Production monitoring

Hub prompts

from langsmith import Client

client = Client()

# Pull prompt from hub

prompt = client.pull_prompt("my-org/qa-prompt")

# Use in application

result = prompt.invoke({"question": "What is AI?"})

Async client

from langsmith import AsyncClient

async def main():

    client = AsyncClient()

    runs = []

    async for run in client.list_runs(project_name="my-project"):

        runs.append(run)

    return runs

Feedback collection

from langsmith import Client

client = Client()

# Collect user feedback

def record_feedback(run_id: str, user_rating: int, comment: str = None):

    client.create_feedback(

        run_id=run_id,

        key="user_rating",

        score=user_rating / 5.0,  # Normalize to 0-1

        comment=comment

    )

# In your application

record_feedback(run_id="...", user_rating=4, comment="Helpful response")
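
The `user_rating / 5.0` normalization above assumes a valid rating on a 1-5 scale. A slightly more defensive sketch follows; the helper name `normalize_rating` and the decision to reject out-of-range values (rather than clamp them) are assumptions.

```python
def normalize_rating(user_rating: int, max_rating: int = 5) -> float:
    """Map a 1..max_rating star rating onto the 0-1 score range."""
    if not 1 <= user_rating <= max_rating:
        raise ValueError(
            f"rating must be between 1 and {max_rating}, got {user_rating}"
        )
    return user_rating / max_rating

score = normalize_rating(4)
print(score)
```

Rejecting bad values early keeps malformed ratings from being stored as scores outside the 0-1 range.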

Testing integration

Pytest integration

from langsmith import test

@test

def test_qa_accuracy():

    result = my_qa_function("What is Python?")

    assert "programming" in result.lower()

Evaluation in CI/CD

from langsmith import evaluate

def run_evaluation():

    results = evaluate(

        my_model,

        data="regression-test-set",

        evaluators=[accuracy_evaluator]

    )

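    # Fail CI if accuracy drops (averages the "feedback.accuracy" column; requires pandas

    # and assumes accuracy_evaluator reports its score under the key "accuracy")

    accuracy = results.to_pandas()["feedback.accuracy"].mean()

    assert accuracy >= 0.9, f"Accuracy {accuracy} below threshold"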
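
The gate itself is a mean over per-example scores compared with a threshold. A standalone sketch of that logic, with hard-coded scores standing in for real evaluation results (an assumption, so that it runs without network access); the helper name `check_threshold` is hypothetical.

```python
from statistics import mean

def check_threshold(scores: list, threshold: float = 0.9) -> float:
    """Return the mean score, raising AssertionError if it is below threshold."""
    average = mean(scores)
    assert average >= threshold, (
        f"Accuracy {average:.2f} below threshold {threshold}"
    )
    return average

# Hypothetical per-example scores from an evaluation run
passing = check_threshold([1.0, 1.0, 1.0, 1.0, 0.0], threshold=0.75)
print(f"Accuracy: {passing:.2f}")
```

Under pytest, a failing assertion here marks the test as failed, which is what stops the pipeline.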

Best practices

Structured naming - Use consistent project/run naming conventions

Add metadata - Include version, environment, user info

Sample in production - Use sampling rate to control volume

Create datasets - Build test sets from interesting production cases

Automate evaluation - Run evaluations in CI/CD pipelines

Monitor costs - Track token usage and latency trends
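
For the metadata practice, it helps to build the dictionary in one place so every trace carries the same keys. A minimal sketch; the environment variable names `APP_VERSION` and `APP_ENV` and the helper `build_trace_metadata` are hypothetical.

```python
import os

def build_trace_metadata(user_id: str) -> dict:
    """Collect version, environment, and user info in one place."""
    return {
        "version": os.environ.get("APP_VERSION", "unknown"),      # hypothetical variable
        "environment": os.environ.get("APP_ENV", "development"),  # hypothetical variable
        "user_id": user_id,
    }

metadata = build_trace_metadata(user_id="user-123")
print(metadata)
```

The resulting dictionary can be passed as the `metadata` argument of `tracing_context`, as in the Tracing context section.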

Common issues

Traces not appearing:

import os

# Ensure tracing is enabled

os.environ["LANGSMITH_TRACING"] = "true"

os.environ["LANGSMITH_API_KEY"] = "your-key"

# Verify connection

from langsmith import Client

client = Client()

print(list(client.list_projects()))  # list_projects() is lazy; list() forces the request

High latency from tracing:

# Enable background batching (default)

from langsmith import Client

client = Client(auto_batch_tracing=True)

# Or use sampling

import os

os.environ["LANGSMITH_TRACING_SAMPLING_RATE"] = "0.1"

Large payloads:

# Hide sensitive/large fields

@traceable(

    process_inputs=lambda x: {k: v for k, v in x.items() if k != "large_field"}

)

def my_function(data):

    pass
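
Dropping a field entirely loses context that can help when debugging. An alternative is to truncate long string values instead. The sketch below assumes string payloads, and the 200-character limit and the helper name `truncate_large_fields` are hypothetical.

```python
MAX_CHARS = 200  # hypothetical limit; tune to your payload sizes

def truncate_large_fields(inputs: dict, max_chars: int = MAX_CHARS) -> dict:
    """Return a copy of inputs where long string values are shortened."""
    trimmed = {}
    for key, value in inputs.items():
        if isinstance(value, str) and len(value) > max_chars:
            omitted = len(value) - max_chars
            trimmed[key] = value[:max_chars] + f"... [{omitted} chars omitted]"
        else:
            trimmed[key] = value
    return trimmed

payload = {"query": "short", "document": "x" * 1000}
trimmed = truncate_large_fields(payload)
print(len(trimmed["document"]))
```

Like the lambda above, this function takes and returns a dictionary, so it fits the same `process_inputs` slot.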

References

Advanced Usage - Custom evaluators, distributed tracing, hub prompts

Troubleshooting - Common issues, debugging, performance

Resources

Documentation: https://docs.smith.langchain.com

Python SDK: https://github.com/langchain-ai/langsmith-sdk

Web App: https://smith.langchain.com

Version: 0.2.0+

License: MIT
