SKILL.md

Senior ML Engineer

Production ML engineering patterns for model deployment, MLOps infrastructure, and LLM integration.

[Model Deployment Workflow](#model-deployment-workflow)

[MLOps Pipeline Setup](#mlops-pipeline-setup)

[LLM Integration Workflow](#llm-integration-workflow)

[RAG System Implementation](#rag-system-implementation)

[Model Monitoring](#model-monitoring)

[Reference Documentation](#reference-documentation)

[Tools](#tools)

Model Deployment Workflow

Deploy a trained model to production with monitoring:

Export model to standardized format (ONNX, TorchScript, SavedModel)

Package model with dependencies in Docker container

Deploy to staging environment

Run integration tests against staging

Deploy canary (5% traffic) to production

Monitor latency and error rates for 1 hour

Promote to full production if metrics pass

Validation: p95 latency < 100ms, error rate < 0.1%
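The promotion decision in the last two steps can be sketched as a small gate. This is a minimal illustration, assuming latency samples and request counts have already been collected from the canary; the function name `canary_passes` and its parameters are hypothetical, and the defaults mirror the validation thresholds above.

```python
from statistics import quantiles


def canary_passes(latencies_ms, error_count, request_count,
                  max_p95_ms=100.0, max_error_rate=0.001):
    """Return True when the canary meets both validation thresholds."""
    if request_count == 0 or len(latencies_ms) < 2:
        return False  # not enough traffic to judge, so do not promote
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile
    p95 = quantiles(latencies_ms, n=100)[94]
    error_rate = error_count / request_count
    return p95 < max_p95_ms and error_rate < max_error_rate


if __name__ == "__main__":
    print(canary_passes([42.0] * 100, error_count=0, request_count=1000))
```

Refusing to promote on too little traffic is a deliberate choice here: an idle canary would otherwise pass trivially.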

Container Template

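```dockerfile
FROM python:3.11-slim

# Run from /app so that "src.server:app" resolves as an importable module path
WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY model/ /app/model/
COPY src/ /app/src/

# python:3.11-slim does not ship curl, so probe the health endpoint with the standard library
HEALTHCHECK CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8080/health')" || exit 1

EXPOSE 8080
CMD ["uvicorn", "src.server:app", "--host", "0.0.0.0", "--port", "8080"]
```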
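The container template expects `src/server.py` to expose an ASGI application named `app` that answers `GET /health`. A minimal sketch of such a module follows, written as a bare ASGI callable so it carries no framework dependency. The response payload is an assumption, and a real service would more likely use FastAPI and add a prediction route.

```python
import json


async def app(scope, receive, send):
    """Bare ASGI application exposing only the health endpoint."""
    if scope["type"] != "http":
        return  # ignore lifespan and websocket scopes in this sketch
    if scope["method"] == "GET" and scope["path"] == "/health":
        status, payload = 200, {"status": "ok"}
    else:
        status, payload = 404, {"detail": "not found"}
    body = json.dumps(payload).encode("utf-8")
    await send({
        "type": "http.response.start",
        "status": status,
        "headers": [
            (b"content-type", b"application/json"),
            (b"content-length", str(len(body)).encode("ascii")),
        ],
    })
    await send({"type": "http.response.body", "body": body})
```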

Serving Options

| Option | Latency | Throughput | Use Case |
|---|---|---|---|
| FastAPI + Uvicorn | Low | Medium | REST APIs, small models |
| Triton Inference Server | Very Low | Very High | GPU inference, batching |
| TensorFlow Serving | Low | High | TensorFlow models |
| TorchServe | Low | High | PyTorch models |
| Ray Serve | Medium | High | Complex pipelines, multi-model |

MLOps Pipeline Setup

Establish automated training and deployment:

Configure feature store (Feast, Tecton) for training data

Set up experiment tracking (MLflow, Weights & Biases)

Create training pipeline with hyperparameter logging

Configure staging deployment triggered by registry events

Set up A/B testing infrastructure for model comparison

Enable drift monitoring with alerting

Validation: New models automatically evaluated against baseline
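The baseline evaluation can be sketched as a comparison of metric dictionaries. This is only an illustration: `should_promote`, the metric names, and the tolerances are hypothetical, and every metric is assumed to be "higher is better".

```python
def should_promote(candidate, baseline, primary_metric="accuracy",
                   min_gain=0.0, max_regression=0.01):
    """Promote only if the candidate matches or beats the baseline on the
    primary metric and does not regress noticeably on any other shared metric."""
    if candidate[primary_metric] - baseline[primary_metric] < min_gain:
        return False
    for name, base_value in baseline.items():
        if name == primary_metric or name not in candidate:
            continue  # only compare secondary metrics that both models report
        if base_value - candidate[name] > max_regression:
            return False
    return True


if __name__ == "__main__":
    baseline = {"accuracy": 0.90, "f1": 0.885}
    print(should_promote({"accuracy": 0.92, "f1": 0.88}, baseline))
```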

Feature Store Pattern

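```python
from datetime import timedelta

# Legacy Feast API (Feature / ValueType); newer Feast releases replace these
# with Field and feast.types, so adjust to the installed version.
from feast import Entity, Feature, FeatureView, FileSource, ValueType

# Entity that the feature rows are keyed on
user = Entity(name="user_id", value_type=ValueType.INT64)

user_features = FeatureView(
    name="user_features",
    entities=["user_id"],
    ttl=timedelta(days=1),  # look back at most one day for feature values
    features=[
        Feature(name="purchase_count_30d", dtype=ValueType.INT64),
        Feature(name="avg_order_value", dtype=ValueType.FLOAT),
    ],
    online=True,  # make the view available for online (low-latency) retrieval
    source=FileSource(path="data/user_features.parquet"),
)
```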

Retraining Triggers

| Trigger | Detection | Action |
|---|---|---|
| Scheduled | Cron (weekly/monthly) | Full retrain |
| Performance drop | Accuracy < threshold | Immediate retrain |
| Data drift | PSI > 0.2 | Evaluate, then retrain |
| New data volume | X new samples | Incremental update |
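The table above can be read as a decision function. The sketch below is one possible encoding. The priority order (performance drop first, then drift, schedule, and data volume) is an assumption, as are the function name, parameter names, and returned labels.

```python
def retraining_action(accuracy, accuracy_threshold, psi,
                      new_samples, new_sample_threshold, scheduled_due=False):
    """Map the monitored signals to one retraining action.

    Checks are ordered by urgency; this ordering is an assumption.
    """
    if accuracy < accuracy_threshold:
        return "immediate_retrain"
    if psi > 0.2:
        return "evaluate_then_retrain"
    if scheduled_due:
        return "full_retrain"
    if new_samples >= new_sample_threshold:
        return "incremental_update"
    return "none"


if __name__ == "__main__":
    print(retraining_action(0.93, 0.90, psi=0.25,
                            new_samples=100, new_sample_threshold=10_000))
```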

LLM Integration Workflow

Integrate LLM APIs into production applications:

Create provider abstraction layer for vendor flexibility

Implement retry logic with exponential backoff

Configure fallback to secondary provider

Set up token counting and context truncation

Add response caching for repeated queries

Implement cost tracking per request

Add structured output validation with Pydantic

Validation: Response parses correctly, cost within budget
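Response caching for repeated queries can be sketched with an in-memory dictionary keyed on a hash of the request. `ResponseCache` and `get_or_call` are hypothetical names; a production setup would more likely use a shared store with expiry, and caching is mainly useful for deterministic settings such as temperature 0.

```python
import hashlib
import json


class ResponseCache:
    """In-memory cache keyed on model, prompt, and generation parameters."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(model, prompt, params):
        # sort_keys makes the key independent of keyword-argument order
        payload = json.dumps(
            {"model": model, "prompt": prompt, "params": params}, sort_keys=True
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_call(self, model, prompt, call, **params):
        """Return the cached response, invoking `call` only on a cache miss."""
        key = self._key(model, prompt, params)
        if key not in self._store:
            self._store[key] = call(prompt, **params)
        return self._store[key]
```

Here `call` stands for any function that takes the prompt plus keyword parameters and returns the completion text.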

Provider Abstraction

```python
from abc import ABC, abstractmethod

from tenacity import retry, stop_after_attempt, wait_exponential


class LLMProvider(ABC):
    """Vendor-neutral interface; each provider wraps its own SDK behind complete()."""

    @abstractmethod
    def complete(self, prompt: str, **kwargs) -> str:
        pass


# Up to 3 attempts, with exponentially growing waits bounded between 1 and 10 seconds
@retry(stop=stop_after_attempt(3), wait=wait_exponential(min=1, max=10))
def call_llm_with_retry(provider: LLMProvider, prompt: str) -> str:
    return provider.complete(prompt)
```
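Fallback to a secondary provider fits the same interface. The sketch below wraps an ordered list of providers and returns the first successful completion. `FallbackProvider` is a hypothetical name, and the `LLMProvider` base class is repeated only so the example runs on its own.

```python
from abc import ABC, abstractmethod


class LLMProvider(ABC):
    """Same interface as the abstraction above, repeated for self-containment."""

    @abstractmethod
    def complete(self, prompt: str, **kwargs) -> str:
        ...


class FallbackProvider(LLMProvider):
    """Try each provider in order and return the first successful completion."""

    def __init__(self, providers):
        if not providers:
            raise ValueError("at least one provider is required")
        self.providers = list(providers)

    def complete(self, prompt: str, **kwargs) -> str:
        last_error = None
        for provider in self.providers:
            try:
                return provider.complete(prompt, **kwargs)
            except Exception as exc:  # broad on purpose: any failure triggers fallback
                last_error = exc
        raise RuntimeError("all providers failed") from last_error
```

Because `FallbackProvider` is itself an `LLMProvider`, it can be passed to a retry wrapper without changes.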

Cost Management

| Model | Input Cost (per 1K tokens) | Output Cost (per 1K tokens) |
|---|---|---|
| GPT-4 | $0.03 | $0.06 |
| GPT-3.5 | $0.0005 | $0.0015 |
| Claude 3 Opus | $0.015 | $0.075 |
| Claude 3 Haiku | $0.00025 | $0.00125 |
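Per-request cost tracking follows directly from the prices above. A minimal sketch, assuming token counts are reported by the provider; the dictionary keys and function names are hypothetical identifiers, and the prices are copied from the table.

```python
# (input, output) price in USD per 1K tokens, taken from the table above
PRICES_PER_1K = {
    "gpt-4": (0.03, 0.06),
    "gpt-3.5": (0.0005, 0.0015),
    "claude-3-opus": (0.015, 0.075),
    "claude-3-haiku": (0.00025, 0.00125),
}


def request_cost(model, input_tokens, output_tokens):
    """Cost in USD of one request, given its token counts."""
    input_price, output_price = PRICES_PER_1K[model]
    return input_tokens / 1000 * input_price + output_tokens / 1000 * output_price


def within_budget(model, input_tokens, output_tokens, budget_usd):
    """True when a single request stays at or below the per-request budget."""
    return request_cost(model, input_tokens, output_tokens) <= budget_usd


if __name__ == "__main__":
    print(f"{request_cost('gpt-4', 1000, 500):.4f}")
```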

RAG System Implementation

Build retrieval-augmented generation pipeline:

Choose vector database (Pinecone, Qdrant, Weaviate)

Select embedding model based on quality/cost tradeoff

Implement document chunking strategy

Create ingestion pipeline with metadata extraction

Build retrieval with query embedding

Add reranking for relevance improvement

Format context and send to LLM

Validation: Response cites the retrieved context and makes no claims that the context does not support
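The retrieval and context-formatting steps can be sketched without a vector database. The example assumes chunks have already been embedded by some model and stored as `(text, vector)` pairs; `retrieve` and `format_context` are hypothetical names, and a real system would delegate the similarity search to the chosen vector store.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vector, indexed_chunks, top_k=3):
    """Return the texts of the top_k chunks most similar to the query vector."""
    scored = [(cosine_similarity(query_vector, vec), text)
              for text, vec in indexed_chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_k]]


def format_context(chunks):
    """Number the chunks so the LLM response can cite them."""
    return "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
```

Numbering the chunks in the prompt makes the validation step easier, since a response can be checked for citations such as `[1]`.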

Vector Database Selection

| Database | Hosting | Scale | Latency | Best For |
|---|---|---|---|---|
| Pinecone | Managed | High | Low | Production, managed |
| Qdrant | Both | High | Very Low | Performance-critical |
| Weaviate | Both | High | Low | Hybrid search |
| Chroma | Self-hosted | Medium | Low | Prototyping |
| pgvector | Self-hosted | Medium | | Existing Postgres |

Chunking Strategies

| Strategy | Chunk Size | Overlap | Best For |
|---|---|---|---|
| Fixed | 500-1000 tokens | 50-100 tokens | General text |
| Sentence | 3-5 sentences | 1 sentence | Structured text |
| Semantic | Variable | Based on meaning | Research papers |
| Recursive | Hierarchical | Parent-child | Long documents |
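The fixed strategy in the first row can be sketched as a sliding window. As a simplifying assumption, the example counts whitespace-separated words rather than model tokens; `chunk_fixed` is a hypothetical name, and a tokenizer such as tiktoken would replace `text.split()` in practice.

```python
def chunk_fixed(text, chunk_size=500, overlap=50):
    """Split text into windows of chunk_size words that share `overlap` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= overlap < chunk_size")
    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


if __name__ == "__main__":
    sample = " ".join(f"w{i}" for i in range(10))
    for chunk in chunk_fixed(sample, chunk_size=4, overlap=1):
        print(chunk)
```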

Model Monitoring

Monitor production models for drift and degradation:

Set up latency tracking (p50, p95, p99)

Configure error rate alerting

Implement input data drift detection

Track prediction distribution shifts

Log ground truth when available

Compare model versions with A/B metrics

Set up automated retraining triggers

Validation: Alerts fire before user-visible degradation

Drift Detection

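```python
from scipy.stats import ks_2samp


def detect_drift(reference, current, threshold=0.05):
    """Two-sample Kolmogorov-Smirnov test on one numeric feature.

    Flags drift when the p-value falls below `threshold` (the significance level).
    """
    statistic, p_value = ks_2samp(reference, current)
    return {
        # Cast NumPy scalars to built-in types so the result is JSON-serializable
        "drift_detected": bool(p_value < threshold),
        "ks_statistic": float(statistic),
        "p_value": float(p_value),
    }
```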

Alert Thresholds

| Metric | Warning | Critical |
|---|---|---|
| p95 latency | 100ms | 200ms |
| Error rate | 0.1% | 1% |
| PSI (drift) | 0.1 | 0.2 |
| Accuracy drop | 2% | 5% |
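The PSI row refers to the population stability index, which sums (current share - reference share) × ln(current share / reference share) over the bins of a feature's distribution. A minimal sketch follows, assuming equal-width bins over the reference range and a small floor on empty bins; the function names, the bin count, and the floor value are assumptions.

```python
import math


def population_stability_index(reference, current, bins=10, eps=1e-6):
    """PSI between two samples of one numeric feature, using equal-width bins."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant features

    def proportions(values):
        counts = [0] * bins
        for value in values:
            index = int((value - lo) / width)
            counts[min(max(index, 0), bins - 1)] += 1  # clip out-of-range values
        # Floor empty bins at eps so the logarithm stays defined
        return [max(count / len(values), eps) for count in counts]

    ref_p, cur_p = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


def drift_severity(psi, warning=0.1, critical=0.2):
    """Classify a PSI value against the thresholds in the table above."""
    if psi > critical:
        return "critical"
    if psi > warning:
        return "warning"
    return "ok"
```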

Reference Documentation

MLOps Production Patterns

references/mlops_production_patterns.md contains:

Model deployment pipeline with Kubernetes manifests

Feature store architecture with Feast examples

Model monitoring with drift detection code

A/B testing infrastructure with traffic splitting

Automated retraining pipeline with MLflow

LLM Integration Guide

references/llm_integration_guide.md contains:

Provider abstraction layer pattern

Retry and fallback strategies with tenacity

Prompt engineering templates (few-shot, CoT)

Token optimization with tiktoken

Cost calculation and tracking

RAG System Architecture

references/rag_system_architecture.md contains:

RAG pipeline implementation with code

Vector database comparison and integration

Chunking strategies (fixed, semantic, recursive)

Embedding model selection guide

Hybrid search and reranking patterns

Tools

Model Deployment Pipeline

python scripts/model_deployment_pipeline.py --model model.pkl --target staging

Generates deployment artifacts: Dockerfile, Kubernetes manifests, health checks.

RAG System Builder

python scripts/rag_system_builder.py --config rag_config.yaml --analyze

Scaffolds RAG pipeline with vector store integration and retrieval logic.

ML Monitoring Suite

python scripts/ml_monitoring_suite.py --config monitoring.yaml --deploy

Sets up drift detection, alerting, and performance dashboards.
