serving-llms-vllm

Serves LLMs with high throughput using vLLM's PagedAttention and continuous batching. Use when deploying production LLM APIs, optimizing inference…

INSTALLATION
npx skills add https://github.com/davila7/claude-code-templates --skill serving-llms-vllm
Run in your project or agent environment. Adjust flags if your CLI version differs.

SKILL.md

    # `llm` is a vllm.LLM instance and `sampling` is a vllm.SamplingParams object
    outputs = llm.generate(["Explain quantum computing"], sampling)
    print(outputs[0].outputs[0].text)

**OpenAI-compatible server**:

    vllm serve meta-llama/Llama-3-8B-Instruct

    # Query with OpenAI SDK
    python -c "
    from openai import OpenAI
    client = OpenAI(base_url='http://localhost:8000/v1', api_key='EMPTY')
    print(client.chat.completions.create(
        model='meta-llama/Llama-3-8B-Instruct',
        messages=[{'role': 'user', 'content': 'Hello!'}]
    ).choices[0].message.content)
    "


## Common workflows

### Workflow 1: Production API deployment

Copy this checklist and track progress:

Deployment Progress:

- [ ] Step 1: Configure server settings
- [ ] Step 2: Test with limited traffic
- [ ] Step 3: Enable monitoring
- [ ] Step 4: Deploy to production
- [ ] Step 5: Verify performance metrics

**Step 1: Configure server settings**

Choose configuration based on your model size:

    # For 7B-13B models on single GPU
    vllm serve meta-llama/Llama-3-8B-Instruct \
      --gpu-memory-utilization 0.9 \
      --max-model-len 8192 \
      --port 8000

    # For 30B-70B models with tensor parallelism
    # (add --quantization awq only when serving an AWQ-quantized checkpoint)
    vllm serve meta-llama/Llama-2-70b-hf \
      --tensor-parallel-size 4 \
      --gpu-memory-utilization 0.9 \
      --port 8000

    # For production with prefix caching
    # (Prometheus metrics are served at /metrics on the API port; no extra flag is needed)
    vllm serve meta-llama/Llama-3-8B-Instruct \
      --gpu-memory-utilization 0.9 \
      --enable-prefix-caching \
      --port 8000 \
      --host 0.0.0.0
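
When the same settings are launched from a deployment script, it can help to assemble the command in one place. The sketch below is only an illustration: `build_serve_command` and its parameter names are hypothetical, and it emits only flags that already appear in the commands above.

```python
import shlex


def build_serve_command(model, tensor_parallel_size=1,
                        gpu_memory_utilization=0.9, max_model_len=None,
                        enable_prefix_caching=False, host=None, port=8000):
    """Return the `vllm serve` argument list for the given settings."""
    command = ["vllm", "serve", model,
               "--gpu-memory-utilization", str(gpu_memory_utilization),
               "--port", str(port)]
    if tensor_parallel_size > 1:
        command += ["--tensor-parallel-size", str(tensor_parallel_size)]
    if max_model_len is not None:
        command += ["--max-model-len", str(max_model_len)]
    if enable_prefix_caching:
        command.append("--enable-prefix-caching")
    if host is not None:
        command += ["--host", host]
    return command


# Single-GPU configuration from the first example above.
single_gpu = build_serve_command("meta-llama/Llama-3-8B-Instruct", max_model_len=8192)
print(shlex.join(single_gpu))
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting issues.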


**Step 2: Test with limited traffic**

Run a load test before going to production:

    # Install load testing tool
    pip install locust

    # Create test_load.py with sample requests
    # Run: locust -f test_load.py --host http://localhost:8000

Verify that TTFT (time to first token) stays below 500 ms and throughput exceeds 100 req/sec.
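
Once per-request TTFT values have been collected from the load test, the two targets can be checked mechanically. This is a minimal sketch: the function names are hypothetical, and using the 95th percentile as the TTFT statistic is an assumption, since the targets above do not say which statistic to use.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct in 0-100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_targets(ttft_seconds, wall_seconds, ttft_limit=0.5, min_rps=100.0):
    """Check p95 TTFT and request throughput against the targets."""
    p95_ttft = percentile(ttft_seconds, 95)
    throughput = len(ttft_seconds) / wall_seconds
    return {
        "p95_ttft_s": p95_ttft,
        "throughput_rps": throughput,
        "ok": p95_ttft < ttft_limit and throughput > min_rps,
    }


# Hypothetical measurements: 400 requests completed in 2 seconds.
report = meets_targets([0.10, 0.20, 0.30, 0.40] * 100, wall_seconds=2.0)
```

The default limits mirror the 500 ms and 100 req/sec targets; pass different values for workloads with long prompts.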

**Step 3: Enable monitoring**

The vLLM OpenAI-compatible server exposes Prometheus metrics at `/metrics` on the same port as the API (8000 in the examples above):

    curl http://localhost:8000/metrics | grep vllm


Key metrics to monitor:

- `vllm:time_to_first_token_seconds` - Latency

- `vllm:num_requests_running` - Active requests

- `vllm:gpu_cache_usage_perc` - KV cache utilization
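
For a quick check without a full Prometheus setup, the scrape output can be parsed directly. The sketch below assumes the plain Prometheus text format without trailing timestamps; `parse_vllm_metrics` is a hypothetical helper and the sample text is made up for illustration.

```python
def parse_vllm_metrics(text):
    """Extract `vllm:` samples from Prometheus text-format output.

    Returns {metric_name: [values]}; labels are dropped, so a metric
    reported for several label sets yields several values.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        if name.startswith("vllm:"):
            samples.setdefault(name, []).append(float(value))
    return samples


# Hypothetical scrape output, shaped like the metrics listed above.
SAMPLE = """\
# HELP vllm:num_requests_running Number of requests currently running.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="demo"} 3.0
vllm:gpu_cache_usage_perc{model_name="demo"} 0.42
process_cpu_seconds_total 12.5
"""

metrics = parse_vllm_metrics(SAMPLE)
```

In practice the input would be the body returned by the server's metrics endpoint; metric names can differ between vLLM versions, so treat the names here as examples.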

**Step 4: Deploy to production**

Use Docker for consistent deployment:

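    # Run vLLM in Docker
    # --ipc=host gives the container access to host shared memory
    # Gated models also need a Hugging Face token passed into the container
    docker run --gpus all -p 8000:8000 \
      --ipc=host \
      vllm/vllm-openai:latest \
      --model meta-llama/Llama-3-8B-Instruct \
      --gpu-memory-utilization 0.9 \
      --enable-prefix-caching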


**Step 5: Verify performance metrics**

Check that the deployment meets its targets:

- TTFT < 500ms (for short prompts)

- Throughput > target req/sec

- GPU utilization > 80%

- No OOM errors in logs
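
The GPU utilization target can be checked from a script by querying `nvidia-smi`. This is a sketch under assumptions: the helper names are hypothetical, the sample output is invented, and GPUs that report `[N/A]` are not handled.

```python
import subprocess


def parse_gpu_utilization(csv_output):
    """Parse one utilization percentage per line into a list of ints."""
    return [int(line.strip()) for line in csv_output.splitlines() if line.strip()]


def query_gpu_utilization():
    """Ask nvidia-smi for per-GPU utilization (requires an NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_utilization(result.stdout)


def underutilized(utilizations, threshold=80):
    """Return indices of GPUs at or below the threshold percentage."""
    return [i for i, pct in enumerate(utilizations) if pct <= threshold]


# Hypothetical nvidia-smi output for a 4-GPU host.
sample = "93\n88\n41\n90\n"
busy = parse_gpu_utilization(sample)
idle = underutilized(busy)
```

A single reading is noisy, so sampling `query_gpu_utilization()` several times under load gives a more reliable picture.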

### Workflow 2: Offline batch inference

For processing large datasets without server overhead.

Copy this checklist:

Batch Processing:

- [ ] Step 1: Prepare input data
- [ ] Step 2: Configure LLM engine
- [ ] Step 3: Run batch inference
- [ ] Step 4: Process results

**Step 1: Prepare input data**

    # Load prompts from file, one per line, skipping blank lines
    with open("prompts.txt") as f:
        prompts = [line.strip() for line in f if line.strip()]

    print(f"Loaded {len(prompts)} prompts")


**Step 2: Configure LLM engine**

    from vllm import LLM, SamplingParams

    llm = LLM(
        model="meta-llama/Llama-3-8B-Instruct",
        tensor_parallel_size=2,  # Use 2 GPUs
        gpu_memory_utilization=0.9,
        max_model_len=4096
    )

    sampling = SamplingParams(
        temperature=0.7,
        top_p=0.95,
        max_tokens=512,
        stop=["</s>", "\n\n"]
    )


**Step 3: Run batch inference**

vLLM automatically batches requests for efficiency:

    # Process all prompts in one call
    outputs = llm.generate(prompts, sampling)

    # vLLM handles batching internally
    # No need to manually chunk prompts


**Step 4: Process results**

    import json

    # Extract generated text
    results = []
    for output in outputs:
        prompt = output.prompt
        generated = output.outputs[0].text
        results.append({
            "prompt": prompt,
            "generated": generated,
            "tokens": len(output.outputs[0].token_ids)
        })

    # Save to file
    with open("results.jsonl", "w") as f:
        for result in results:
            f.write(json.dumps(result) + "\n")

    print(f"Processed {len(results)} prompts")
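
After a batch run, a quick summary of the saved records helps confirm that every prompt produced output. The sketch below assumes records with the `prompt`, `generated`, and `tokens` fields written above; `summarize_results` is a hypothetical helper and the sample records are invented.

```python
import json


def summarize_results(jsonl_lines):
    """Count records and generated tokens in results.jsonl-style lines."""
    records = [json.loads(line) for line in jsonl_lines if line.strip()]
    total_tokens = sum(record["tokens"] for record in records)
    return {
        "prompts": len(records),
        "total_tokens": total_tokens,
        "mean_tokens": total_tokens / len(records) if records else 0.0,
    }


# Hypothetical records in the format written above.
sample_lines = [
    '{"prompt": "a", "generated": "x y", "tokens": 2}',
    '{"prompt": "b", "generated": "x y z w", "tokens": 4}',
]
summary = summarize_results(sample_lines)
```

Because an open file iterates over its lines, the function also accepts the file object from `open("results.jsonl")` directly.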


### Workflow 3: Quantized model serving

Fit large models in limited GPU memory.

Quantization Setup:

- [ ] Step 1: Choose quantization method
- [ ] Step 2: Find or create quantized model
- [ ] Step 3: Launch with quantization flag
- [ ] Step 4: Verify accuracy

**Step 1: Choose quantization method**

- **AWQ**: Best for 70B models, minimal accuracy loss

- **GPTQ**: Wide model support, good compression

- **FP8**: Fastest on H100 GPUs

**Step 2: Find or create quantized model**

Use pre-quantized models from HuggingFace:

    # Search for AWQ models
    # Example: TheBloke/Llama-2-70B-AWQ


**Step 3: Launch with quantization flag**

    # Using pre-quantized model
    vllm serve TheBloke/Llama-2-70B-AWQ \
      --quantization awq \
      --tensor-parallel-size 1 \
      --gpu-memory-utilization 0.95

    # Results: 70B model in ~40GB VRAM
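
The ~40GB figure can be sanity-checked with back-of-the-envelope arithmetic on weight storage. This is a rough sketch: `weight_memory_gb` is a hypothetical helper, it assumes 4-bit weights for AWQ, and it ignores quantization scales, activations, and the KV cache, which account for the gap between the estimate and the observed usage.

```python
def weight_memory_gb(num_params_billion, bits_per_weight):
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    Ignores quantization scales/zero points, activations, and the KV cache.
    """
    total_bits = num_params_billion * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9


fp16_gb = weight_memory_gb(70, 16)  # 16-bit baseline
awq_gb = weight_memory_gb(70, 4)    # 4-bit weights, assumed for AWQ
print(f"FP16: {fp16_gb:.0f} GB, 4-bit: {awq_gb:.0f} GB")
```

The same function gives a first estimate for other model sizes before choosing between quantization and tensor parallelism.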


**Step 4: Verify accuracy**

Check that outputs match the expected quality:

- Compare quantized vs non-quantized responses

- Verify task-specific performance is unchanged
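
One simple way to compare the two sets of responses is a text-similarity score over paired outputs. The sketch below is an assumption rather than a prescribed method: it uses `difflib` character-level similarity as a stand-in for a real task metric, and the sample responses are invented.

```python
from difflib import SequenceMatcher


def mean_similarity(reference_outputs, quantized_outputs):
    """Average character-level similarity (0.0-1.0) across paired outputs."""
    if not reference_outputs or len(reference_outputs) != len(quantized_outputs):
        raise ValueError("both lists must be non-empty and cover the same prompts")
    ratios = [
        SequenceMatcher(None, ref, quant).ratio()
        for ref, quant in zip(reference_outputs, quantized_outputs)
    ]
    return sum(ratios) / len(ratios)


# Hypothetical responses to the same two prompts.
reference = ["Paris is the capital of France.", "2 + 2 = 4"]
quantized = ["Paris is the capital of France.", "2 + 2 = 4"]
score = mean_similarity(reference, quantized)
```

Generate both sets with greedy decoding (temperature 0) so that differences reflect quantization rather than sampling noise, and prefer a task-specific benchmark when one exists.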


## When to use vs alternatives

**Use vLLM when:**

- Deploying production LLM APIs (100+ req/sec)

- Serving OpenAI-compatible endpoints

- Limited GPU memory but need large models

- Multi-user applications (chatbots, assistants)

- Need low latency with high throughput

**Use alternatives instead:**

- **llama.cpp**: CPU/edge inference, single-user

- **HuggingFace transformers**: Research, prototyping, one-off generation

- **TensorRT-LLM**: NVIDIA-only, need absolute maximum performance

- **Text-Generation-Inference**: Already in HuggingFace ecosystem

## Common issues

**Issue: Out of memory during model loading**

Reduce memory usage:

    vllm serve MODEL \
      --gpu-memory-utilization 0.7 \
      --max-model-len 4096

Or use quantization (MODEL must then be a checkpoint quantized with the matching method):

    vllm serve MODEL --quantization awq
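
One reason a smaller `--max-model-len` helps is that KV cache size grows linearly with sequence length. The sketch below estimates the cache needed for a single full-length sequence; the model shape (32 layers, 8 KV heads, head dimension 128, FP16) is an assumed Llama-3-8B-style configuration, and vLLM's actual block-based allocation will differ in detail.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache for one token: a key and a value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Assumed Llama-3-8B-style shape: 32 layers, 8 KV heads, head dim 128, FP16.
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)

for max_model_len in (8192, 4096):
    gib = per_token * max_model_len / 2**30
    print(f"max_model_len={max_model_len}: {gib:.2f} GiB per full-length sequence")
```

Halving the maximum length halves the cache that one worst-case sequence can claim, which leaves more headroom for weights and concurrent requests.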


**Issue: Slow first token (TTFT > 1 second)**

Enable prefix caching for repeated prompts:

    vllm serve MODEL --enable-prefix-caching

For long prompts, enable chunked prefill:

    vllm serve MODEL --enable-chunked-prefill


**Issue: Model not found error**

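If the model repository ships custom modeling code, pass `--trust-remote-code`. Also confirm that the model ID is spelled correctly and, for gated repositories, that you are authenticated with Hugging Face:

    vllm serve MODEL --trust-remote-code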


**Issue: Low throughput (<50 req/sec)**

Increase concurrent sequences:

    vllm serve MODEL --max-num-seqs 512

Check GPU utilization with `nvidia-smi`; it should be above 80%.

**Issue: Inference slower than expected**

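Verify that the tensor-parallel size evenly divides the model's number of attention heads; for most models this means a power of 2 (1, 2, 4, or 8 GPUs):

    vllm serve MODEL --tensor-parallel-size 4 # Not 3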
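
The divisibility requirement can be checked before launching. This is a minimal sketch: `valid_tensor_parallel_sizes` is a hypothetical helper, the head count of 32 is an assumption (read the real value from the model's config), and vLLM may apply further constraints beyond this check.

```python
def valid_tensor_parallel_sizes(num_attention_heads, max_gpus):
    """List GPU counts up to max_gpus that evenly divide the head count."""
    return [
        size for size in range(1, max_gpus + 1)
        if num_attention_heads % size == 0
    ]


# Assumed head count of 32 (typical for 7B-8B Llama-style models).
sizes = valid_tensor_parallel_sizes(num_attention_heads=32, max_gpus=8)
print(sizes)
```

With 32 heads the valid sizes are powers of 2, which is why a 3-GPU layout is rejected.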


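Enable speculative decoding for faster generation. The flag syntax has changed across vLLM releases, so confirm it with `vllm serve --help` for your installed version:

    vllm serve MODEL --speculative-model DRAFT_MODEL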
