SKILL.md


    outputs = llm.generate(["Explain quantum computing"], sampling)
    print(outputs[0].outputs[0].text)

**OpenAI-compatible server**:

    vllm serve meta-llama/Llama-3-8B-Instruct

    # Query with OpenAI SDK
    python -c "
    from openai import OpenAI
    client = OpenAI(base_url='http://localhost:8000/v1', api_key='EMPTY')
    print(client.chat.completions.create(
        model='meta-llama/Llama-3-8B-Instruct',
        messages=[{'role': 'user', 'content': 'Hello!'}]
    ).choices[0].message.content)
    "


## Common workflows

### Workflow 1: Production API deployment

Copy this checklist and track progress:

Deployment Progress:

- [ ] Step 1: Configure server settings
- [ ] Step 2: Test with limited traffic
- [ ] Step 3: Enable monitoring
- [ ] Step 4: Deploy to production
- [ ] Step 5: Verify performance metrics


**Step 1: Configure server settings**

Choose configuration based on your model size:

    # For 7B-13B models on single GPU
    vllm serve meta-llama/Llama-3-8B-Instruct \
      --gpu-memory-utilization 0.9 \
      --max-model-len 8192 \
      --port 8000

    # For 30B-70B models with tensor parallelism
    # (--quantization awq only works with an AWQ-quantized checkpoint; see Workflow 3)
    vllm serve meta-llama/Llama-2-70b-hf \
      --tensor-parallel-size 4 \
      --gpu-memory-utilization 0.9 \
      --port 8000

    # For production with caching and metrics
    # (Prometheus metrics are served at /metrics on the API port; no extra flag is needed)
    vllm serve meta-llama/Llama-3-8B-Instruct \
      --gpu-memory-utilization 0.9 \
      --enable-prefix-caching \
      --port 8000 \
      --host 0.0.0.0


**Step 2: Test with limited traffic**

Run load test before production:

    # Install load testing tool
    pip install locust

    # Create test_load.py with sample requests
    # Run: locust -f test_load.py --host http://localhost:8000


Verify TTFT (time to first token) < 500ms and throughput > 100 req/sec.
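
One way to check those two targets from raw load-test measurements is sketched below. It assumes you have collected per-request TTFT samples (in seconds) and the wall-clock duration of the run. The helper names `percentile` and `summarize_load_test` are illustrative and not part of Locust or vLLM, and judging TTFT at the 95th percentile is an assumption, since the target above does not name a percentile.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list; pct is an integer 1-100."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize_load_test(ttft_seconds, duration_seconds,
                        ttft_target=0.5, throughput_target=100.0):
    """Compare p95 TTFT and request throughput against the stated targets."""
    p95 = percentile(ttft_seconds, 95)
    throughput = len(ttft_seconds) / duration_seconds
    return {
        "p95_ttft_s": p95,
        "throughput_rps": throughput,
        "ttft_ok": p95 < ttft_target,                   # TTFT < 500 ms
        "throughput_ok": throughput > throughput_target,  # > 100 req/sec
    }
```

Calling `summarize_load_test(samples, duration_seconds=60)` on the samples exported from a run gives a single pass/fail view per target.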

**Step 3: Enable monitoring**

vLLM's OpenAI-compatible server exposes Prometheus metrics at `/metrics` on the API port (8000 in the examples above):

    curl http://localhost:8000/metrics | grep vllm


Key metrics to monitor:

- `vllm:time_to_first_token_seconds` - Latency

- `vllm:num_requests_running` - Active requests

- `vllm:gpu_cache_usage_perc` - KV cache utilization
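
To turn the scraped text into numbers, a small parser for the Prometheus text format is usually enough. The sketch below is a simplified reader, not a vLLM or Prometheus client API: `parse_vllm_metrics` is a hypothetical helper, it assumes label values contain no `}` characters, and it sums samples that share a metric name. Histogram metrics such as TTFT are exposed as `_sum` and `_count` series, so the mean is their ratio.

```python
def parse_vllm_metrics(text, names):
    """Sum sample values per metric name from Prometheus text exposition output."""
    totals = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # Metric name ends at the first "{" (labels) or space (no labels).
        cut = [i for i in (line.find("{"), line.find(" ")) if i != -1]
        name = line[:min(cut)] if cut else line
        if name not in names:
            continue
        # The value is the first token after the label block, if any.
        rest = line[line.rfind("}") + 1:] if "{" in line else line[len(name):]
        totals[name] = totals.get(name, 0.0) + float(rest.split()[0])
    return totals


def mean_ttft_seconds(totals):
    """Mean TTFT from the histogram's _sum and _count series."""
    count = totals.get("vllm:time_to_first_token_seconds_count", 0.0)
    if count == 0:
        return None
    return totals["vllm:time_to_first_token_seconds_sum"] / count
```

Feed it the body returned by the `curl` command, for example via `urllib.request.urlopen(...).read().decode()`.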

**Step 4: Deploy to production**

Use Docker for consistent deployment:

    # Run vLLM in Docker
    docker run --gpus all -p 8000:8000 \
      vllm/vllm-openai:latest \
      --model meta-llama/Llama-3-8B-Instruct \
      --gpu-memory-utilization 0.9 \
      --enable-prefix-caching


**Step 5: Verify performance metrics**

Check that deployment meets targets:

- TTFT < 500ms (for short prompts)

- Throughput > target req/sec

- GPU utilization > 80%

- No OOM errors in logs

### Workflow 2: Offline batch inference

For processing large datasets without server overhead.

Copy this checklist:

Batch Processing:

- [ ] Step 1: Prepare input data
- [ ] Step 2: Configure LLM engine
- [ ] Step 3: Run batch inference
- [ ] Step 4: Process results


**Step 1: Prepare input data**

    # Load prompts from file (one prompt per line, blank lines skipped)
    with open("prompts.txt") as f:
        prompts = [line.strip() for line in f if line.strip()]

    print(f"Loaded {len(prompts)} prompts")


**Step 2: Configure LLM engine**

    from vllm import LLM, SamplingParams

    llm = LLM(
        model="meta-llama/Llama-3-8B-Instruct",
        tensor_parallel_size=2,  # Use 2 GPUs
        gpu_memory_utilization=0.9,
        max_model_len=4096
    )

    sampling = SamplingParams(
        temperature=0.7,
        top_p=0.95,
        max_tokens=512,
        stop=["</s>", "\n\n"]
    )


**Step 3: Run batch inference**

vLLM automatically batches requests for efficiency:

    # Process all prompts in one call
    outputs = llm.generate(prompts, sampling)

    # vLLM handles batching internally
    # No need to manually chunk prompts


**Step 4: Process results**

    # Extract generated text
    results = []
    for output in outputs:
        prompt = output.prompt
        generated = output.outputs[0].text
        results.append({
            "prompt": prompt,
            "generated": generated,
            "tokens": len(output.outputs[0].token_ids)
        })

    # Save to file
    import json
    with open("results.jsonl", "w") as f:
        for result in results:
            f.write(json.dumps(result) + "\n")

    print(f"Processed {len(results)} prompts")
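
After a large run it helps to sanity-check the saved file before using it downstream. The sketch below reads records in the `results.jsonl` layout written above (`prompt`, `generated`, `tokens`) and reports token totals and empty generations. `summarize_results` is a hypothetical helper, not part of vLLM.

```python
import json


def summarize_results(lines):
    """Summarize JSONL records with 'generated' and 'tokens' fields."""
    count = 0
    total_tokens = 0
    empty_outputs = 0
    for line in lines:
        if not line.strip():
            continue  # tolerate trailing blank lines
        record = json.loads(line)
        count += 1
        total_tokens += record["tokens"]
        if not record["generated"].strip():
            empty_outputs += 1  # likely hit a stop string immediately
    return {
        "count": count,
        "total_tokens": total_tokens,
        "mean_tokens": total_tokens / count if count else 0.0,
        "empty_outputs": empty_outputs,
    }
```

Typical usage is `with open("results.jsonl") as f: print(summarize_results(f))`. A high `empty_outputs` count often points at an overly aggressive `stop` list.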


### Workflow 3: Quantized model serving

Fit large models in limited GPU memory.

Quantization Setup:

- [ ] Step 1: Choose quantization method
- [ ] Step 2: Find or create quantized model
- [ ] Step 3: Launch with quantization flag
- [ ] Step 4: Verify accuracy


**Step 1: Choose quantization method**

- **AWQ**: Best for 70B models, minimal accuracy loss

- **GPTQ**: Wide model support, good compression

- **FP8**: Fastest on H100 GPUs

**Step 2: Find or create quantized model**

Use pre-quantized models from HuggingFace:

    # Search for AWQ models
    # Example: TheBloke/Llama-2-70B-AWQ


**Step 3: Launch with quantization flag**

    # Using pre-quantized model
    vllm serve TheBloke/Llama-2-70B-AWQ \
      --quantization awq \
      --tensor-parallel-size 1 \
      --gpu-memory-utilization 0.95

    # Results: 70B model in ~40GB VRAM
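
As a rough sanity check on the ~40GB figure, weight memory scales with parameter count times bits per weight. The sketch below is back-of-the-envelope arithmetic, not a vLLM API. `weight_memory_gb` is a hypothetical helper that counts weights only, ignoring quantization scales, activations, and the KV cache, which is why real usage lands above the 4-bit estimate.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate memory for model weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


PARAMS_70B = 70_000_000_000

fp16_gb = weight_memory_gb(PARAMS_70B, 16)  # unquantized half precision
int8_gb = weight_memory_gb(PARAMS_70B, 8)   # 8-bit formats such as FP8
int4_gb = weight_memory_gb(PARAMS_70B, 4)   # 4-bit formats such as AWQ or GPTQ

print(f"FP16: {fp16_gb:.0f} GB, 8-bit: {int8_gb:.0f} GB, 4-bit: {int4_gb:.0f} GB")
```

Under these assumptions a 70B model needs about 140 GB of weights in FP16 and about 35 GB at 4 bits, which is consistent with the figure quoted above once overhead is added.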


**Step 4: Verify accuracy**

Check that outputs match the expected quality:

- Compare quantized vs non-quantized responses
- Verify that task-specific performance is unchanged
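
A first-pass comparison can be automated by generating responses for the same prompts from both models and flagging pairs that diverge sharply. The sketch below uses character-level similarity from the standard library as a crude proxy. The helper names and the 0.8 threshold are assumptions for illustration, and it does not replace a task-specific evaluation. It works best when both runs use greedy decoding (temperature 0) so that sampling noise does not dominate.

```python
from difflib import SequenceMatcher


def response_similarity(reference, candidate):
    """Character-level similarity ratio in [0, 1] between two responses."""
    return SequenceMatcher(None, reference, candidate).ratio()


def flag_divergent(pairs, threshold=0.8):
    """Indices of (reference, candidate) pairs whose similarity is below threshold."""
    return [
        i for i, (reference, candidate) in enumerate(pairs)
        if response_similarity(reference, candidate) < threshold
    ]
```

Here `pairs` would hold `(full_precision_text, quantized_text)` tuples. Flagged indices are the ones worth reading by hand.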


## When to use vs alternatives

**Use vLLM when:**

- Deploying production LLM APIs (100+ req/sec)

- Serving OpenAI-compatible endpoints

- Limited GPU memory but need large models

- Multi-user applications (chatbots, assistants)

- Need low latency with high throughput

**Use alternatives instead:**

- **llama.cpp**: CPU/edge inference, single-user

- **HuggingFace transformers**: Research, prototyping, one-off generation

- **TensorRT-LLM**: NVIDIA-only, need absolute maximum performance

- **Text-Generation-Inference**: Already in HuggingFace ecosystem

## Common issues

**Issue: Out of memory during model loading**

Reduce memory usage:

    vllm serve MODEL \
      --gpu-memory-utilization 0.7 \
      --max-model-len 4096


Or use quantization:

    vllm serve MODEL --quantization awq
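
To see why a smaller `--max-model-len` helps, it can be useful to estimate how much KV cache one full-length sequence needs. The sketch below uses the standard per-token formula (keys plus values, per layer, per KV head). The helper names are hypothetical, and the shape values in the usage note (32 layers, 8 KV heads, head dimension 128, as published for Llama 3 8B) are assumptions to check against your model's `config.json`.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache per token: one key and one value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_cache_gib(max_model_len, num_seqs, num_layers, num_kv_heads, head_dim,
                 dtype_bytes=2):
    """GiB of KV cache needed to hold num_seqs sequences of max_model_len tokens."""
    per_token = kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim,
                                         dtype_bytes)
    return per_token * max_model_len * num_seqs / 2**30
```

With the assumed Llama 3 8B shape in 16-bit precision, `kv_cache_gib(8192, 1, 32, 8, 128)` gives 1 GiB per full-length sequence, and halving the context to 4096 halves that.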


**Issue: Slow first token (TTFT > 1 second)**

Enable prefix caching for repeated prompts:

    vllm serve MODEL --enable-prefix-caching


For long prompts, enable chunked prefill:

    vllm serve MODEL --enable-chunked-prefill


**Issue: Model not found error**

Use `--trust-remote-code` for custom models:

    vllm serve MODEL --trust-remote-code


**Issue: Low throughput (<50 req/sec)**

Increase concurrent sequences:

    vllm serve MODEL --max-num-seqs 512


Check GPU utilization with `nvidia-smi`; it should be above 80%.
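
The utilization check can be scripted with `nvidia-smi`'s CSV query mode. The sketch below is a minimal example. `parse_gpu_utilization`, `read_gpu_utilization`, and `underused_gpus` are hypothetical helper names, and the 80% floor simply mirrors the target stated above.

```python
import subprocess


def parse_gpu_utilization(csv_text):
    """Parse one integer utilization percentage per line (one line per GPU)."""
    return [int(line.strip()) for line in csv_text.splitlines() if line.strip()]


def read_gpu_utilization():
    """Query current utilization of every visible GPU via nvidia-smi."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_utilization(result.stdout)


def underused_gpus(utilizations, floor=80):
    """Indices of GPUs whose utilization is at or below the floor."""
    return [i for i, pct in enumerate(utilizations) if pct <= floor]
```

Running `underused_gpus(read_gpu_utilization())` while the server is under load lists the GPUs that are not being kept busy. A single reading is noisy, so sample several times.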

**Issue: Inference slower than expected**

Verify that the tensor-parallel size evenly divides the model's attention head count (in practice, usually a power of 2):

    vllm serve MODEL --tensor-parallel-size 4  # Not 3
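
The reason 3 fails while 4 works is divisibility: attention heads are split evenly across GPUs, so the tensor-parallel size has to divide the head count. The sketch below lists the sizes that satisfy that one condition. `valid_tensor_parallel_sizes` is a hypothetical helper, and a given model may impose further constraints (for example on KV heads) that it does not check.

```python
def valid_tensor_parallel_sizes(num_attention_heads, num_gpus):
    """Tensor-parallel sizes up to num_gpus that evenly divide the head count."""
    return [
        tp for tp in range(1, num_gpus + 1)
        if num_attention_heads % tp == 0
    ]
```

For a model with 64 attention heads on an 8-GPU node this returns `[1, 2, 4, 8]`, so sizes such as 3, 5, 6, and 7 are ruled out. Read `num_attention_heads` from the model's `config.json`.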


Enable speculative decoding for faster generation:

    vllm serve MODEL --speculative-model DRAFT_MODEL
