SKILL.md

Use alternatives instead:

AWQ/GPTQ: Maximum accuracy with calibration on NVIDIA GPUs

HQQ: Fast calibration-free quantization for HuggingFace

bitsandbytes: Simple integration with transformers library

TensorRT-LLM: Production NVIDIA deployment with maximum speed

Quick start

Installation

# Clone llama.cpp

git clone https://github.com/ggml-org/llama.cpp

cd llama.cpp

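# Build (CPU)

cmake -B build

cmake --build build --config Release

# Build with CUDA (NVIDIA)

cmake -B build -DGGML_CUDA=ON

cmake --build build --config Release

# Build with Metal (Apple Silicon; enabled by default on macOS)

cmake -B build -DGGML_METAL=ON

cmake --build build --config Release

# Binaries such as llama-cli and llama-quantize are written to build/bin/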

# Install Python bindings (optional)

pip install llama-cpp-python

Convert model to GGUF

# Install requirements

pip install -r requirements.txt

# Convert HuggingFace model to GGUF (FP16)

python convert_hf_to_gguf.py ./path/to/model --outfile model-f16.gguf

# Or specify output type

python convert_hf_to_gguf.py ./path/to/model \

    --outfile model-f16.gguf \

    --outtype f16
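
After conversion it can be useful to confirm that the output really is a GGUF file before spending time on quantization. The following is a minimal sketch, assuming the header layout used by GGUF version 2 and later (4-byte magic `GGUF`, then a little-endian uint32 version, uint64 tensor count, and uint64 metadata key/value count). The function name `read_gguf_header` is hypothetical and not part of llama.cpp.

```python
import struct

GGUF_MAGIC = b"GGUF"


def read_gguf_header(path):
    """Return (version, tensor_count, metadata_kv_count) from a GGUF file.

    Assumes the GGUF v2+ header layout; older files are not handled.
    """
    with open(path, "rb") as f:
        header = f.read(24)  # 4-byte magic + uint32 + two uint64 values
    if len(header) < 24 or header[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not look like a GGUF file")
    version, tensor_count, kv_count = struct.unpack("<IQQ", header[4:24])
    return version, tensor_count, kv_count
```

Calling `read_gguf_header("model-f16.gguf")` on a valid file should return three integers; a `ValueError` suggests the conversion did not produce a GGUF file.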

Quantize model

# Basic quantization to Q4_K_M

./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M

# Quantize with importance matrix (better quality)

./llama-imatrix -m model-f16.gguf -f calibration.txt -o model.imatrix

./llama-quantize --imatrix model.imatrix model-f16.gguf model-q4_k_m.gguf Q4_K_M

Run inference

# CLI inference

./llama-cli -m model-q4_k_m.gguf -p "Hello, how are you?"

# Interactive mode

./llama-cli -m model-q4_k_m.gguf --interactive

# With GPU offload

./llama-cli -m model-q4_k_m.gguf -ngl 35 -p "Hello!"

Quantization types

K-quant methods (recommended)

| Type | Bits | Size (7B) | Quality | Use Case |
|------|------|-----------|---------|----------|
| Q2_K | 2.5 | ~2.8 GB | Low | Extreme compression |
| Q3_K_S | 3.0 | ~3.0 GB | Low-Med | Memory constrained |
| Q3_K_M | 3.3 | ~3.3 GB | Medium | Balance |
| Q4_K_S | 4.0 | ~3.8 GB | Med-High | Good balance |
| Q4_K_M | 4.5 | ~4.1 GB | High | Recommended default |
| Q5_K_S | 5.0 | ~4.6 GB | High | Quality focused |
| Q5_K_M | 5.5 | ~4.8 GB | Very High | High quality |
| Q6_K | 6.0 | ~5.5 GB | Excellent | Near-original |
| Q8_0 | 8.0 | ~7.2 GB | Best | Maximum quality |
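
The sizes in the table follow roughly from parameter count times bits per weight. A back-of-the-envelope sketch, using the nominal bit widths listed above (the helper name is hypothetical, and real files differ somewhat because some tensors are stored at other precisions and the file carries metadata):

```python
def estimate_gguf_size_gb(n_params, bits_per_weight):
    """Rough file size in GB (10^9 bytes): parameters x bits / 8."""
    return n_params * bits_per_weight / 8 / 1e9


# Nominal bit widths taken from the K-quant table; 7e9 parameters assumed.
for name, bits in [("Q4_K_M", 4.5), ("Q6_K", 6.0), ("Q8_0", 8.0)]:
    print(f"{name}: ~{estimate_gguf_size_gb(7e9, bits):.1f} GB")
```

Treat the result as an order-of-magnitude check when planning disk and memory, not as an exact prediction.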

Legacy methods

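| Type | Description |
|------|-------------|
| Q4_0 | 4-bit, scale only |
| Q4_1 | 4-bit with scale and minimum (offset) |
| Q5_0 | 5-bit, scale only |
| Q5_1 | 5-bit with scale and minimum (offset) |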
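
To make the idea behind the basic legacy types concrete, here is a simplified sketch of block quantization with one shared scale per block. It illustrates the general technique only; it is not the exact ggml bit layout or rounding rule, and the function names are hypothetical.

```python
def quantize_block(values, bits=4):
    """Quantize one block of floats to signed integers sharing one scale."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest level and clamp to the signed range [-8, 7].
    quants = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return scale, quants


def dequantize_block(scale, quants):
    """Recover approximate floats from the stored scale and integers."""
    return [q * scale for q in quants]


block = [0.1 * i - 1.5 for i in range(32)]  # one block of 32 example weights
scale, quants = quantize_block(block)
restored = dequantize_block(scale, quants)
max_error = max(abs(a - b) for a, b in zip(block, restored))
print(f"scale={scale:.4f}, max round-trip error={max_error:.4f}")
```

The round-trip error is bounded by half the scale, which is why blocks containing outliers lose more precision than blocks with a narrow value range.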

Recommendation: Use K-quant methods (Q4_K_M, Q5_K_M) for best quality/size ratio.
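
One way to apply this recommendation is to pick the largest type that fits a memory budget. The sketch below reuses the approximate 7B sizes from the K-quant table; `pick_quant` and the `overhead_gb` allowance for KV cache and runtime buffers are illustrative assumptions, not measured values.

```python
# Approximate 7B file sizes in GB, copied from the K-quant table.
SIZES_7B_GB = {
    "Q2_K": 2.8, "Q3_K_S": 3.0, "Q3_K_M": 3.3,
    "Q4_K_S": 3.8, "Q4_K_M": 4.1, "Q5_K_S": 4.6,
    "Q5_K_M": 4.8, "Q6_K": 5.5, "Q8_0": 7.2,
}


def pick_quant(budget_gb, overhead_gb=1.0):
    """Return the largest listed type that fits, or None if nothing fits.

    overhead_gb is an assumed allowance for KV cache and buffers.
    """
    usable = budget_gb - overhead_gb
    fitting = [(size, name) for name, size in SIZES_7B_GB.items() if size <= usable]
    return max(fitting)[1] if fitting else None


print(pick_quant(6.0))  # budget of 6 GB with 1 GB held back
```

For other model sizes, scale the table values by the parameter count before comparing against the budget.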

Conversion workflows

Workflow 1: HuggingFace to GGUF

# 1. Download model

huggingface-cli download meta-llama/Llama-3.1-8B --local-dir ./llama-3.1-8b

# 2. Convert to GGUF (FP16)

python convert_hf_to_gguf.py ./llama-3.1-8b \

    --outfile llama-3.1-8b-f16.gguf \

    --outtype f16

# 3. Quantize

./llama-quantize llama-3.1-8b-f16.gguf llama-3.1-8b-q4_k_m.gguf Q4_K_M

# 4. Test

./llama-cli -m llama-3.1-8b-q4_k_m.gguf -p "Hello!" -n 50

Workflow 2: With importance matrix (better quality)

# 1. Convert to GGUF

python convert_hf_to_gguf.py ./model --outfile model-f16.gguf

# 2. Create calibration text (diverse samples; real runs need far more text than shown here)

cat > calibration.txt << 'EOF'

The quick brown fox jumps over the lazy dog.

Machine learning is a subset of artificial intelligence.

Python is a popular programming language.

# Add more diverse text samples...

EOF

# 3. Generate importance matrix

./llama-imatrix -m model-f16.gguf \

    -f calibration.txt \

    -c 512 \

    -o model.imatrix \

    -ngl 35  # GPU layers if available

# 4. Quantize with imatrix

./llama-quantize --imatrix model.imatrix \

    model-f16.gguf \

    model-q4_k_m.gguf \

    Q4_K_M

Workflow 3: Multiple quantizations

#!/bin/bash

MODEL="llama-3.1-8b-f16.gguf"

IMATRIX="llama-3.1-8b.imatrix"

# Generate imatrix once

./llama-imatrix -m $MODEL -f wiki.txt -o $IMATRIX -ngl 35

# Create multiple quantizations

for QUANT in Q4_K_M Q5_K_M Q6_K Q8_0; do

    OUTPUT="llama-3.1-8b-${QUANT,,}.gguf"

    ./llama-quantize --imatrix $IMATRIX $MODEL $OUTPUT $QUANT

    echo "Created: $OUTPUT ($(du -h $OUTPUT | cut -f1))"

done
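
The same loop can be driven from Python when the quantization step is part of a larger pipeline. This is a sketch under the assumption that `llama-quantize` is invoked exactly as in the shell script above; the helper names are hypothetical, and `str.removesuffix` requires Python 3.9 or later.

```python
import subprocess
from pathlib import Path


def quantize_commands(model, imatrix, quants, binary="./llama-quantize"):
    """Build one llama-quantize argument list per quantization type."""
    stem = Path(model).name.removesuffix("-f16.gguf")
    commands = []
    for quant in quants:
        output = f"{stem}-{quant.lower()}.gguf"
        commands.append([binary, "--imatrix", imatrix, model, output, quant])
    return commands


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


cmds = quantize_commands(
    "llama-3.1-8b-f16.gguf", "llama-3.1-8b.imatrix",
    ["Q4_K_M", "Q5_K_M", "Q6_K", "Q8_0"],
)
run_all(cmds)  # dry run: only prints the commands
```

Passing arguments as a list avoids shell quoting problems if a model path contains spaces.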

Python usage

llama-cpp-python

from llama_cpp import Llama

# Load model

llm = Llama(

    model_path="./model-q4_k_m.gguf",

    n_ctx=4096,          # Context window

    n_gpu_layers=35,     # GPU offload (0 for CPU only)

    n_threads=8          # CPU threads

)

# Generate

output = llm(

    "What is machine learning?",

    max_tokens=256,

    temperature=0.7,

    stop=["</s>", "\n\n"]

)

print(output["choices"][0]["text"])

Chat completion

from llama_cpp import Llama

llm = Llama(

    model_path="./model-q4_k_m.gguf",

    n_ctx=4096,

    n_gpu_layers=35,

    chat_format="llama-3"  # Or "chatml", "mistral", etc.

)

messages = [

    {"role": "system", "content": "You are a helpful assistant."},

    {"role": "user", "content": "What is Python?"}

]

response = llm.create_chat_completion(

    messages=messages,

    max_tokens=256,

    temperature=0.7

)

print(response["choices"][0]["message"]["content"])
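
For multi-turn conversations, the reply has to be appended to the message list before the next request. A minimal sketch of that bookkeeping, assuming the response shape shown above; `chat_turn` is a hypothetical helper, and `EchoModel` is a stand-in so the example runs without a GGUF file.

```python
def chat_turn(llm, messages, user_text, **kwargs):
    """Append a user message, request a reply, and record it in the history."""
    messages.append({"role": "user", "content": user_text})
    response = llm.create_chat_completion(messages=messages, **kwargs)
    reply = response["choices"][0]["message"]["content"]
    messages.append({"role": "assistant", "content": reply})
    return reply


class EchoModel:
    """Hypothetical stand-in that mimics the chat completion response shape."""

    def create_chat_completion(self, messages, **kwargs):
        text = "echo: " + messages[-1]["content"]
        return {"choices": [{"message": {"role": "assistant", "content": text}}]}


history = [{"role": "system", "content": "You are a helpful assistant."}]
reply = chat_turn(EchoModel(), history, "What is Python?", max_tokens=256)
print(reply)
```

With a real model, pass the `Llama` instance in place of `EchoModel()`; the history grows by two messages per turn, so long sessions eventually need trimming to stay within `n_ctx`.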

Streaming

from llama_cpp import Llama

llm = Llama(model_path="./model-q4_k_m.gguf", n_gpu_layers=35)

# Stream tokens

for chunk in llm(

    "Explain quantum computing:",

    max_tokens=256,

    stream=True

):

    print(chunk["choices"][0]["text"], end="", flush=True)
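
If the full text is needed after streaming finishes, the chunks can be collected while they are printed. The sketch below assumes each chunk has the shape used in the loop above; `collect_stream` is a hypothetical helper, and the fake chunks stand in for a real `llm(..., stream=True)` call.

```python
def collect_stream(chunks, echo=True):
    """Concatenate the text of streamed completion chunks."""
    parts = []
    for chunk in chunks:
        text = chunk["choices"][0]["text"]
        if echo:
            print(text, end="", flush=True)
        parts.append(text)
    return "".join(parts)


# Stand-in for llm("...", stream=True): dicts with the same shape as above.
fake_chunks = [
    {"choices": [{"text": t}]}
    for t in ["Quantum ", "computing ", "uses qubits."]
]
full_text = collect_stream(fake_chunks, echo=False)
```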

Server mode

Start OpenAI-compatible server

# Start server

./llama-server -m model-q4_k_m.gguf \

    --host 0.0.0.0 \

    --port 8080 \

    -ngl 35 \

    -c 4096

# Or with Python bindings

python -m llama_cpp.server \

    --model model-q4_k_m.gguf \

    --n_gpu_layers 35 \

    --host 0.0.0.0 \

    --port 8080

Use with OpenAI client

from openai import OpenAI

client = OpenAI(

    base_url="http://localhost:8080/v1",

    api_key="not-needed"

)

response = client.chat.completions.create(

    model="local-model",

    messages=[{"role": "user", "content": "Hello!"}],

    max_tokens=256

)

print(response.choices[0].message.content)
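
The server can also be called without the `openai` package. The following sketch uses only the standard library and assumes the server started above is listening on `localhost:8080` and exposes `/v1/chat/completions`; the helper names are hypothetical.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8080/v1", max_tokens=256):
    """Build a POST request for the OpenAI-compatible chat endpoint."""
    body = {
        "model": "local-model",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]


request = build_chat_request("Hello!")
# print(send(request))  # uncomment once the server is running
```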

Hardware optimization

Apple Silicon (Metal)

# Build with Metal

rm -rf build && cmake -B build -DGGML_METAL=ON && cmake --build build --config Release

# Run with Metal acceleration

./llama-cli -m model.gguf -ngl 99 -p "Hello"

# Python with Metal

from llama_cpp import Llama

llm = Llama(

    model_path="model.gguf",

    n_gpu_layers=99,     # Offload all layers

    n_threads=1          # Metal handles parallelism

)

NVIDIA CUDA

# Build with CUDA

rm -rf build && cmake -B build -DGGML_CUDA=ON && cmake --build build --config Release

# Run with CUDA

./llama-cli -m model.gguf -ngl 35 -p "Hello"

# Specify GPU

CUDA_VISIBLE_DEVICES=0 ./llama-cli -m model.gguf -ngl 35

CPU optimization

# Build with AVX2/AVX512

rm -rf build && cmake -B build && cmake --build build --config Release

# Run with optimal threads

./llama-cli -m model.gguf -t 8 -p "Hello"

# Python CPU config

from llama_cpp import Llama

llm = Llama(

    model_path="model.gguf",

    n_gpu_layers=0,      # CPU only

    n_threads=8,         # Match physical cores

    n_batch=512          # Batch size for prompt processing

)

Integration with tools

Ollama

# Create Modelfile

cat > Modelfile << 'EOF'

FROM ./model-q4_k_m.gguf

TEMPLATE """{{ .System }}

{{ .Prompt }}"""

PARAMETER temperature 0.7

PARAMETER num_ctx 4096

EOF

# Create Ollama model

ollama create mymodel -f Modelfile

# Run

ollama run mymodel "Hello!"

LM Studio

Place GGUF file in ~/.cache/lm-studio/models/

Open LM Studio and select the model

Configure context length and GPU offload

Start inference

text-generation-webui

# Place in models folder

cp model-q4_k_m.gguf text-generation-webui/models/

# Start with llama.cpp loader

python server.py --model model-q4_k_m.gguf --loader llama.cpp --n-gpu-layers 35

Best practices

Use K-quants: Q4_K_M offers best quality/size balance

Use imatrix: Always use importance matrix for Q4 and below

GPU offload: Offload as many layers as VRAM allows

Context length: Start with 4096, increase if needed

Thread count: Match physical CPU cores, not logical

Batch size: Increase n_batch for faster prompt processing
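
As a starting point for the GPU offload setting, the number of layers that fit can be estimated from the file size. This is a rough heuristic sketch: it assumes all layers are the same size and holds back a fixed `reserve_gb` for KV cache and buffers. Both the helper and the default reserve are illustrative assumptions, so verify against actual memory usage.

```python
import math


def estimate_gpu_layers(model_size_gb, n_layers, vram_gb, reserve_gb=1.5):
    """Estimate how many layers fit in VRAM, assuming equal-sized layers."""
    per_layer_gb = model_size_gb / n_layers
    usable_gb = max(0.0, vram_gb - reserve_gb)
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


# Example: a ~4.1 GB Q4_K_M file with an assumed 32 layers on a 4 GB GPU.
print(estimate_gpu_layers(4.1, 32, 4.0))
```

Use the result as the initial `-ngl` or `n_gpu_layers` value and lower it if loading fails with an out-of-memory error.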

Common issues

Model loads slowly:

# Use mmap for faster loading

./llama-cli -m model.gguf --mmap

Out of memory:

# Reduce GPU layers

./llama-cli -m model.gguf -ngl 20  # Reduce from 35

# Or use smaller quantization

./llama-quantize model-f16.gguf model-q3_k_m.gguf Q3_K_M

Poor quality at low bits:

# Always use imatrix for Q4 and below

./llama-imatrix -m model-f16.gguf -f calibration.txt -o model.imatrix

./llama-quantize --imatrix model.imatrix model-f16.gguf model-q4_k_m.gguf Q4_K_M

References

Advanced Usage - Batching, speculative decoding, custom builds

Troubleshooting - Common issues, debugging, benchmarks

Resources

Repository: https://github.com/ggml-org/llama.cpp

Python Bindings: https://github.com/abetlen/llama-cpp-python

Pre-quantized Models: https://huggingface.co/TheBloke

GGUF Converter: https://huggingface.co/spaces/ggml-org/gguf-my-repo

License: MIT
