peft-fine-tuning

Parameter-efficient fine-tuning for LLMs using LoRA, QLoRA, and 25+ methods. Use when fine-tuning large models (7B-70B) with limited GPU memory, when you need…

INSTALLATION
npx skills add https://github.com/davila7/claude-code-templates --skill peft-fine-tuning
Run in your project or agent environment. Adjust flags if your CLI version differs.

SKILL.md

PEFT (Parameter-Efficient Fine-Tuning)

Fine-tune LLMs by training <1% of parameters using LoRA, QLoRA, and 25+ adapter methods.

When to use PEFT

Use PEFT/LoRA when:

  • Fine-tuning 7B-70B models on limited hardware (a consumer RTX 4090, a single A100)
  • Need to train <1% of parameters (adapter checkpoints in the MB range vs a multi-GB full model)
  • Want fast iteration with multiple task-specific adapters
  • Deploying multiple fine-tuned variants from one base model

Use QLoRA (PEFT + quantization) when:

  • Fine-tuning 70B models on a single 48GB GPU (the 4-bit weights alone take roughly 35GB, so a 24GB card is not enough for 70B)
  • Memory is the primary constraint
  • Can accept a small quality trade-off vs full fine-tuning

Use full fine-tuning instead when:

  • Training small models (<1B parameters)
  • Need maximum quality and have compute budget
  • Significant domain shift requires updating all weights
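
A rough way to choose between these options is to estimate how much memory the frozen base weights alone occupy at each precision. The sketch below is back-of-the-envelope arithmetic, not a measurement: `weight_memory_gb` is a hypothetical helper, it treats 1 GB as 10^9 bytes, and it ignores activations, gradients, optimizer state, and quantization overhead, all of which add to the real footprint.

```python
def weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate size of the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_param / 8


# Compare two illustrative model sizes at common precisions
for name, params_b in [("8B", 8), ("70B", 70)]:
    for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
        print(f"{name} @ {label}: ~{weight_memory_gb(params_b, bits):.0f} GB")
```

Under these assumptions a 70B model needs about 35 GB for its 4-bit weights, which is why QLoRA on 70B is usually run on a 48GB-class GPU or sharded across several smaller ones.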

Quick start

Installation

# Basic installation

pip install peft

# With quantization support (recommended)

pip install peft bitsandbytes

# Full stack

pip install peft transformers accelerate bitsandbytes datasets

LoRA fine-tuning (standard)

from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer

from peft import get_peft_model, LoraConfig, TaskType

from datasets import load_dataset

# Load base model

model_name = "meta-llama/Llama-3.1-8B"

model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

tokenizer = AutoTokenizer.from_pretrained(model_name)

tokenizer.pad_token = tokenizer.eos_token

# LoRA configuration

lora_config = LoraConfig(

    task_type=TaskType.CAUSAL_LM,

    r=16,                          # Rank (8-64, higher = more capacity)

    lora_alpha=32,                 # Scaling factor (typically 2*r)

    lora_dropout=0.05,             # Dropout for regularization

    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],  # Attention layers

    bias="none"                    # Don't train biases

)

# Apply LoRA

model = get_peft_model(model, lora_config)

model.print_trainable_parameters()

# Output: trainable params: 13,631,488 || all params: 8,043,307,008 || trainable%: 0.17%

# Prepare dataset

dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

def tokenize(example):

    text = f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['response']}"

    return tokenizer(text, truncation=True, max_length=512, padding="max_length")

tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)

# Training

training_args = TrainingArguments(

    output_dir="./lora-llama",

    num_train_epochs=3,

    per_device_train_batch_size=4,

    gradient_accumulation_steps=4,

    learning_rate=2e-4,

    bf16=True,                     # Llama 3.1 weights are bf16; use fp16=True on GPUs without bf16 support

    logging_steps=10,

    save_strategy="epoch"

)

# Build causal-LM batches: converts the tokenized lists to tensors and

# copies input_ids into labels, with padding positions masked out of the loss

from transformers import DataCollatorForLanguageModeling

trainer = Trainer(

    model=model,

    args=training_args,

    train_dataset=tokenized,

    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)

)

trainer.train()

# Save adapter only (tens of MB vs ~16GB for the full model)

model.save_pretrained("./lora-llama-adapter")
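
The size of the saved adapter follows directly from the trainable-parameter count printed earlier. The snippet below is a rough estimate only: `adapter_size_mb` is a hypothetical helper, 1 MB is taken as 10^6 bytes, and the result ignores file-format overhead. Which line applies depends on the dtype the adapter weights are stored in, so both are shown.

```python
def adapter_size_mb(trainable_params: int, bytes_per_param: int) -> float:
    """Approximate on-disk size of the adapter weights, in MB (1 MB = 1e6 bytes)."""
    return trainable_params * bytes_per_param / 1e6


trainable = 13_631_488  # trainable params reported for r=16 in the example above

print(f"stored as 32-bit floats: ~{adapter_size_mb(trainable, 4):.1f} MB")
print(f"stored as 16-bit floats: ~{adapter_size_mb(trainable, 2):.1f} MB")
```

Either way the adapter is a few tens of MB, several hundred times smaller than the base checkpoint.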

QLoRA fine-tuning (memory-efficient)

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

from peft import get_peft_model, LoraConfig, prepare_model_for_kbit_training

# 4-bit quantization config

bnb_config = BitsAndBytesConfig(

    load_in_4bit=True,

    bnb_4bit_quant_type="nf4",           # NormalFloat4 (best for LLMs)

    bnb_4bit_compute_dtype="bfloat16",   # Compute in bf16

    bnb_4bit_use_double_quant=True       # Nested quantization

)

# Load quantized model

model = AutoModelForCausalLM.from_pretrained(

    "meta-llama/Llama-3.1-70B",

    quantization_config=bnb_config,

    device_map="auto"

)

# Prepare for training (enables gradient checkpointing)

model = prepare_model_for_kbit_training(model)

# LoRA config for QLoRA

lora_config = LoraConfig(

    r=64,                              # Higher rank for 70B

    lora_alpha=128,

    lora_dropout=0.1,

    target_modules=["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],

    bias="none",

    task_type="CAUSAL_LM"

)

model = get_peft_model(model, lora_config)

# The 4-bit 70B weights take roughly 35GB: use a 48GB GPU, or let device_map="auto" shard across several smaller GPUs

LoRA parameter selection

Rank (r) - capacity vs efficiency

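| Rank | Trainable Params | Memory  | Quality | Use Case                      |
|------|------------------|---------|---------|-------------------------------|
| 4    | ~3M              | Minimal | Lower   | Simple tasks, prototyping     |
| 8    | ~7M              | Low     | Good    | Recommended starting point    |
| 16   | ~14M             | Medium  | Better  | General fine-tuning           |
| 32   | ~27M             | Higher  | High    | Complex tasks                 |
| 64   | ~54M             | High    | Highest | Domain adaptation, 70B models |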
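
The trainable-parameter figures for each rank can be reproduced by hand: a LoRA adapter on a linear layer with `d_in` inputs and `d_out` outputs adds `r * (d_in + d_out)` parameters. The sketch below assumes the Llama 3.1 8B shapes (hidden size 4096, 32 layers, and 1024-wide key/value projections from grouped-query attention) and the four attention projections used in the quick start. These constants and the helper name `lora_param_count` are illustrative assumptions; check `model.config` for the model you actually use.

```python
# Assumed Llama 3.1 8B shapes (verify against model.config before relying on them)
HIDDEN = 4096
KV_DIM = 1024      # 8 key/value heads x 128 head dim (grouped-query attention)
NUM_LAYERS = 32

# (in_features, out_features) of each targeted attention projection
ATTN_PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
}


def lora_param_count(r, projections=ATTN_PROJECTIONS, num_layers=NUM_LAYERS):
    """LoRA adds A (r x d_in) and B (d_out x r) per targeted layer."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in projections.values())
    return per_layer * num_layers


for r in (4, 8, 16, 32, 64):
    print(f"r={r:>2}: {lora_param_count(r):,} trainable parameters")
```

For r=16 this gives 13,631,488, the same count reported by `print_trainable_parameters()` in the quick start. Targeting the MLP projections as well would raise every figure.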

Alpha (lora_alpha) - scaling factor

# Rule of thumb: alpha = 2 * rank

LoraConfig(r=16, lora_alpha=32)  # Standard

LoraConfig(r=16, lora_alpha=16)  # Conservative (lower learning rate effect)

LoraConfig(r=16, lora_alpha=64)  # Aggressive (higher learning rate effect)
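
The reason alpha behaves like a learning-rate knob is that the low-rank update is multiplied by `lora_alpha / r` before it is added to the frozen weight. The snippet below only illustrates that ratio for the three configurations above; `lora_scaling` is a hypothetical helper, and it assumes PEFT's default scaling rather than any alternative scaling option.

```python
def lora_scaling(r: int, lora_alpha: int) -> float:
    """Factor applied to the low-rank update B @ A under default LoRA scaling."""
    return lora_alpha / r


for alpha in (16, 32, 64):
    print(f"r=16, lora_alpha={alpha}: update scaled by {lora_scaling(16, alpha)}")
```

Keeping alpha = 2 * r holds this factor at 2.0 as the rank changes, so rank can be tuned without also retuning the learning rate as aggressively.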

Target modules by architecture

# Llama / Mistral / Qwen

target_modules = ["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]

# GPT-2 (GPT-Neo names its attention projections q_proj, k_proj, v_proj, out_proj)

target_modules = ["c_attn", "c_proj", "c_fc"]

# Falcon

target_modules = ["query_key_value", "dense", "dense_h_to_4h", "dense_4h_to_h"]

# BLOOM

target_modules = ["query_key_value", "dense", "dense_h_to_4h", "dense_4h_to_h"]

# Auto-detect all linear layers

target_modules = "all-linear"  # PEFT 0.8.0+
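
For an architecture not listed above, the module names can be read off the model itself. The sketch below is one possible approach, not a PEFT API: `linear_layer_suffixes` is a hypothetical helper that matches modules by class name (an assumption chosen so that plain, bitsandbytes-quantized, and GPT-2 style layers are all caught). The stand-in classes exist only so the example runs without loading a model.

```python
def linear_layer_suffixes(
    named_modules,
    linear_class_names=("Linear", "Linear4bit", "Linear8bitLt", "Conv1D"),
):
    """Collect the last path component of every module whose class name looks linear."""
    suffixes = set()
    for name, module in named_modules:
        if name and type(module).__name__ in linear_class_names:
            suffixes.add(name.rsplit(".", 1)[-1])
    return sorted(suffixes)


# Stand-ins so the sketch runs without a model; with a real model,
# pass model.named_modules() instead of fake_modules.
class Linear: ...
class LayerNorm: ...

fake_modules = [
    ("", object()),                                   # root module, skipped
    ("model.layers.0.self_attn.q_proj", Linear()),
    ("model.layers.0.self_attn.v_proj", Linear()),
    ("model.layers.0.mlp.down_proj", Linear()),
    ("model.layers.0.input_layernorm", LayerNorm()),  # not linear, skipped
    ("lm_head", Linear()),
]
print(linear_layer_suffixes(fake_modules))
```

The output head (`lm_head` here) shows up in the result because it is a linear layer, but it is usually left out of `target_modules`.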

Loading and merging adapters

Load trained adapter

from peft import PeftModel, AutoPeftModelForCausalLM

from transformers import AutoModelForCausalLM

# Option 1: Load with PeftModel

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

model = PeftModel.from_pretrained(base_model, "./lora-llama-adapter")

# Option 2: Load directly (recommended)

model = AutoPeftModelForCausalLM.from_pretrained(

    "./lora-llama-adapter",

    device_map="auto"

)

Merge adapter into base model

# Merge for deployment (no adapter overhead)

merged_model = model.merge_and_unload()

# Save merged model

merged_model.save_pretrained("./llama-merged")

tokenizer.save_pretrained("./llama-merged")

# Push to Hub

merged_model.push_to_hub("username/llama-finetuned")

Multi-adapter serving

from peft import AutoPeftModelForCausalLM

# Load base with first adapter, registered under the name "task1"

model = AutoPeftModelForCausalLM.from_pretrained("./adapter-task1", adapter_name="task1")

# Load additional adapters

model.load_adapter("./adapter-task2", adapter_name="task2")

model.load_adapter("./adapter-task3", adapter_name="task3")

# Switch between adapters at runtime

model.set_adapter("task1")  # Use task1 adapter

output1 = model.generate(**inputs)

model.set_adapter("task2")  # Switch to task2

output2 = model.generate(**inputs)

# Disable adapters (use base model)

with model.disable_adapter():

    base_output = model.generate(**inputs)

PEFT methods comparison

| Method        | Trainable % | Memory   | Speed   | Best For                 |
|---------------|-------------|----------|---------|--------------------------|
| LoRA          | 0.1-1%      | Low      | Fast    | General fine-tuning      |
| QLoRA         | 0.1-1%      | Very Low | Medium  | Memory-constrained       |
| AdaLoRA       | 0.1-1%      | Low      | Medium  | Automatic rank selection |
| IA3           | 0.01%       | Minimal  | Fastest | Few-shot adaptation      |
| Prefix Tuning | 0.1%        | Low      | Medium  | Generation control       |
| Prompt Tuning | 0.001%      | Minimal  | Fast    | Simple task adaptation   |
| P-Tuning v2   | 0.1%        | Low      | Medium  | NLU tasks                |

IA3 (minimal parameters)

from peft import IA3Config, get_peft_model

ia3_config = IA3Config(

    target_modules=["q_proj", "v_proj", "k_proj", "down_proj"],

    feedforward_modules=["down_proj"]

)

model = get_peft_model(model, ia3_config)

# Trains only 0.01% of parameters!

Prefix Tuning

from peft import PrefixTuningConfig, get_peft_model

prefix_config = PrefixTuningConfig(

    task_type="CAUSAL_LM",

    num_virtual_tokens=20,      # Prepended tokens

    prefix_projection=True       # Use MLP projection

)

model = get_peft_model(model, prefix_config)

Integration patterns

With TRL (SFTTrainer)

from trl import SFTTrainer, SFTConfig

from peft import LoraConfig

lora_config = LoraConfig(r=16, lora_alpha=32, target_modules="all-linear")

trainer = SFTTrainer(

    model=model,

    args=SFTConfig(output_dir="./output", max_seq_length=512),  # recent TRL releases name this max_length

    train_dataset=dataset,

    peft_config=lora_config,  # Pass LoRA config directly

)

trainer.train()

With Axolotl (YAML config)

# axolotl config.yaml

adapter: lora

lora_r: 16

lora_alpha: 32

lora_dropout: 0.05

lora_target_modules:

  - q_proj

  - v_proj

  - k_proj

  - o_proj

lora_target_linear: true  # Target all linear layers

With vLLM (inference)

from vllm import LLM

from vllm.lora.request import LoRARequest

# Load base model with LoRA support

llm = LLM(model="meta-llama/Llama-3.1-8B", enable_lora=True)

# Serve with adapter

outputs = llm.generate(

    prompts,

    lora_request=LoRARequest("adapter1", 1, "./lora-adapter")

)

Performance benchmarks

Memory usage (Llama 3.1 8B)

| Method           | GPU Memory | Trainable Params |
|------------------|------------|------------------|
| Full fine-tuning | 60+ GB     | 8B (100%)        |
| LoRA r=16        | 18 GB      | 14M (0.17%)      |
| QLoRA r=16       | 6 GB       | 14M (0.17%)      |
| IA3              | 16 GB      | 800K (0.01%)     |

Training speed (A100 80GB)

| Method  | Tokens/sec | vs Full FT |
|---------|------------|------------|
| Full FT | 2,500      | 1x         |
| LoRA    | 3,200      | 1.3x       |
| QLoRA   | 2,100      | 0.84x      |

Quality (MMLU benchmark)

| Model       | Full FT | LoRA | QLoRA |
|-------------|---------|------|-------|
| Llama 2-7B  | 45.3    | 44.8 | 44.1  |
| Llama 2-13B | 54.8    | 54.2 | 53.5  |

Common issues

CUDA OOM during training

# Solution 1: Enable gradient checkpointing

model.gradient_checkpointing_enable()

# Solution 2: Reduce batch size + increase accumulation

TrainingArguments(

    per_device_train_batch_size=1,

    gradient_accumulation_steps=16

)

# Solution 3: Use QLoRA

from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")
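
Solution 2 works because the optimizer only steps once per accumulation cycle, so the number of examples per update stays the same while peak activation memory drops. The arithmetic can be checked with a small sketch; `effective_batch_size` is a hypothetical helper and a single GPU is assumed by default.

```python
def effective_batch_size(per_device_batch, grad_accum_steps, num_devices=1):
    """Number of examples that contribute to each optimizer step."""
    return per_device_batch * grad_accum_steps * num_devices


print(effective_batch_size(4, 4))    # quick-start settings -> 16
print(effective_batch_size(1, 16))   # OOM workaround settings -> 16
```

Both configurations update on 16 examples per step, so the learning rate from the quick start can usually be kept as a starting point.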

Adapter not applying

# Verify adapter is active

print(model.active_adapters)  # Should show adapter name

# Check trainable parameters

model.print_trainable_parameters()

# Ensure model in training mode

model.train()

Quality degradation

# Increase rank

LoraConfig(r=32, lora_alpha=64)

# Target more modules

target_modules = "all-linear"

# Use more training data and epochs

TrainingArguments(num_train_epochs=5)

# Lower learning rate

TrainingArguments(learning_rate=1e-4)

Best practices

  • Start with r=8-16, increase if quality insufficient
  • Use alpha = 2 * rank as starting point
  • Target attention + MLP layers for best quality/efficiency
  • Enable gradient checkpointing for memory savings
  • Save adapters frequently (small files, easy rollback)
  • Evaluate on held-out data before merging
  • Use QLoRA when a model's 16-bit weights do not fit in GPU memory (70B+ still needs roughly 35GB or more for 4-bit weights)

References

Resources

  • LoRA Paper: arXiv:2106.09685
  • QLoRA Paper: arXiv:2305.14314