stable-diffusion-image-generation

Text-to-image generation and image transformation with Stable Diffusion models via HuggingFace Diffusers.

  • Supports multiple generation modes: text-to-image, image-to-image translation, inpainting, outpainting, and ControlNet spatial conditioning for precise control
  • Compatible with SD 1.5, SDXL, SD 3.0, and Flux models; includes scheduler swapping (Euler, DPM-Solver, LCM) for quality and speed trade-offs
  • LoRA adapter support for efficient style fine-tuning and multi-adapter composition with adjustable weights
  • Memory optimization tools including CPU offloading, attention slicing, xFormers integration, and VAE tiling for resource-constrained environments

INSTALLATION
npx skills add https://github.com/davila7/claude-code-templates --skill stable-diffusion-image-generation
Run in your project or agent environment. Adjust flags if your CLI version differs.

SKILL.md

Stable Diffusion Image Generation

Comprehensive guide to generating images with Stable Diffusion using the HuggingFace Diffusers library.

When to use Stable Diffusion

Use Stable Diffusion when:

  • Generating images from text descriptions
  • Performing image-to-image translation (style transfer, enhancement)
  • Inpainting (filling in masked regions)
  • Outpainting (extending images beyond boundaries)
  • Creating variations of existing images
  • Building custom image generation workflows

Key features:

  • Text-to-Image: Generate images from natural language prompts
  • Image-to-Image: Transform existing images with text guidance
  • Inpainting: Fill masked regions with context-aware content
  • ControlNet: Add spatial conditioning (edges, poses, depth)
  • LoRA Support: Efficient fine-tuning and style adaptation
  • Multiple Models: SD 1.5, SDXL, SD 3.0, Flux support

Use alternatives instead:

  • DALL-E 3: For API-based generation without GPU
  • Midjourney: For artistic, stylized outputs
  • Imagen: For Google Cloud integration
  • Leonardo.ai: For web-based creative workflows

Quick start

Installation

pip install diffusers transformers accelerate torch

pip install xformers  # Optional: memory-efficient attention

Basic text-to-image

from diffusers import DiffusionPipeline
import torch

# Load pipeline (auto-detects model type)
pipe = DiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16
)
pipe.to("cuda")

# Generate image
image = pipe(
    "A serene mountain landscape at sunset, highly detailed",
    num_inference_steps=50,
    guidance_scale=7.5
).images[0]
image.save("output.png")

Using SDXL (higher quality)

from diffusers import AutoPipelineForText2Image
import torch

pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16"
)

# Enable memory optimization. Model CPU offload manages device placement
# itself, so do not also call pipe.to("cuda")
pipe.enable_model_cpu_offload()

image = pipe(
    prompt="A futuristic city with flying cars, cinematic lighting",
    height=1024,
    width=1024,
    num_inference_steps=30
).images[0]

Architecture overview

Three-pillar design

Diffusers is built around three core components:

Pipeline (orchestration)
├── Model (neural networks)
│   ├── UNet / Transformer (noise prediction)
│   ├── VAE (latent encoding/decoding)
│   └── Text Encoder (CLIP/T5)
└── Scheduler (denoising algorithm)

Pipeline inference flow

Text Prompt → Text Encoder → Text Embeddings
                                    ↓
Random Noise → [Denoising Loop] ← Scheduler
                      ↓
              Denoised Latents
                      ↓
              VAE Decoder → Final Image
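
The flow above can be sketched as a toy loop in plain Python. This is only an illustration under stated assumptions, not the Diffusers implementation: `perfect_predictor` is a hypothetical stand-in for the UNet that returns the exact noise, the "latent" is a short list of floats, and the sigma schedule is made up. Each iteration applies one Euler step, the basic update behind Euler-style schedulers (real schedulers also rescale the model input, which is omitted here).

```python
# Toy denoising loop: stand-ins only, no real model or Diffusers call.

def euler_step(x, eps, sigma, sigma_next):
    """One Euler step: move the sample from noise level sigma to sigma_next."""
    return [xi + (sigma_next - sigma) * ei for xi, ei in zip(x, eps)]

def denoise(x_noisy, predict_noise, sigmas):
    """Walk down the noise schedule, asking the model for noise at each level."""
    x = list(x_noisy)
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        eps = predict_noise(x, sigma)  # the UNet's job in a real pipeline
        x = euler_step(x, eps, sigma, sigma_next)
    return x

# Hypothetical setup: a known clean latent plus known noise.
clean = [0.5, -1.0, 2.0]
noise = [1.0, 0.25, -0.5]
sigmas = [10.0, 5.0, 2.0, 0.5, 0.0]  # made-up schedule ending at zero noise

def perfect_predictor(x, sigma):
    """Stand-in for the UNet: always returns the true noise."""
    return noise

x_start = [c + sigmas[0] * n for c, n in zip(clean, noise)]
latents = denoise(x_start, perfect_predictor, sigmas)
```

With a perfect noise predictor the loop lands back on the clean latent; a real model only approximates the noise, which is why the number of steps and the choice of scheduler affect quality.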

Core concepts

Pipelines

Pipelines orchestrate complete workflows:

| Pipeline | Purpose |
|---|---|
| StableDiffusionPipeline | Text-to-image (SD 1.x/2.x) |
| StableDiffusionXLPipeline | Text-to-image (SDXL) |
| StableDiffusion3Pipeline | Text-to-image (SD 3.0) |
| FluxPipeline | Text-to-image (Flux models) |
| StableDiffusionImg2ImgPipeline | Image-to-image |
| StableDiffusionInpaintPipeline | Inpainting |

Schedulers

Schedulers control the denoising process:

| Scheduler | Steps | Quality | Use Case |
|---|---|---|---|
| EulerDiscreteScheduler | 20-50 | Good | Default choice |
| EulerAncestralDiscreteScheduler | 20-50 | Good | More variation |
| DPMSolverMultistepScheduler | 15-25 | Excellent | Fast, high quality |
| DDIMScheduler | 50-100 | Good | Deterministic |
| LCMScheduler | 4-8 | Good | Very fast |
| UniPCMultistepScheduler | 15-25 | Excellent | Fast convergence |

Swapping schedulers

from diffusers import DPMSolverMultistepScheduler

# Swap for faster generation
pipe.scheduler = DPMSolverMultistepScheduler.from_config(
    pipe.scheduler.config
)

# Now generate with fewer steps
image = pipe(prompt, num_inference_steps=20).images[0]

Generation parameters

Key parameters

| Parameter | Default | Description |
|---|---|---|
| prompt | Required | Text description of desired image |
| negative_prompt | None | What to avoid in the image |
| num_inference_steps | 50 | Denoising steps (more steps are slower, with diminishing quality gains) |
| guidance_scale | 7.5 | Prompt adherence (7-12 typical) |
| height, width | 512 (SD 1.5) / 1024 (SDXL) | Output dimensions in pixels (multiples of 8) |
| generator | None | Torch generator for reproducibility |
| num_images_per_prompt | 1 | Images generated per prompt |
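
The "multiples of 8" requirement comes from the VAE, which downsamples each side by a factor of 8 in SD 1.5 and SDXL. The sketch below snaps a requested size to a valid one and shows the resulting latent shape. The helper names `snap_to_multiple` and `latent_shape` are illustrative, not part of Diffusers, and the 4-channel assumption applies to SD 1.5 and SDXL only (SD 3.0 and Flux use different latent layouts).

```python
VAE_SCALE_FACTOR = 8  # SD 1.5 and SDXL VAEs downsample each side by 8
LATENT_CHANNELS = 4   # latent channels for SD 1.5 and SDXL

def snap_to_multiple(value, multiple=VAE_SCALE_FACTOR):
    """Round down to the nearest multiple, never below one multiple."""
    return max(multiple, (value // multiple) * multiple)

def latent_shape(height, width):
    """Shape (channels, h, w) of the latent tensor for one image."""
    h = snap_to_multiple(height)
    w = snap_to_multiple(width)
    return (LATENT_CHANNELS, h // VAE_SCALE_FACTOR, w // VAE_SCALE_FACTOR)

# A requested 770x515 image becomes 768x512 before generation.
height, width = snap_to_multiple(770), snap_to_multiple(515)
sd15_latents = latent_shape(512, 512)
sdxl_latents = latent_shape(1024, 1024)
```

The snapped values can then be passed as `height` and `width` to the pipeline call.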

Reproducible generation

import torch

generator = torch.Generator(device="cuda").manual_seed(42)

image = pipe(
    prompt="A cat wearing a top hat",
    generator=generator,
    num_inference_steps=50
).images[0]

Negative prompts

image = pipe(
    prompt="Professional photo of a dog in a garden",
    negative_prompt="blurry, low quality, distorted, ugly, bad anatomy",
    guidance_scale=7.5
).images[0]
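
To make `guidance_scale` and `negative_prompt` less abstract, here is a minimal sketch of the classifier-free guidance formula that combines the two noise predictions at each step. It uses plain lists with made-up numbers rather than real model outputs, and `apply_guidance` is an illustrative name, not a Diffusers function. The negative prompt takes the place of the empty prompt in the unconditional branch.

```python
def apply_guidance(noise_uncond, noise_text, guidance_scale):
    """Classifier-free guidance: push the prediction away from the
    unconditional (or negative-prompt) branch toward the text branch."""
    return [
        u + guidance_scale * (t - u)
        for u, t in zip(noise_uncond, noise_text)
    ]

# Made-up noise predictions for a three-element "latent".
noise_uncond = [0.0, 1.0, -2.0]  # from the empty or negative prompt
noise_text = [1.0, 1.0, 0.0]     # from the positive prompt

guided = apply_guidance(noise_uncond, noise_text, guidance_scale=7.5)
unguided = apply_guidance(noise_uncond, noise_text, guidance_scale=1.0)
```

At `guidance_scale=1.0` the result equals the text-conditioned prediction, so guidance (and with it the negative prompt) has no effect; larger values exaggerate the difference between the two branches.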

Image-to-image

Transform existing images with text guidance:

from diffusers import AutoPipelineForImage2Image
from PIL import Image
import torch

pipe = AutoPipelineForImage2Image.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("input.jpg").resize((512, 512))

image = pipe(
    prompt="A watercolor painting of the scene",
    image=init_image,
    strength=0.75,  # How much to transform (0-1)
    num_inference_steps=50
).images[0]
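
`strength` also changes how many denoising steps actually run: the input image is noised part of the way, and only the remaining portion of the schedule is executed. The helper below is a sketch of that relationship as the Stable Diffusion image-to-image pipeline is understood to compute it; `img2img_steps` is an illustrative name, and the exact behavior may differ between pipeline classes and versions.

```python
def img2img_steps(num_inference_steps, strength):
    """Approximate number of denoising steps that actually run in img2img."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    return min(int(num_inference_steps * strength), num_inference_steps)

steps_run = img2img_steps(50, 0.75)    # settings from the example above
steps_full = img2img_steps(50, 1.0)    # input image is fully replaced by noise
steps_none = img2img_steps(50, 0.0)    # input image is returned almost unchanged
```

Under this assumption, low `strength` values with few steps leave very little denoising work, so raising `num_inference_steps` can help when the output looks underprocessed.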

Inpainting

Fill masked regions:

from diffusers import AutoPipelineForInpainting
from PIL import Image
import torch

pipe = AutoPipelineForInpainting.from_pretrained(
    "runwayml/stable-diffusion-inpainting",
    torch_dtype=torch.float16
).to("cuda")

# The image and mask should have the same dimensions
image = Image.open("photo.jpg")
mask = Image.open("mask.png")  # White = inpaint region

result = pipe(
    prompt="A red car parked on the street",
    image=image,
    mask_image=mask,
    num_inference_steps=50
).images[0]

ControlNet

Add spatial conditioning for precise control:

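from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from PIL import Image
import cv2  # pip install opencv-python
import numpy as np
import torch

def get_canny_image(image, low_threshold=100, high_threshold=200):
    """Return a 3-channel Canny edge map of a PIL image (thresholds are illustrative)."""
    edges = cv2.Canny(np.array(image), low_threshold, high_threshold)
    return Image.fromarray(np.stack([edges] * 3, axis=-1))

# Load ControlNet for edge conditioning
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11p_sd15_canny",
    torch_dtype=torch.float16
)

pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16
).to("cuda")

# Use Canny edge image as control ("input.jpg" is a placeholder path)
input_image = Image.open("input.jpg").convert("RGB")
control_image = get_canny_image(input_image)

image = pipe(
    prompt="A beautiful house in the style of Van Gogh",
    image=control_image,
    num_inference_steps=30
).images[0]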

Available ControlNets

| ControlNet | Input Type | Use Case |
|---|---|---|
| canny | Edge maps | Preserve structure |
| openpose | Pose skeletons | Human poses |
| depth | Depth maps | 3D-aware generation |
| normal | Normal maps | Surface details |
| mlsd | Line segments | Architectural lines |
| scribble | Rough sketches | Sketch-to-image |

LoRA adapters

Load fine-tuned style adapters:

from diffusers import DiffusionPipeline
import torch

pipe = DiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16
).to("cuda")

# Load LoRA weights
pipe.load_lora_weights("path/to/lora", weight_name="style.safetensors")

# Optionally fuse the LoRA into the base weights at a chosen strength
pipe.fuse_lora(lora_scale=0.8)

# Generate with LoRA style
image = pipe("A portrait in the trained style").images[0]

# Undo the fusion, then unload the LoRA
pipe.unfuse_lora()
pipe.unload_lora_weights()

Multiple LoRAs

# Load multiple LoRAs
pipe.load_lora_weights("lora1", adapter_name="style")
pipe.load_lora_weights("lora2", adapter_name="character")

# Set weights for each
pipe.set_adapters(["style", "character"], adapter_weights=[0.7, 0.5])

image = pipe("A portrait").images[0]
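
Conceptually, each adapter contributes a low-rank update to a base weight matrix, and the adapter weight scales that update. The sketch below shows the arithmetic on a toy 2x2 matrix in plain Python. All values are made up, `apply_adapters` is an illustrative name rather than a Diffusers or PEFT function, and real LoRA layers also apply an alpha/rank scaling factor that is left out here.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def apply_adapters(base, adapters):
    """Return base + sum(weight * (B @ A)) over all (B, A, weight) adapters."""
    merged = [row[:] for row in base]
    for b, a, weight in adapters:
        delta = matmul(b, a)  # low-rank update, same shape as base
        for i, row in enumerate(delta):
            for j, value in enumerate(row):
                merged[i][j] += weight * value
    return merged

# Toy 2x2 base weight and two rank-1 adapters (all values made up).
base = [[1.0, 0.0], [0.0, 1.0]]
style = ([[1.0], [0.0]], [[2.0, 0.0]])      # B is 2x1, A is 1x2
character = ([[0.0], [1.0]], [[0.0, 4.0]])

merged = apply_adapters(base, [(*style, 0.7), (*character, 0.5)])
```

Because the updates simply add up, two adapters that touch the same weights can interfere with each other; lowering one adapter's weight is the usual first thing to try when a combination looks wrong.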

Memory optimization

Enable CPU offloading

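# Choose one of the two offloading modes, not both.

# Model CPU offload: keeps each model on the CPU and moves it to the GPU only while it runs
pipe.enable_model_cpu_offload()

# Sequential CPU offload: offloads at the submodule level; lowest memory use, much slower
pipe.enable_sequential_cpu_offload()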

Attention slicing

# Reduce memory by computing attention in chunks
pipe.enable_attention_slicing()

# Or pass "max" to compute one slice at a time (lowest memory use, slowest)
pipe.enable_attention_slicing("max")

xFormers memory-efficient attention

# Requires xformers package

pipe.enable_xformers_memory_efficient_attention()

VAE slicing and tiling

# Slicing: decode a batch of latents one image at a time
pipe.enable_vae_slicing()

# Tiling: process large images tile by tile
pipe.enable_vae_tiling()

Model variants

Loading different precisions

# FP16 (recommended for GPU)
pipe = DiffusionPipeline.from_pretrained(
    "model-id",
    torch_dtype=torch.float16,
    variant="fp16"
)

# BF16 (better precision, requires Ampere+ GPU)
pipe = DiffusionPipeline.from_pretrained(
    "model-id",
    torch_dtype=torch.bfloat16
)

Loading specific components

from diffusers import AutoencoderKL, DiffusionPipeline
import torch

# Load custom VAE in the same dtype as the rest of the pipeline
vae = AutoencoderKL.from_pretrained(
    "stabilityai/sd-vae-ft-mse",
    torch_dtype=torch.float16
)

# Use with pipeline
pipe = DiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    vae=vae,
    torch_dtype=torch.float16
)

Batch generation

Generate multiple images efficiently:

# Multiple prompts
prompts = [
    "A cat playing piano",
    "A dog reading a book",
    "A bird painting a picture"
]

images = pipe(prompts, num_inference_steps=30).images

# Multiple images per prompt
images = pipe(
    "A beautiful sunset",
    num_images_per_prompt=4,
    num_inference_steps=30
).images
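
Passing a long prompt list in one call makes the whole list a single batch, which can exhaust GPU memory. One hedged way around that is to split the list into smaller batches first. The `chunked` helper below is an illustrative utility, not part of Diffusers, and the batch size of 2 is arbitrary; the pipeline call is shown only as a comment because it needs a loaded `pipe`.

```python
def chunked(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

prompts = [
    "A cat playing piano",
    "A dog reading a book",
    "A bird painting a picture",
    "A fox writing a letter",
    "An owl fixing a clock",
]

batches = list(chunked(prompts, batch_size=2))

# With a loaded pipeline (not created here), each batch would be one call:
# images = []
# for batch in batches:
#     images.extend(pipe(batch, num_inference_steps=30).images)
```

Smaller batches trade throughput for peak memory, so the right size depends on the model, resolution, and available VRAM.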

Common workflows

Workflow 1: High-quality generation

from diffusers import StableDiffusionXLPipeline, DPMSolverMultistepScheduler
import torch

# 1. Load SDXL with optimizations
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16"
)
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

# Model CPU offload manages device placement, so pipe.to("cuda") is not needed
pipe.enable_model_cpu_offload()

# 2. Generate with quality settings
image = pipe(
    prompt="A majestic lion in the savanna, golden hour lighting, 8k, detailed fur",
    negative_prompt="blurry, low quality, cartoon, anime, sketch",
    num_inference_steps=30,
    guidance_scale=7.5,
    height=1024,
    width=1024
).images[0]

Workflow 2: Fast prototyping

from diffusers import AutoPipelineForText2Image, LCMScheduler
import torch

# Use LCM for 4-8 step generation
pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16
).to("cuda")

# Load LCM LoRA for fast generation
pipe.load_lora_weights("latent-consistency/lcm-lora-sdxl")
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.fuse_lora()

# Generate in 4 steps (wall-clock time depends on the GPU)
image = pipe(
    "A beautiful landscape",
    num_inference_steps=4,
    guidance_scale=1.0
).images[0]

Common issues

CUDA out of memory:

# Enable memory optimizations
pipe.enable_model_cpu_offload()
pipe.enable_attention_slicing()
pipe.enable_vae_slicing()

# Or use lower precision
pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)

Black/noise images:

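# Check VAE configuration

# The safety checker returns a black image when it flags an output;
# bypass it only if that is appropriate for your use case
pipe.safety_checker = None

# Ensure all components share one dtype; if float16 still produces
# black images, try torch.float32 instead
pipe = pipe.to(dtype=torch.float16)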

Slow generation:

# Use faster scheduler

from diffusers import DPMSolverMultistepScheduler

# Use faster scheduler
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

# Reduce steps
image = pipe(prompt, num_inference_steps=20).images[0]

