blip-2-vision-language

Vision-language pre-training framework bridging frozen image encoders and LLMs. Use when you need image captioning, visual question answering, image-text…

INSTALLATION
npx skills add https://github.com/davila7/claude-code-templates --skill blip-2-vision-language
Run in your project or agent environment. Adjust flags if your CLI version differs.

SKILL.md

BLIP-2: Vision-Language Pre-training

Comprehensive guide to using Salesforce's BLIP-2 for vision-language tasks with frozen image encoders and large language models.

When to use BLIP-2

Use BLIP-2 when:

  • Need high-quality image captioning with natural descriptions
  • Building visual question answering (VQA) systems
  • Require zero-shot image-text understanding without task-specific training
  • Want to leverage LLM reasoning for visual tasks
  • Building multimodal conversational AI
  • Need image-text retrieval or matching

Key features:

  • Q-Former architecture: Lightweight query transformer bridges vision and language
  • Frozen backbone efficiency: No need to fine-tune large vision/language models
  • Multiple LLM backends: OPT (2.7B, 6.7B) and FlanT5 (XL, XXL)
  • Zero-shot capabilities: Strong performance without task-specific training
  • Efficient training: Only trains Q-Former (~188M parameters)
  • State-of-the-art results: Beats larger models on VQA benchmarks

Use alternatives instead:

  • LLaVA: For instruction-following multimodal chat
  • InstructBLIP: For improved instruction-following (BLIP-2 successor)
  • GPT-4V/Claude 3: For production multimodal chat (proprietary)
  • CLIP: For simple image-text similarity without generation
  • Flamingo: For few-shot visual learning

Quick start

Installation

# HuggingFace Transformers (recommended)

pip install transformers accelerate torch Pillow

# Or LAVIS library (Salesforce official)

pip install salesforce-lavis

Basic image captioning

import torch

from PIL import Image

from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Load model and processor

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")

model = Blip2ForConditionalGeneration.from_pretrained(

    "Salesforce/blip2-opt-2.7b",

    torch_dtype=torch.float16,

    device_map="auto"

)

# Load image

image = Image.open("photo.jpg").convert("RGB")

# Generate caption

inputs = processor(images=image, return_tensors="pt").to("cuda", torch.float16)

generated_ids = model.generate(**inputs, max_new_tokens=50)

caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

print(caption)

Visual question answering

# Ask a question about the image

question = "What color is the car in this image?"

inputs = processor(images=image, text=question, return_tensors="pt").to("cuda", torch.float16)

generated_ids = model.generate(**inputs, max_new_tokens=50)

answer = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

print(answer)

Using LAVIS library

import torch

from lavis.models import load_model_and_preprocess

from PIL import Image

# Load model

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model, vis_processors, txt_processors = load_model_and_preprocess(

    name="blip2_opt",

    model_type="pretrain_opt2.7b",

    is_eval=True,

    device=device

)

# Process image

image = Image.open("photo.jpg").convert("RGB")

image = vis_processors["eval"](image).unsqueeze(0).to(device)

# Caption

caption = model.generate({"image": image})

print(caption)

# VQA

question = txt_processors["eval"]("What is in this image?")

answer = model.generate({"image": image, "prompt": question})

print(answer)

Core concepts

Architecture overview

BLIP-2 Architecture:

┌─────────────────────────────────────────────────────────────┐

│                        Q-Former                              │

│  ┌─────────────────────────────────────────────────────┐    │

│  │     Learned Queries (32 queries × 768 dim)          │    │

│  └────────────────────────┬────────────────────────────┘    │

│                           │                                  │

│  ┌────────────────────────▼────────────────────────────┐    │

│  │    Cross-Attention with Image Features               │    │

│  └────────────────────────┬────────────────────────────┘    │

│                           │                                  │

│  ┌────────────────────────▼────────────────────────────┐    │

│  │    Self-Attention Layers (Transformer)               │    │

│  └────────────────────────┬────────────────────────────┘    │

└───────────────────────────┼─────────────────────────────────┘

                            │

┌───────────────────────────▼─────────────────────────────────┐

│  Frozen Vision Encoder    │      Frozen LLM                  │

│  (ViT-G/14 from EVA-CLIP) │      (OPT or FlanT5)            │

└─────────────────────────────────────────────────────────────┘

Model variants

Model

LLM Backend

Size

Use Case

blip2-opt-2.7b

OPT-2.7B

~4GB

General captioning, VQA

blip2-opt-6.7b

OPT-6.7B

~8GB

Better reasoning

blip2-flan-t5-xl

FlanT5-XL

~5GB

Instruction following

blip2-flan-t5-xxl

FlanT5-XXL

~13GB

Best quality

Q-Former components

Component

Description

Parameters

Learned queries

Fixed set of learnable embeddings

32 × 768

Image transformer

Cross-attention to vision features

~108M

Text transformer

Self-attention for text

~108M

Linear projection

Maps to LLM dimension

Varies

Advanced usage

Batch processing

from PIL import Image

import torch

# Load multiple images

images = [Image.open(f"image_{i}.jpg").convert("RGB") for i in range(4)]

questions = [

    "What is shown in this image?",

    "Describe the scene.",

    "What colors are prominent?",

    "Is there a person in this image?"

]

# Process batch

inputs = processor(

    images=images,

    text=questions,

    return_tensors="pt",

    padding=True

).to("cuda", torch.float16)

# Generate

generated_ids = model.generate(**inputs, max_new_tokens=50)

answers = processor.batch_decode(generated_ids, skip_special_tokens=True)

for q, a in zip(questions, answers):

    print(f"Q: {q}\nA: {a}\n")

Controlling generation

# Control generation parameters

generated_ids = model.generate(

    **inputs,

    max_new_tokens=100,

    min_length=20,

    num_beams=5,              # Beam search

    no_repeat_ngram_size=2,   # Avoid repetition

    top_p=0.9,                # Nucleus sampling

    temperature=0.7,          # Creativity

    do_sample=True,           # Enable sampling

)

# For deterministic output

generated_ids = model.generate(

    **inputs,

    max_new_tokens=50,

    num_beams=5,

    do_sample=False,

)

Memory optimization

# 8-bit quantization

from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model = Blip2ForConditionalGeneration.from_pretrained(

    "Salesforce/blip2-opt-6.7b",

    quantization_config=quantization_config,

    device_map="auto"

)

# 4-bit quantization (more aggressive)

quantization_config = BitsAndBytesConfig(

    load_in_4bit=True,

    bnb_4bit_compute_dtype=torch.float16

)

model = Blip2ForConditionalGeneration.from_pretrained(

    "Salesforce/blip2-flan-t5-xxl",

    quantization_config=quantization_config,

    device_map="auto"

)

Image-text matching

# Using LAVIS for ITM (Image-Text Matching)

from lavis.models import load_model_and_preprocess

model, vis_processors, txt_processors = load_model_and_preprocess(

    name="blip2_image_text_matching",

    model_type="pretrain",

    is_eval=True,

    device=device

)

image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

text = txt_processors["eval"]("a dog sitting on grass")

# Get matching score

itm_output = model({"image": image, "text_input": text}, match_head="itm")

itm_scores = torch.nn.functional.softmax(itm_output, dim=1)

print(f"Match probability: {itm_scores[:, 1].item():.3f}")

Feature extraction

# Extract image features with Q-Former

from lavis.models import load_model_and_preprocess

model, vis_processors, _ = load_model_and_preprocess(

    name="blip2_feature_extractor",

    model_type="pretrain",

    is_eval=True,

    device=device

)

image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

# Get features

features = model.extract_features({"image": image}, mode="image")

image_embeds = features.image_embeds  # Shape: [1, 32, 768]

image_features = features.image_embeds_proj  # Projected for matching

Common workflows

Workflow 1: Image captioning pipeline

import torch

from PIL import Image

from transformers import Blip2Processor, Blip2ForConditionalGeneration

from pathlib import Path

class ImageCaptioner:

    def __init__(self, model_name="Salesforce/blip2-opt-2.7b"):

        self.processor = Blip2Processor.from_pretrained(model_name)

        self.model = Blip2ForConditionalGeneration.from_pretrained(

            model_name,

            torch_dtype=torch.float16,

            device_map="auto"

        )

    def caption(self, image_path: str, prompt: str = None) -> str:

        image = Image.open(image_path).convert("RGB")

        if prompt:

            inputs = self.processor(images=image, text=prompt, return_tensors="pt")

        else:

            inputs = self.processor(images=image, return_tensors="pt")

        inputs = inputs.to("cuda", torch.float16)

        generated_ids = self.model.generate(

            **inputs,

            max_new_tokens=50,

            num_beams=5

        )

        return self.processor.decode(generated_ids[0], skip_special_tokens=True)

    def caption_batch(self, image_paths: list, prompt: str = None) -> list:

        images = [Image.open(p).convert("RGB") for p in image_paths]

        if prompt:

            inputs = self.processor(

                images=images,

                text=[prompt] * len(images),

                return_tensors="pt",

                padding=True

            )

        else:

            inputs = self.processor(images=images, return_tensors="pt", padding=True)

        inputs = inputs.to("cuda", torch.float16)

        generated_ids = self.model.generate(**inputs, max_new_tokens=50)

        return self.processor.batch_decode(generated_ids, skip_special_tokens=True)

# Usage

captioner = ImageCaptioner()

# Single image

caption = captioner.caption("photo.jpg")

print(f"Caption: {caption}")

# With prompt for style

caption = captioner.caption("photo.jpg", "a detailed description of")

print(f"Detailed: {caption}")

# Batch processing

captions = captioner.caption_batch(["img1.jpg", "img2.jpg", "img3.jpg"])

for i, cap in enumerate(captions):

    print(f"Image {i+1}: {cap}")

Workflow 2: Visual Q&A system

class VisualQA:

    def __init__(self, model_name="Salesforce/blip2-flan-t5-xl"):

        self.processor = Blip2Processor.from_pretrained(model_name)

        self.model = Blip2ForConditionalGeneration.from_pretrained(

            model_name,

            torch_dtype=torch.float16,

            device_map="auto"

        )

        self.current_image = None

        self.current_inputs = None

    def set_image(self, image_path: str):

        """Load image for multiple questions."""

        self.current_image = Image.open(image_path).convert("RGB")

    def ask(self, question: str) -> str:

        """Ask a question about the current image."""

        if self.current_image is None:

            raise ValueError("No image set. Call set_image() first.")

        # Format question for FlanT5

        prompt = f"Question: {question} Answer:"

        inputs = self.processor(

            images=self.current_image,

            text=prompt,

            return_tensors="pt"

        ).to("cuda", torch.float16)

        generated_ids = self.model.generate(

            **inputs,

            max_new_tokens=50,

            num_beams=5

        )

        return self.processor.decode(generated_ids[0], skip_special_tokens=True)

    def ask_multiple(self, questions: list) -> dict:

        """Ask multiple questions about current image."""

        return {q: self.ask(q) for q in questions}

# Usage

vqa = VisualQA()

vqa.set_image("scene.jpg")

# Ask questions

print(vqa.ask("What objects are in this image?"))

print(vqa.ask("What is the weather like?"))

print(vqa.ask("How many people are there?"))

# Batch questions

results = vqa.ask_multiple([

    "What is the main subject?",

    "What colors are dominant?",

    "Is this indoors or outdoors?"

])

Workflow 3: Image search/retrieval

import torch

import numpy as np

from PIL import Image

from lavis.models import load_model_and_preprocess

class ImageSearchEngine:

    def __init__(self):

        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

        self.model, self.vis_processors, self.txt_processors = load_model_and_preprocess(

            name="blip2_feature_extractor",

            model_type="pretrain",

            is_eval=True,

            device=self.device

        )

        self.image_features = []

        self.image_paths = []

    def index_images(self, image_paths: list):

        """Build index from images."""

        self.image_paths = image_paths

        for path in image_paths:

            image = Image.open(path).convert("RGB")

            image = self.vis_processors["eval"](image).unsqueeze(0).to(self.device)

            with torch.no_grad():

                features = self.model.extract_features({"image": image}, mode="image")

                # Use projected features for matching

                self.image_features.append(

                    features.image_embeds_proj.mean(dim=1).cpu().numpy()

                )

        self.image_features = np.vstack(self.image_features)

    def search(self, query: str, top_k: int = 5) -> list:

        """Search images by text query."""

        # Get text features

        text = self.txt_processors["eval"](query)

        text_input = {"text_input": [text]}

        with torch.no_grad():

            text_features = self.model.extract_features(text_input, mode="text")

            text_embeds = text_features.text_embeds_proj[:, 0].cpu().numpy()

        # Compute similarities

        similarities = np.dot(self.image_features, text_embeds.T).squeeze()

        top_indices = np.argsort(similarities)[::-1][:top_k]

        return [(self.image_paths[i], similarities[i]) for i in top_indices]

# Usage

engine = ImageSearchEngine()

engine.index_images(["img1.jpg", "img2.jpg", "img3.jpg", ...])

# Search

results = engine.search("a sunset over the ocean", top_k=5)

for path, score in results:

    print(f"{path}: {score:.3f}")

Output format

Generation output

# Direct generation returns token IDs

generated_ids = model.generate(**inputs, max_new_tokens=50)

# Shape: [batch_size, sequence_length]

# Decode to text

text = processor.batch_decode(generated_ids, skip_special_tokens=True)

# Returns: list of strings

Feature extraction output

# Q-Former outputs

features = model.extract_features({"image": image}, mode="image")

features.image_embeds          # [B, 32, 768] - Q-Former outputs

features.image_embeds_proj     # [B, 32, 256] - Projected for matching

features.text_embeds          # [B, seq_len, 768] - Text features

features.text_embeds_proj     # [B, 256] - Projected text (CLS)

Performance optimization

GPU memory requirements

Model

FP16 VRAM

INT8 VRAM

INT4 VRAM

blip2-opt-2.7b

~8GB

~5GB

~3GB

blip2-opt-6.7b

~16GB

~9GB

~5GB

blip2-flan-t5-xl

~10GB

~6GB

~4GB

blip2-flan-t5-xxl

~26GB

~14GB

~8GB

Speed optimization

# Use Flash Attention if available

model = Blip2ForConditionalGeneration.from_pretrained(

    "Salesforce/blip2-opt-2.7b",

    torch_dtype=torch.float16,

    attn_implementation="flash_attention_2",  # Requires flash-attn

    device_map="auto"

)

# Compile model (PyTorch 2.0+)

model = torch.compile(model)

# Use smaller images (if quality allows)

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")

# Default is 224x224, which is optimal

Common issues

Issue

Solution

CUDA OOM

Use INT8/INT4 quantization, smaller model

Slow generation

Use greedy decoding, reduce max_new_tokens

Poor captions

Try FlanT5 variant, use prompts

Hallucinations

Lower temperature, use beam search

Wrong answers

Rephrase question, provide context

References

Resources

BrowserAct

Let your agent run on any real-world website

Bypass CAPTCHA & anti-bot for free. Start local, scale to cloud.

Explore BrowserAct Skills →

Stop writing automation&scrapers

Install the CLI. Run your first Skill in 30 seconds. Scale when you're ready.

Start free
free · no credit card