SKILL.md

Image OCR Expert

Expert in extracting, processing, and structuring text from images using OCR tools and techniques.

Description

This skill provides specialized knowledge for extracting text from images, including:

Tool and library selection by use case (Tesseract, EasyOCR, PaddleOCR, cloud APIs)

Image preprocessing to maximize OCR accuracy

Post-processing and structuring of extracted text

Handling handwriting, receipts, invoices, documents, screenshots

Multilingual OCR and special character support

Integration into Python/Node.js/cloud pipelines

Triggers: ocr, extract text from image, image to text, read text image, optical character recognition, tesseract, easyocr, paddleocr, textract, vision api, document extraction, screenshot text, invoice ocr, receipt ocr, handwriting recognition, image text extraction

Tool Selection Guide

Tool

Best For

Languages

Accuracy

Cost

Tesseract

Local, simple docs, print text

100+

Medium

Free

EasyOCR

Local, photos, multiple scripts

80+

High

Free

PaddleOCR

Local, CJK languages, tables

80+

Very High

Free

Google Vision API

Cloud, complex docs, handwriting

All

Excellent

Pay-per-use

AWS Textract

Cloud, forms, tables, invoices

Limited

Excellent

Pay-per-use

Azure Computer Vision

Cloud, general OCR

164

Excellent

Pay-per-use

Surya

Local, multilingual PDFs

90+

High

Free

Docling

Local, PDFs, structured output

Many

High

Free

Decision Tree

Is accuracy critical and budget available?

├─ YES → Google Vision API or AWS Textract

└─ NO → Local solution

    ├─ CJK (Chinese/Japanese/Korean) or tables? → PaddleOCR

    ├─ General photos or multiple languages? → EasyOCR

    ├─ Simple printed English docs? → Tesseract

    └─ PDF documents with structure? → Docling or Surya

Python Implementations

Tesseract (pytesseract)

import pytesseract

from PIL import Image

import cv2

import numpy as np

def extract_text_tesseract(image_path: str, lang: str = "eng") -> str:

    """Extract text using Tesseract. Best for clean printed documents."""

    image = Image.open(image_path)

    # Config: --psm 6 = assume uniform block of text

    config = "--psm 6 --oem 3"

    text = pytesseract.image_to_string(image, lang=lang, config=config)

    return text.strip()

def extract_with_confidence(image_path: str) -> list[dict]:

    """Extract text with bounding boxes and confidence scores."""

    image = Image.open(image_path)

    data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)

    results = []

    for i, word in enumerate(data["text"]):

        if word.strip() and int(data["conf"][i]) > 30:

            results.append({

                "text": word,

                "confidence": data["conf"][i],

                "bbox": {

                    "x": data["left"][i],

                    "y": data["top"][i],

                    "width": data["width"][i],

                    "height": data["height"][i],

                }

            })

    return results

# Install: pip install pytesseract pillow

# System: apt install tesseract-ocr (Linux) / brew install tesseract (Mac)

EasyOCR

import easyocr

from pathlib import Path

def extract_text_easyocr(

    image_path: str,

    languages: list[str] = ["en"],

    detail: bool = False

) -> str | list:

    """

    Extract text using EasyOCR. Best for photos and multiple languages.

    languages: ['en'], ['en', 'es'], ['ch_sim', 'en'], etc.

    """

    reader = easyocr.Reader(languages, gpu=False)  # gpu=True if CUDA available

    results = reader.readtext(image_path)

    if not detail:

        # Return plain text sorted by vertical position

        results_sorted = sorted(results, key=lambda x: x[0][0][1])

        return "\n".join([text for _, text, conf in results_sorted if conf > 0.3])

    return [

        {

            "text": text,

            "confidence": round(conf, 3),

            "bbox": bbox,

        }

        for bbox, text, conf in results

    ]

# Install: pip install easyocr

PaddleOCR (best for CJK and tables)

from paddleocr import PaddleOCR

import json

def extract_text_paddle(

    image_path: str,

    lang: str = "en",  # "en", "ch", "japan", "korean", "es", etc.

    use_angle_cls: bool = True,

) -> str:

    """Extract text using PaddleOCR. Best for CJK and structured documents."""

    ocr = PaddleOCR(use_angle_cls=use_angle_cls, lang=lang, show_log=False)

    result = ocr.ocr(image_path, cls=True)

    lines = []

    if result and result[0]:

        # Sort by y position (top to bottom)

        items = sorted(result[0], key=lambda x: x[0][0][1])

        lines = [item[1][0] for item in items if item[1][1] > 0.3]

    return "\n".join(lines)

# Install: pip install paddlepaddle paddleocr

Google Vision API

from google.cloud import vision

import io

def extract_text_google_vision(image_path: str) -> dict:

    """

    Extract text using Google Vision API.

    Requires: GOOGLE_APPLICATION_CREDENTIALS env var set.

    """

    client = vision.ImageAnnotatorClient()

    with io.open(image_path, "rb") as image_file:

        content = image_file.read()

    image = vision.Image(content=content)

    # Full text detection (better for documents)

    response = client.document_text_detection(image=image)

    document = response.full_text_annotation

    return {

        "text": document.text,

        "pages": [

            {

                "blocks": [

                    {

                        "text": " ".join(

                            symbol.text

                            for para in block.paragraphs

                            for word in para.words

                            for symbol in word.symbols

                        ),

                        "confidence": block.confidence,

                    }

                    for block in page.blocks

                ]

            }

            for page in document.pages

        ]

    }

# Install: pip install google-cloud-vision

AWS Textract (best for forms and invoices)

import boto3

import json

def extract_text_textract(image_path: str, region: str = "us-east-1") -> dict:

    """

    Extract text, forms, and tables using AWS Textract.

    Handles key-value pairs and structured tables automatically.

    """

    client = boto3.client("textract", region_name=region)

    with open(image_path, "rb") as f:

        image_bytes = f.read()

    response = client.analyze_document(

        Document={"Bytes": image_bytes},

        FeatureTypes=["TABLES", "FORMS"]

    )

    # Extract raw text

    blocks = response["Blocks"]

    lines = [b["Text"] for b in blocks if b["BlockType"] == "LINE"]

    # Extract key-value pairs (forms)

    key_values = {}

    key_map = {b["Id"]: b for b in blocks if b["BlockType"] == "KEY_VALUE_SET" and "KEY" in b.get("EntityTypes", [])}

    value_map = {b["Id"]: b for b in blocks if b["BlockType"] == "KEY_VALUE_SET" and "VALUE" in b.get("EntityTypes", [])}

    for key_block in key_map.values():

        key_text = _get_text_from_block(key_block, blocks)

        for rel in key_block.get("Relationships", []):

            if rel["Type"] == "VALUE":

                for val_id in rel["Ids"]:

                    if val_id in value_map:

                        val_text = _get_text_from_block(value_map[val_id], blocks)

                        key_values[key_text] = val_text

    return {

        "text": "\n".join(lines),

        "form_fields": key_values,

    }

def _get_text_from_block(block, all_blocks):

    word_ids = []

    for rel in block.get("Relationships", []):

        if rel["Type"] == "CHILD":

            word_ids.extend(rel["Ids"])

    block_map = {b["Id"]: b for b in all_blocks}

    words = [block_map[wid]["Text"] for wid in word_ids if wid in block_map and block_map[wid]["BlockType"] == "WORD"]

    return " ".join(words)

# Install: pip install boto3

Image Preprocessing

Preprocessing is the #1 factor in OCR accuracy. Always apply before running OCR.

import cv2

import numpy as np

from PIL import Image, ImageEnhance, ImageFilter

def preprocess_for_ocr(image_path: str, output_path: str = None) -> np.ndarray:

    """

    Full preprocessing pipeline for maximum OCR accuracy.

    Apply selectively based on image type.

    """

    img = cv2.imread(image_path)

    # 1. Convert to grayscale

    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # 2. Resize if too small (OCR works better at 300+ DPI)

    height, width = gray.shape

    if width < 1000:

        scale = 2000 / width

        gray = cv2.resize(gray, None, fx=scale, fy=scale, interpolation=cv2.INTER_CUBIC)

    # 3. Deskew (fix rotation)

    gray = deskew(gray)

    # 4. Denoise

    denoised = cv2.fastNlMeansDenoising(gray, h=10)

    # 5. Binarization (choose one based on lighting)

    # Option A: Otsu (uniform lighting)

    _, binary = cv2.threshold(denoised, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    # Option B: Adaptive (uneven lighting, shadows)

    # binary = cv2.adaptiveThreshold(denoised, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,

    #                                 cv2.THRESH_BINARY, 11, 2)

    # 6. Morphological cleanup (remove noise dots)

    kernel = np.ones((1, 1), np.uint8)

    cleaned = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)

    if output_path:

        cv2.imwrite(output_path, cleaned)

    return cleaned

def deskew(image: np.ndarray) -> np.ndarray:

    """Correct image rotation using projection analysis."""

    coords = np.column_stack(np.where(image > 0))

    angle = cv2.minAreaRect(coords)[-1]

    if angle < -45:

        angle = -(90 + angle)

    else:

        angle = -angle

    if abs(angle) < 0.5:  # Skip if nearly straight

        return image

    h, w = image.shape

    center = (w // 2, h // 2)

    M = cv2.getRotationMatrix2D(center, angle, 1.0)

    return cv2.warpAffine(image, M, (w, h), flags=cv2.INTER_CUBIC,

                          borderMode=cv2.BORDER_REPLICATE)

def enhance_contrast(image_path: str) -> Image.Image:

    """Enhance contrast using PIL - useful for faded text."""

    img = Image.open(image_path).convert("L")

    enhancer = ImageEnhance.Contrast(img)

    return enhancer.enhance(2.0)

# Install: pip install opencv-python pillow

Preprocessing Decision Guide

Image Problem

Solution

Rotated/skewed text

deskew()

Low resolution

Upscale 2x with cv2.INTER_CUBIC

Uneven lighting/shadows

Adaptive thresholding

Uniform background

Otsu thresholding

Noisy/grainy

fastNlMeansDenoising

Faded text

PIL Contrast enhancer

Color background

Convert to grayscale first

Handwriting

Skip binarization, use cloud API

PDF to Text Extraction

import fitz  # PyMuPDF - for native text extraction

from pdf2image import convert_from_path  # for scanned PDFs

import pytesseract

def extract_pdf_text(pdf_path: str, ocr_fallback: bool = True) -> str:

    """

    Smart PDF extraction:

    - Uses native text layer if available (fast, accurate)

    - Falls back to OCR for scanned pages

    """

    doc = fitz.open(pdf_path)

    full_text = []

    for page_num, page in enumerate(doc):

        # Try native text extraction first

        text = page.get_text().strip()

        if text and len(text) > 50:

            full_text.append(text)

        elif ocr_fallback:

            # Scanned page — render and OCR

            pix = page.get_pixmap(dpi=300)

            img_path = f"/tmp/page_{page_num}.png"

            pix.save(img_path)

            ocr_text = pytesseract.image_to_string(img_path)

            full_text.append(ocr_text)

    doc.close()

    return "\n\n".join(full_text)

# Install: pip install PyMuPDF pdf2image pytesseract

# System: apt install poppler-utils (for pdf2image on Linux)

Post-Processing Extracted Text

import re

from difflib import SequenceMatcher

def clean_ocr_text(text: str) -> str:

    """Standard cleanup for OCR output."""

    # Remove non-printable characters

    text = re.sub(r"[^\x20-\x7E\n\t]", "", text)

    # Normalize whitespace

    text = re.sub(r" +", " ", text)

    text = re.sub(r"\n{3,}", "\n\n", text)

    # Fix common OCR misreads

    corrections = {

        r"\b0(?=[a-zA-Z])": "O",    # 0 misread as O before letter

        r"(?<=[a-zA-Z])0\b": "O",    # O misread as 0 after letter

        r"\bl\b": "I",               # lowercase l misread as I (context-dependent)

        r"rn": "m",                  # rn → m (common serif font error)

    }

    for pattern, replacement in corrections.items():

        text = re.sub(pattern, replacement, text)

    return text.strip()

def extract_structured_data(text: str) -> dict:

    """Extract common structured fields from OCR text."""

    patterns = {

        "email": r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b",

        "phone": r"[\+]?[(]?[0-9]{3}[)]?[-\s\.]?[0-9]{3}[-\s\.]?[0-9]{4,6}",

        "date": r"\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b",

        "amount": r"\$\s?\d+(?:,\d{3})*(?:\.\d{2})?",

        "url": r"https?://[^\s]+",

    }

    return {

        field: re.findall(pattern, text)

        for field, pattern in patterns.items()

    }

def merge_multiline_words(text: str) -> str:

    """Fix hyphenated words split across lines (common in PDFs)."""

    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)

Node.js / TypeScript

// Using Tesseract.js (pure JS, no native deps needed)

import Tesseract from "tesseract.js";

async function extractText(imagePath: string, lang = "eng"): Promise<string> {

  const { data } = await Tesseract.recognize(imagePath, lang, {

    logger: () => {}, // suppress progress logs

  });

  return data.text.trim();

}

// With confidence filtering

async function extractWithConfidence(imagePath: string) {

  const { data } = await Tesseract.recognize(imagePath, "eng");

  return data.words

    .filter((word) => word.confidence > 70)

    .map((word) => ({

      text: word.text,

      confidence: word.confidence,

      bbox: word.bbox,

    }));

}

// Install: npm install tesseract.js

// Using Google Vision API from Node.js

import vision from "@google-cloud/vision";

const client = new vision.ImageAnnotatorClient();

async function extractTextCloud(imagePath: string): Promise<string> {

  const [result] = await client.documentTextDetection(imagePath);

  return result.fullTextAnnotation?.text ?? "";

}

// Install: npm install @google-cloud/vision

Claude Vision API for OCR

Use Claude's vision capability when you need structured extraction + understanding:

import anthropic

import base64

from pathlib import Path

def extract_with_claude(image_path: str, instruction: str = None) -> str:

    """

    Use Claude to extract and structure text from an image.

    Best when you need semantic understanding, not just raw text.

    """

    client = anthropic.Anthropic()

    image_data = base64.standard_b64encode(Path(image_path).read_bytes()).decode()

    ext = Path(image_path).suffix.lower()

    media_types = {".jpg": "image/jpeg", ".jpeg": "image/jpeg", ".png": "image/png", ".webp": "image/webp"}

    media_type = media_types.get(ext, "image/jpeg")

    prompt = instruction or (

        "Extract ALL text from this image exactly as it appears. "

        "Preserve the original structure, line breaks, and formatting. "

        "Return only the extracted text, nothing else."

    )

    message = client.messages.create(

        model="claude-opus-4-6",

        max_tokens=4096,

        messages=[

            {

                "role": "user",

                "content": [

                    {

                        "type": "image",

                        "source": {

                            "type": "base64",

                            "media_type": media_type,

                            "data": image_data,

                        },

                    },

                    {"type": "text", "text": prompt},

                ],

            }

        ],

    )

    return message.content[0].text

# Example: structured invoice extraction

def extract_invoice(image_path: str) -> dict:

    result = extract_with_claude(

        image_path,

        instruction="""Extract all data from this invoice and return as JSON:

{

  "invoice_number": "",

  "date": "",

  "vendor": {"name": "", "address": "", "email": ""},

  "items": [{"description": "", "quantity": 0, "unit_price": 0, "total": 0}],

  "subtotal": 0,

  "tax": 0,

  "total": 0

}

Return only valid JSON, no explanation."""

    )

    import json

    return json.loads(result)

When to Use Claude vs Traditional OCR

Scenario

Use Claude

Use Traditional OCR

Extract + understand structure

✅

❌

Invoice/receipt parsing

✅

❌ (Textract is also good)

Handwriting with context

✅

❌

Large volume (1000s of images)

❌ (cost)

✅

Simple raw text extraction

❌ (overkill)

✅

Tables with complex structure

✅

PaddleOCR / Textract

Real-time / low latency

❌

✅

Accuracy Benchmarks by Image Type

Image Type

Tesseract

EasyOCR

PaddleOCR

Google Vision

Printed documents (clean)

95%

97%

99%

Screenshots

90%

95%

98%

Photos of documents

70%

88%

90%

97%

Handwriting

40%

55%

85%

Low res / blurry

45%

70%

72%

80%

Receipts / invoices

75%

85%

88%

97%

Chinese/Japanese/Korean

60%*

85%

95%

99%

*Requires additional language pack installation

Common Errors and Fixes

Tesseract returns garbage text

Cause: Image too small or too noisy

Fix: Upscale 2x, apply denoising and binarization

EasyOCR misses text in columns

Cause: Default layout analysis fails on multi-column

Fix: Crop each column separately and OCR individually

PaddleOCR slow on CPU

Cause: Large model loaded

Fix: Use use_gpu=True if available, or use_angle_cls=False for horizontal text

Bounding boxes don't align with text

Cause: Image was rotated before OCR

Fix: Apply deskew() in preprocessing

Cloud API returns empty for some regions

Cause: Low contrast or very small text

Fix: Preprocess image, increase DPI, crop region of interest

PDF text layer has wrong encoding

Cause: Non-standard font embedding

Fix: Use fitz.Page.get_text("rawdict") to inspect encoding, or skip to OCR fallback

Quick Start Templates

Minimal local OCR (Python)

pip install easyocr

python -c "import easyocr; r=easyocr.Reader(['en']); print('\n'.join([t for _,t,c in r.readtext('image.png') if c>0.3]))"

Minimal cloud OCR (Node.js)

npm install tesseract.js

node -e "const T=require('tesseract.js'); T.recognize('image.png','eng').then(r=>console.log(r.data.text))"

Batch processing pipeline

from pathlib import Path

import easyocr

reader = easyocr.Reader(["en"], gpu=False)

def batch_ocr(folder: str, output_folder: str) -> None:

    Path(output_folder).mkdir(exist_ok=True)

    images = list(Path(folder).glob("*.{png,jpg,jpeg,tiff,bmp}"))

    for img_path in images:

        results = reader.readtext(str(img_path))

        text = "\n".join(t for _, t, c in results if c > 0.3)

        out_path = Path(output_folder) / f"{img_path.stem}.txt"

        out_path.write_text(text, encoding="utf-8")

        print(f"✓ {img_path.name} → {out_path.name}")

    print(f"\nProcessed {len(images)} images.")

batch_ocr("./images", "./output")

Rules

Select the OCR engine based on the document type and accuracy requirements before writing code: Tesseract for local/offline simple documents, EasyOCR for multilingual handwriting, cloud APIs (Google Vision, AWS Textract) for production accuracy on structured documents

Image preprocessing (grayscale conversion, binarization, deskew) is required before Tesseract and EasyOCR for non-ideal inputs — skipping it causes significant accuracy degradation

OCR output must always be treated as unvalidated text — apply post-processing (regex, string normalization) before using extracted values in business logic

Never pass sensitive document images to cloud OCR APIs without confirming data privacy and compliance requirements with the project owner

Confidence scores from the OCR engine must be checked; results below the project-defined threshold must be flagged for human review rather than accepted automatically

image-ocr

SKILL.md

Image OCR Expert

Description

Tool Selection Guide

Decision Tree

Python Implementations

Tesseract (pytesseract)

EasyOCR

PaddleOCR (best for CJK and tables)

Google Vision API

AWS Textract (best for forms and invoices)

Image Preprocessing

Preprocessing Decision Guide

PDF to Text Extraction

Post-Processing Extracted Text

Node.js / TypeScript

Claude Vision API for OCR

When to Use Claude vs Traditional OCR

Accuracy Benchmarks by Image Type

Common Errors and Fixes

Tesseract returns garbage text

EasyOCR misses text in columns

PaddleOCR slow on CPU

Bounding boxes don't align with text

Cloud API returns empty for some regions

PDF text layer has wrong encoding

Quick Start Templates

Minimal local OCR (Python)

Minimal cloud OCR (Node.js)

Batch processing pipeline

Rules

Stop writing automation&scrapers

image-ocr

SKILL.md

Image OCR Expert

Description

Tool Selection Guide

Decision Tree

Python Implementations

Tesseract (pytesseract)

EasyOCR

PaddleOCR (best for CJK and tables)

Google Vision API

AWS Textract (best for forms and invoices)

Image Preprocessing

Preprocessing Decision Guide

PDF to Text Extraction

Post-Processing Extracted Text

Node.js / TypeScript

Claude Vision API for OCR

When to Use Claude vs Traditional OCR

Accuracy Benchmarks by Image Type

Common Errors and Fixes

Tesseract returns garbage text

EasyOCR misses text in columns

PaddleOCR slow on CPU

Bounding boxes don't align with text

Cloud API returns empty for some regions

PDF text layer has wrong encoding

Quick Start Templates

Minimal local OCR (Python)

Minimal cloud OCR (Node.js)

Batch processing pipeline

Rules

Let your agent run on any real-world website

Related skills

Stop writing automation&scrapers