SKILL.md

Kreuzberg Document Extraction

Name: kreuzberg
Author: kreuzberg-dev

Kreuzberg is a high-performance document intelligence library with a Rust core and native bindings for Python, Node.js/TypeScript, Ruby, Go, Java, C#, PHP, and Elixir. It extracts text, tables, metadata, and images from 91+ file formats including PDF, Office documents, images (with OCR), HTML, email, archives, and academic formats.

Use this skill when writing code that:

Extracts text or metadata from documents

Performs OCR on scanned documents or images

Batch-processes multiple files

Configures extraction options (output format, chunking, OCR, language detection)

Implements custom plugins (post-processors, validators, OCR backends)

Installation

Python

pip install kreuzberg

# Optional OCR backends:

pip install kreuzberg[easyocr]    # EasyOCR

Node.js

npm install @kreuzberg/node

Rust

# Cargo.toml

[dependencies]

kreuzberg = { version = "4", features = ["tokio-runtime"] }

# features: tokio-runtime (required for sync + batch), pdf, ocr, chunking,

#           embeddings, language-detection, keywords-yake, keywords-rake

CLI

# Download from GitHub releases, or:

cargo install kreuzberg-cli

Quick Start

Python (Async)

from kreuzberg import extract_file

result = await extract_file("document.pdf")

print(result.content)       # extracted text

print(result.metadata)      # document metadata

print(result.tables)        # extracted tables

Python (Sync)

from kreuzberg import extract_file_sync

result = extract_file_sync("document.pdf")

print(result.content)

Node.js

import { extractFile } from "@kreuzberg/node";

const result = await extractFile("document.pdf");

console.log(result.content);

console.log(result.metadata);

console.log(result.tables);

Node.js (Sync)

import { extractFileSync } from "@kreuzberg/node";

const result = extractFileSync("document.pdf");

Rust (Async)

use kreuzberg::{extract_file, ExtractionConfig};

#[tokio::main]

async fn main() -> kreuzberg::Result<()> {

    let config = ExtractionConfig::default();

    let result = extract_file("document.pdf", None, &#x26;config).await?;

    println!("{}", result.content);

    Ok(())

}

Rust (Sync) — requires tokio-runtime feature

use kreuzberg::{extract_file_sync, ExtractionConfig};

fn main() -> kreuzberg::Result<()> {

    let config = ExtractionConfig::default();

    let result = extract_file_sync("document.pdf", None, &#x26;config)?;

    println!("{}", result.content);

    Ok(())

}

CLI

kreuzberg extract document.pdf

kreuzberg extract document.pdf --format json

kreuzberg extract document.pdf --output-format markdown

Configuration

All languages use the same configuration structure with language-appropriate naming conventions.

Python (snake_case)

from kreuzberg import (

    ExtractionConfig, OcrConfig, TesseractConfig,

    PdfConfig, ChunkingConfig,

)

config = ExtractionConfig(

    ocr=OcrConfig(

        backend="tesseract",

        language="eng",

        tesseract_config=TesseractConfig(psm=6, enable_table_detection=True),

    ),

    pdf_options=PdfConfig(passwords=["secret123"]),

    chunking=ChunkingConfig(max_chars=1000, max_overlap=200),

    output_format="markdown",

)

result = await extract_file("document.pdf", config=config)

Node.js (camelCase)

import { extractFile, type ExtractionConfig } from "@kreuzberg/node";

const config: ExtractionConfig = {

  ocr: { backend: "tesseract", language: "eng" },

  pdfOptions: { passwords: ["secret123"] },

  chunking: { maxChars: 1000, maxOverlap: 200 },

  outputFormat: "markdown",

};

const result = await extractFile("document.pdf", null, config);

Rust (snake_case)

use kreuzberg::{ExtractionConfig, OcrConfig, ChunkingConfig, OutputFormat};

let config = ExtractionConfig {

    ocr: Some(OcrConfig {

        backend: "tesseract".into(),

        language: "eng".into(),

        ..Default::default()

    }),

    chunking: Some(ChunkingConfig {

        max_characters: 1000,

        overlap: 200,

        ..Default::default()

    }),

    output_format: OutputFormat::Markdown,

    ..Default::default()

};

let result = extract_file("document.pdf", None, &#x26;config).await?;

Config File (TOML)

output_format = "markdown"

[ocr]

backend = "tesseract"

language = "eng"

[chunking]

max_chars = 1000

max_overlap = 200

[pdf_options]

passwords = ["secret123"]

# CLI: auto-discovers kreuzberg.toml in current/parent directories

kreuzberg extract doc.pdf

# or explicit:

kreuzberg extract doc.pdf --config kreuzberg.toml

kreuzberg extract doc.pdf --config-json '{"ocr":{"backend":"tesseract","language":"deu"}}'

Batch Processing

Python

from kreuzberg import batch_extract_files, batch_extract_files_sync

# Async

results = await batch_extract_files(["doc1.pdf", "doc2.docx", "doc3.xlsx"])

# Sync

results = batch_extract_files_sync(["doc1.pdf", "doc2.docx"])

for result in results:

    print(f"{len(result.content)} chars extracted")

Node.js

import { batchExtractFiles } from "@kreuzberg/node";

const results = await batchExtractFiles(["doc1.pdf", "doc2.docx"]);

Rust — requires tokio-runtime feature

use kreuzberg::{batch_extract_file, ExtractionConfig};

let config = ExtractionConfig::default();

let paths = vec!["doc1.pdf", "doc2.docx"];

let results = batch_extract_file(paths, &#x26;config).await?;

CLI

kreuzberg batch *.pdf --format json

kreuzberg batch docs/*.docx --output-format markdown

OCR

OCR runs automatically for images and scanned PDFs. Tesseract is the default backend (native binding, no external install required).

Backends

Tesseract (default): Built-in native binding. All Tesseract languages supported.

EasyOCR (Python only): pip install kreuzberg[easyocr]. Pass easyocr_kwargs={"gpu": True}.

PaddleOCR (Python only): Bundled since 4.8.5, no extra install needed. Pass paddleocr_kwargs={"use_angle_cls": True}.

Guten (Node.js only): Built-in OCR backend via GutenOcrBackend.

Language Codes

config = ExtractionConfig(ocr=OcrConfig(language="eng"))       # English

config = ExtractionConfig(ocr=OcrConfig(language="eng+deu"))   # Multiple

config = ExtractionConfig(ocr=OcrConfig(language="all"))       # All installed

Force OCR

config = ExtractionConfig(force_ocr=True)  # OCR even if text is extractable

ExtractionResult Fields

Field

Python

Node.js

Rust

Description

Text content

result.content

Extracted text (str/String)

MIME type

result.mime_type

result.mimeType

result.mime_type

Input document MIME type

Metadata

result.metadata

Document metadata (dict/object/HashMap)

Tables

result.tables

Extracted tables with cells + markdown

Languages

result.detected_languages

result.detectedLanguages

result.detected_languages

Detected languages (if enabled)

Chunks

result.chunks

Text chunks (if chunking enabled)

Images

result.images

Extracted images (if enabled)

Elements

result.elements

Semantic elements (if element_based format)

Pages

result.pages

Per-page content (if page extraction enabled)

Keywords

result.keywords

Extracted keywords (if enabled)

Error Handling

Python

from kreuzberg import (

    extract_file_sync, KreuzbergError, ParsingError,

    OCRError, ValidationError, MissingDependencyError,

)

try:

    result = extract_file_sync("file.pdf")

except ParsingError as e:

    print(f"Failed to parse: {e}")

except OCRError as e:

    print(f"OCR failed: {e}")

except ValidationError as e:

    print(f"Invalid input: {e}")

except MissingDependencyError as e:

    print(f"Missing dependency: {e}")

except KreuzbergError as e:

    print(f"Extraction failed: {e}")

Node.js

import {

  extractFile,

  KreuzbergError,

  ParsingError,

  OcrError,

  ValidationError,

  MissingDependencyError,

} from "@kreuzberg/node";

try {

  const result = await extractFile("file.pdf");

} catch (e) {

  if (e instanceof ParsingError) {

    /* ... */

  } else if (e instanceof OcrError) {

    /* ... */

  } else if (e instanceof ValidationError) {

    /* ... */

  } else if (e instanceof KreuzbergError) {

    /* ... */

  }

}

Rust

use kreuzberg::{extract_file, ExtractionConfig, KreuzbergError};

let config = ExtractionConfig::default();

match extract_file("file.pdf", None, &#x26;config).await {

    Ok(result) => println!("{}", result.content),

    Err(KreuzbergError::Parsing(msg)) => eprintln!("Parse error: {msg}"),

    Err(KreuzbergError::Ocr(msg)) => eprintln!("OCR error: {msg}"),

    Err(e) => eprintln!("Error: {e}"),

}

Common Pitfalls

Python ChunkingConfig fields: Use max_chars and max_overlap, NOT max_characters or overlap.

Rust extract_file signature: Third argument is &ExtractionConfig (a reference), not Option. Use &ExtractionConfig::default() for defaults.

Rust feature gates: extract_file_sync, batch_extract_file, and batch_extract_file_sync all require features = ["tokio-runtime"] in Cargo.toml.

Rust async context: extract_file is async. Use #[tokio::main] or call from an async context.

CLI --format vs --output-format: --format controls CLI output (text/json). --output-format controls content format (plain/markdown/djot/html).

Node.js extractFile signature: extractFile(path, mimeType?, config?) — mimeType is the second arg (pass null to skip).

Python detect_mime_type: The function for detecting from bytes is detect_mime_type(data). For paths use detect_mime_type_from_path(path).

Config file field names: Use snake_case in TOML/YAML/JSON config files (e.g., max_chars, max_overlap, pdf_options).

Supported Formats (Summary)

Additional Resources

Detailed reference files for specific topics:

Python API Reference — All functions, config classes, plugin protocols, exact signatures

Node.js API Reference — All functions, TypeScript interfaces, worker pool APIs

Rust API Reference — All functions with feature gates, structs, Cargo.toml examples

CLI Reference — All commands, flags, config precedence, exit codes

Configuration Reference — TOML/YAML/JSON formats, auto-discovery, env vars, full schema

Supported Formats — All 85+ formats with file extensions and MIME types

Advanced Features — Plugins, embeddings, MCP server, API server, security limits

Other Language Bindings — Go, Ruby, Java, C#, PHP, Elixir, WASM, Docker

Full documentation: https://docs.kreuzberg.dev

GitHub: https://github.com/kreuzberg-dev/kreuzberg

kreuzberg

SKILL.md

Kreuzberg Document Extraction

Installation

Python

Node.js

Rust

CLI

Quick Start

Python (Async)

Python (Sync)

Node.js

Node.js (Sync)

Rust (Async)

Rust (Sync) — requires tokio-runtime feature

CLI

Configuration

Python (snake_case)

Node.js (camelCase)

Rust (snake_case)

Config File (TOML)

Batch Processing

Python

Node.js

Rust — requires tokio-runtime feature

CLI

OCR

Backends

Language Codes

Force OCR

ExtractionResult Fields

Error Handling

Python

Node.js

Rust

Common Pitfalls

Supported Formats (Summary)

Additional Resources

Let your agent run on any real-world website

Related skills

Stop writing automation&scrapers