kreuzberg

INSTALLATION
npx skills add https://github.com/kreuzberg-dev/kreuzberg --skill kreuzberg
Run in your project or agent environment. Adjust flags if your CLI version differs.

SKILL.md

Kreuzberg Document Extraction

Kreuzberg is a high-performance document intelligence library with a Rust core and native bindings for Python, Node.js/TypeScript, Ruby, Go, Java, C#, PHP, and Elixir. It extracts text, tables, metadata, and images from 91+ file formats including PDF, Office documents, images (with OCR), HTML, email, archives, and academic formats.

Use this skill when writing code that:

  • Extracts text or metadata from documents
  • Performs OCR on scanned documents or images
  • Batch-processes multiple files
  • Configures extraction options (output format, chunking, OCR, language detection)
  • Implements custom plugins (post-processors, validators, OCR backends)

Installation

Python

pip install kreuzberg
# Optional OCR backends:
pip install kreuzberg[easyocr]    # EasyOCR

Node.js

npm install @kreuzberg/node

Rust

# Cargo.toml
[dependencies]
kreuzberg = { version = "4", features = ["tokio-runtime"] }
# features: tokio-runtime (required for sync + batch), pdf, ocr, chunking,
#           embeddings, language-detection, keywords-yake, keywords-rake

CLI

# Download from GitHub releases, or:
cargo install kreuzberg-cli

Quick Start

Python (Async)

from kreuzberg import extract_file

result = await extract_file("document.pdf")
print(result.content)       # extracted text
print(result.metadata)      # document metadata
print(result.tables)        # extracted tables

Python (Sync)

from kreuzberg import extract_file_sync

result = extract_file_sync("document.pdf")
print(result.content)

Node.js

import { extractFile } from "@kreuzberg/node";

const result = await extractFile("document.pdf");
console.log(result.content);
console.log(result.metadata);
console.log(result.tables);

Node.js (Sync)

import { extractFileSync } from "@kreuzberg/node";

const result = extractFileSync("document.pdf");

Rust (Async)

use kreuzberg::{extract_file, ExtractionConfig};

#[tokio::main]
async fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig::default();
    let result = extract_file("document.pdf", None, &config).await?;
    println!("{}", result.content);
    Ok(())
}

Rust (Sync) — requires tokio-runtime feature

use kreuzberg::{extract_file_sync, ExtractionConfig};

fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig::default();
    let result = extract_file_sync("document.pdf", None, &config)?;
    println!("{}", result.content);
    Ok(())
}

CLI

kreuzberg extract document.pdf
kreuzberg extract document.pdf --format json
kreuzberg extract document.pdf --output-format markdown

Configuration

All languages use the same configuration structure with language-appropriate naming conventions.

Python (snake_case)

from kreuzberg import (
    ExtractionConfig, OcrConfig, TesseractConfig,
    PdfConfig, ChunkingConfig, extract_file,
)

config = ExtractionConfig(
    ocr=OcrConfig(
        backend="tesseract",
        language="eng",
        tesseract_config=TesseractConfig(psm=6, enable_table_detection=True),
    ),
    pdf_options=PdfConfig(passwords=["secret123"]),
    chunking=ChunkingConfig(max_chars=1000, max_overlap=200),
    output_format="markdown",
)

result = await extract_file("document.pdf", config=config)

Node.js (camelCase)

import { extractFile, type ExtractionConfig } from "@kreuzberg/node";

const config: ExtractionConfig = {
  ocr: { backend: "tesseract", language: "eng" },
  pdfOptions: { passwords: ["secret123"] },
  chunking: { maxChars: 1000, maxOverlap: 200 },
  outputFormat: "markdown",
};

const result = await extractFile("document.pdf", null, config);

Rust (snake_case)

use kreuzberg::{extract_file, ChunkingConfig, ExtractionConfig, OcrConfig, OutputFormat};

let config = ExtractionConfig {
    ocr: Some(OcrConfig {
        backend: "tesseract".into(),
        language: "eng".into(),
        ..Default::default()
    }),
    chunking: Some(ChunkingConfig {
        max_characters: 1000,
        overlap: 200,
        ..Default::default()
    }),
    output_format: OutputFormat::Markdown,
    ..Default::default()
};

let result = extract_file("document.pdf", None, &config).await?;

Config File (TOML)

output_format = "markdown"

[ocr]
backend = "tesseract"
language = "eng"

[chunking]
max_chars = 1000
max_overlap = 200

[pdf_options]
passwords = ["secret123"]

# CLI: auto-discovers kreuzberg.toml in current/parent directories
kreuzberg extract doc.pdf
# or explicit:
kreuzberg extract doc.pdf --config kreuzberg.toml
kreuzberg extract doc.pdf --config-json '{"ocr":{"backend":"tesseract","language":"deu"}}'
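
Hand-writing the JSON for --config-json is easy to get wrong once the config grows, because the shell and JSON both use quotes. The sketch below builds the same command line from a Python dict. The helper name build_extract_command is hypothetical and not part of Kreuzberg; only the kreuzberg extract command and the --config-json flag come from the examples above.

```python
import json
import shlex


def build_extract_command(path: str, config: dict) -> str:
    """Return a shell-quoted `kreuzberg extract` command with inline JSON config.

    Hypothetical helper for illustration; it only assembles a string.
    """
    # Compact separators keep the argument short; sort_keys makes it reproducible.
    config_json = json.dumps(config, separators=(",", ":"), sort_keys=True)
    # shlex.join quotes each argument so the JSON survives the shell intact.
    return shlex.join(["kreuzberg", "extract", path, "--config-json", config_json])


command = build_extract_command(
    "doc.pdf", {"ocr": {"backend": "tesseract", "language": "deu"}}
)
print(command)
```

The resulting string can be pasted into a POSIX shell. When launching the CLI from Python, passing the argument list straight to subprocess.run avoids shell quoting altogether.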

Batch Processing

Python

from kreuzberg import batch_extract_files, batch_extract_files_sync

# Async
results = await batch_extract_files(["doc1.pdf", "doc2.docx", "doc3.xlsx"])

# Sync
results = batch_extract_files_sync(["doc1.pdf", "doc2.docx"])

for result in results:
    print(f"{len(result.content)} chars extracted")

Node.js

import { batchExtractFiles } from "@kreuzberg/node";

const results = await batchExtractFiles(["doc1.pdf", "doc2.docx"]);

Rust — requires tokio-runtime feature

use kreuzberg::{batch_extract_file, ExtractionConfig};

let config = ExtractionConfig::default();
let paths = vec!["doc1.pdf", "doc2.docx"];
let results = batch_extract_file(paths, &config).await?;

CLI

kreuzberg batch *.pdf --format json
kreuzberg batch docs/*.docx --output-format markdown
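
For very large file lists, it can help to feed the batch function a bounded group of paths at a time and keep each result paired with its path. The sketch below shows one way to do that. The helper extract_in_groups is hypothetical, the batch function is injected by the caller, and the pairing assumes results come back in the same order as the input paths, which this document does not confirm.

```python
from typing import Callable, Iterator, Sequence, Tuple


def extract_in_groups(
    paths: Sequence[str],
    extract_batch: Callable[[list], list],
    group_size: int = 50,
) -> Iterator[Tuple[str, object]]:
    """Yield (path, result) pairs, calling extract_batch on group_size paths at a time.

    Assumption: extract_batch returns one result per path, in input order.
    """
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    for start in range(0, len(paths), group_size):
        group = list(paths[start:start + group_size])
        results = extract_batch(group)
        if len(results) != len(group):
            raise RuntimeError("batch function returned an unexpected number of results")
        yield from zip(group, results)


# Stand-in batch function so the sketch runs without Kreuzberg installed.
def fake_batch(group: list) -> list:
    return [f"text of {path}" for path in group]


for path, result in extract_in_groups(["a.pdf", "b.docx", "c.xlsx"], fake_batch, group_size=2):
    print(path, "->", result)
```

In real use, batch_extract_files_sync from the Python example above would be passed in place of fake_batch.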

OCR

OCR runs automatically for images and scanned PDFs. Tesseract is the default backend (native binding, no external install required).

Backends

  • Tesseract (default): Built-in native binding. All Tesseract languages supported.
  • EasyOCR (Python only): pip install kreuzberg[easyocr]. Pass easyocr_kwargs={"gpu": True}.
  • PaddleOCR (Python only): Bundled since 4.8.5, no extra install needed. Pass paddleocr_kwargs={"use_angle_cls": True}.
  • Guten (Node.js only): Built-in OCR backend via GutenOcrBackend.

Language Codes

config = ExtractionConfig(ocr=OcrConfig(language="eng"))       # English
config = ExtractionConfig(ocr=OcrConfig(language="eng+deu"))   # Multiple
config = ExtractionConfig(ocr=OcrConfig(language="all"))       # All installed
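
When the language list comes from user input or application settings, the "eng+deu" form shown above can be assembled from a list. The helper join_ocr_languages below is a hypothetical convenience, not a Kreuzberg function. It only builds the string and does not check that a code is a valid or installed Tesseract language.

```python
from typing import Iterable


def join_ocr_languages(codes: Iterable[str]) -> str:
    """Join language codes into the plus-separated form, e.g. "eng+deu".

    Blank entries are skipped and duplicates are dropped, keeping first-seen order.
    """
    cleaned = []
    for code in codes:
        code = code.strip()
        if code and code not in cleaned:
            cleaned.append(code)
    if not cleaned:
        raise ValueError("at least one language code is required")
    return "+".join(cleaned)


print(join_ocr_languages(["eng", "deu"]))
```

The returned string can then be passed as the language argument of OcrConfig.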

Force OCR

config = ExtractionConfig(force_ocr=True)  # OCR even if text is extractable

ExtractionResult Fields

| Field | Python | Node.js | Rust | Description |
| --- | --- | --- | --- | --- |
| Text content | result.content | result.content | result.content | Extracted text (str/String) |
| MIME type | result.mime_type | result.mimeType | result.mime_type | Input document MIME type |
| Metadata | result.metadata | result.metadata | result.metadata | Document metadata (dict/object/HashMap) |
| Tables | result.tables | result.tables | result.tables | Extracted tables with cells + markdown |
| Languages | result.detected_languages | result.detectedLanguages | result.detected_languages | Detected languages (if enabled) |
| Chunks | result.chunks | result.chunks | result.chunks | Text chunks (if chunking enabled) |
| Images | result.images | result.images | result.images | Extracted images (if enabled) |
| Elements | result.elements | result.elements | result.elements | Semantic elements (if element_based format) |
| Pages | result.pages | result.pages | result.pages | Per-page content (if page extraction enabled) |
| Keywords | result.keywords | result.keywords | result.keywords | Extracted keywords (if enabled) |
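
Several of these fields are populated only when the matching feature is enabled, so code that reads them is safer when it tolerates missing or empty values. The sketch below summarizes a result defensively. summarize_result is a hypothetical helper, and whether a disabled field is None, empty, or absent is an assumption, so the code handles all three. A SimpleNamespace stands in for a real result so the sketch runs without Kreuzberg.

```python
from types import SimpleNamespace


def summarize_result(result) -> dict:
    """Return basic counts from an extraction result, tolerating unset optional fields."""
    # getattr with a default covers absent attributes; `or []` covers None.
    tables = getattr(result, "tables", None) or []
    chunks = getattr(result, "chunks", None) or []
    languages = getattr(result, "detected_languages", None) or []
    return {
        "characters": len(result.content),
        "tables": len(tables),
        "chunks": len(chunks),
        "languages": list(languages),
    }


# Stand-in object that mimics the Python field names from the table.
demo = SimpleNamespace(content="hello", tables=[], chunks=None, detected_languages=["eng"])
print(summarize_result(demo))
```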

Error Handling

Python

from kreuzberg import (
    extract_file_sync, KreuzbergError, ParsingError,
    OCRError, ValidationError, MissingDependencyError,
)

try:
    result = extract_file_sync("file.pdf")
except ParsingError as e:
    print(f"Failed to parse: {e}")
except OCRError as e:
    print(f"OCR failed: {e}")
except ValidationError as e:
    print(f"Invalid input: {e}")
except MissingDependencyError as e:
    print(f"Missing dependency: {e}")
except KreuzbergError as e:
    print(f"Extraction failed: {e}")
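
The order of the except clauses matters: the example lists the specific errors first and KreuzbergError last, which suggests KreuzbergError is the base class of the others. The sketch below illustrates that pattern with local stand-in classes that reuse the same names. The subclass relationship is an assumption inferred from the ordering above, and these classes are not imported from Kreuzberg.

```python
# Stand-in hierarchy for illustration only (assumed, not imported from kreuzberg).
class KreuzbergError(Exception):
    pass


class ParsingError(KreuzbergError):
    pass


class OCRError(KreuzbergError):
    pass


def describe_failure(action) -> str:
    """Run action() and report which handler caught its exception, if any."""
    try:
        action()
    except ParsingError as e:      # specific handlers come first
        return f"parse: {e}"
    except OCRError as e:
        return f"ocr: {e}"
    except KreuzbergError as e:    # base class last, as a catch-all
        return f"other: {e}"
    return "ok"


def failing_with(exc: Exception):
    """Build a zero-argument callable that raises exc."""
    def action():
        raise exc
    return action


print(describe_failure(failing_with(ParsingError("bad header"))))
print(describe_failure(failing_with(KreuzbergError("unknown"))))
```

If the base-class handler were listed first, it would catch every subclass and the specific handlers would never run.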

Node.js

import {
  extractFile,
  KreuzbergError,
  ParsingError,
  OcrError,
  ValidationError,
  MissingDependencyError,
} from "@kreuzberg/node";

try {
  const result = await extractFile("file.pdf");
} catch (e) {
  if (e instanceof ParsingError) {
    /* ... */
  } else if (e instanceof OcrError) {
    /* ... */
  } else if (e instanceof ValidationError) {
    /* ... */
  } else if (e instanceof KreuzbergError) {
    /* ... */
  }
}

Rust

use kreuzberg::{extract_file, ExtractionConfig, KreuzbergError};

let config = ExtractionConfig::default();

match extract_file("file.pdf", None, &config).await {
    Ok(result) => println!("{}", result.content),
    Err(KreuzbergError::Parsing(msg)) => eprintln!("Parse error: {msg}"),
    Err(KreuzbergError::Ocr(msg)) => eprintln!("OCR error: {msg}"),
    Err(e) => eprintln!("Error: {e}"),
}

Common Pitfalls

  • Python ChunkingConfig fields: Use max_chars and max_overlap, NOT max_characters or overlap.
  • Rust extract_file signature: Third argument is &ExtractionConfig (a reference), not Option. Use &ExtractionConfig::default() for defaults.
  • Rust feature gates: extract_file_sync, batch_extract_file, and batch_extract_file_sync all require features = ["tokio-runtime"] in Cargo.toml.
  • Rust async context: extract_file is async. Use #[tokio::main] or call from an async context.
  • CLI --format vs --output-format: --format controls CLI output (text/json). --output-format controls content format (plain/markdown/djot/html).
  • Node.js extractFile signature: extractFile(path, mimeType?, config?) — mimeType is the second arg (pass null to skip).
  • Python detect_mime_type: The function for detecting from bytes is detect_mime_type(data). For paths use detect_mime_type_from_path(path).
  • Config file field names: Use snake_case in TOML/YAML/JSON config files (e.g., max_chars, max_overlap, pdf_options).
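
The naming pitfalls above mostly come down to snake_case in Python and config files versus camelCase in Node.js. The sketch below converts the keys of a snake_case config dict to camelCase. Both helper names are hypothetical. The conversion is checked only against the keys shown in this document (pdf_options, max_chars, max_overlap, output_format), and applying it to other fields is an assumption.

```python
def to_camel_case(name: str) -> str:
    """Convert a snake_case name such as "max_chars" to camelCase ("maxChars")."""
    head, *tail = name.split("_")
    return head + "".join(part.capitalize() for part in tail)


def camelize_keys(value):
    """Recursively rename dict keys to camelCase, leaving all values unchanged."""
    if isinstance(value, dict):
        return {to_camel_case(key): camelize_keys(item) for key, item in value.items()}
    if isinstance(value, list):
        return [camelize_keys(item) for item in value]
    return value


snake_config = {
    "ocr": {"backend": "tesseract", "language": "eng"},
    "pdf_options": {"passwords": ["secret123"]},
    "chunking": {"max_chars": 1000, "max_overlap": 200},
    "output_format": "markdown",
}
print(camelize_keys(snake_config))
```

Note that the Rust ChunkingConfig uses different field names (max_characters, overlap), so a mechanical case conversion does not cover the Rust API.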

Supported Formats (Summary)

| Category | Extensions |
| --- | --- |
| PDF | .pdf |
| Word | .docx, .odt |
| Spreadsheets | .xlsx, .xlsm, .xlsb, .xls, .xla, .xlam, .xltm, .ods |
| Presentations | .pptx, .ppt, .ppsx |
| eBooks | .epub, .fb2 |
| Images | .png, .jpg, .jpeg, .gif, .webp, .bmp, .tiff, .tif, .jp2, .jpx, .jpm, .mj2, .jbig2, .jb2, .pnm, .pbm, .pgm, .ppm, .svg |
| Markup | .html, .htm, .xhtml, .xml |
| Data | .json, .yaml, .yml, .toml, .csv, .tsv |
| Text | .txt, .md, .markdown, .djot, .rst, .org, .rtf |
| Email | .eml, .msg |
| Archives | .zip, .tar, .tgz, .gz, .7z |
| Academic | .bib, .biblatex, .ris, .nbib, .enw, .csl, .tex, .latex, .typ, .jats, .ipynb, .docbook, .opml, .pod, .mdoc, .troff |

See references/supported-formats.md for the complete format reference with MIME types.
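
Before sending a mixed directory to a batch call, it can be useful to sort files by category or skip ones that are clearly out of scope. The sketch below looks up a category by file extension. The mapping is a small, hand-copied subset of the summary table and the helper categorize is hypothetical. Kreuzberg's own MIME detection is the authoritative check.

```python
from pathlib import Path
from typing import Optional

# Partial mapping copied from the summary table; extend as needed.
CATEGORY_BY_EXTENSION = {
    ".pdf": "PDF",
    ".docx": "Word",
    ".odt": "Word",
    ".xlsx": "Spreadsheets",
    ".pptx": "Presentations",
    ".png": "Images",
    ".jpg": "Images",
    ".html": "Markup",
    ".eml": "Email",
    ".zip": "Archives",
}


def categorize(path: str) -> Optional[str]:
    """Return the table category for a path, or None if its extension is not listed here."""
    # Only the final suffix is considered, and the lookup ignores case.
    return CATEGORY_BY_EXTENSION.get(Path(path).suffix.lower())


for name in ["Report.PDF", "scans/page1.jpg", "notes.xyz"]:
    print(name, "->", categorize(name))
```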

Additional Resources

Detailed reference files for specific topics:

  • CLI Reference — All commands, flags, config precedence, exit codes

Full documentation: https://docs.kreuzberg.dev

GitHub: https://github.com/kreuzberg-dev/kreuzberg
