SKILL.md
Kreuzberg Document Extraction
Kreuzberg is a high-performance document intelligence library with a Rust core and native bindings for Python, Node.js/TypeScript, Ruby, Go, Java, C#, PHP, and Elixir. It extracts text, tables, metadata, and images from 91+ file formats including PDF, Office documents, images (with OCR), HTML, email, archives, and academic formats.
Use this skill when writing code that:
- Extracts text or metadata from documents
- Performs OCR on scanned documents or images
- Batch-processes multiple files
- Configures extraction options (output format, chunking, OCR, language detection)
- Implements custom plugins (post-processors, validators, OCR backends)
Installation
Python
pip install kreuzberg
# Optional OCR backends:
pip install "kreuzberg[easyocr]" # EasyOCR (quoted so shells such as zsh do not expand the brackets)
Node.js
npm install @kreuzberg/node
Rust
# Cargo.toml
[dependencies]
kreuzberg = { version = "4", features = ["tokio-runtime"] }
# features: tokio-runtime (required for sync + batch), pdf, ocr, chunking,
# embeddings, language-detection, keywords-yake, keywords-rake
CLI
# Download from GitHub releases, or:
cargo install kreuzberg-cli
Quick Start
Python (Async)
from kreuzberg import extract_file
result = await extract_file("document.pdf")
print(result.content) # extracted text
print(result.metadata) # document metadata
print(result.tables) # extracted tables
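The `await` in the snippet above only works inside a coroutine (or an async-aware REPL or notebook). A minimal sketch of driving it from a plain script with `asyncio.run`; the wrapper names `extract_text` and `main` are illustrative and not part of Kreuzberg:

```python
import asyncio


async def extract_text(path: str) -> str:
    """Return the extracted text of one document."""
    # Imported lazily so this module still imports where kreuzberg is absent.
    from kreuzberg import extract_file

    result = await extract_file(path)
    return result.content


def main(path: str = "document.pdf") -> None:
    # asyncio.run creates an event loop, runs the coroutine, then closes the loop.
    print(asyncio.run(extract_text(path)))
```

Calling `main()` from a script entry point is enough; code that is already async should `await extract_text(...)` directly instead of nesting another `asyncio.run`.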
Python (Sync)
from kreuzberg import extract_file_sync
result = extract_file_sync("document.pdf")
print(result.content)
Node.js
import { extractFile } from "@kreuzberg/node";
const result = await extractFile("document.pdf");
console.log(result.content);
console.log(result.metadata);
console.log(result.tables);
Node.js (Sync)
import { extractFileSync } from "@kreuzberg/node";
const result = extractFileSync("document.pdf");
Rust (Async)
use kreuzberg::{extract_file, ExtractionConfig};
#[tokio::main]
async fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig::default();
    let result = extract_file("document.pdf", None, &config).await?;
    println!("{}", result.content);
    Ok(())
}
Rust (Sync) — requires tokio-runtime feature
use kreuzberg::{extract_file_sync, ExtractionConfig};
fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig::default();
    let result = extract_file_sync("document.pdf", None, &config)?;
    println!("{}", result.content);
    Ok(())
}
CLI
kreuzberg extract document.pdf
kreuzberg extract document.pdf --format json
kreuzberg extract document.pdf --output-format markdown
Configuration
All languages use the same configuration structure with language-appropriate naming conventions.
Python (snake_case)
from kreuzberg import (
    ExtractionConfig, OcrConfig, TesseractConfig,
    PdfConfig, ChunkingConfig, extract_file,
)
config = ExtractionConfig(
    ocr=OcrConfig(
        backend="tesseract",
        language="eng",
        tesseract_config=TesseractConfig(psm=6, enable_table_detection=True),
    ),
    pdf_options=PdfConfig(passwords=["secret123"]),
    chunking=ChunkingConfig(max_chars=1000, max_overlap=200),
    output_format="markdown",
)
result = await extract_file("document.pdf", config=config)
Node.js (camelCase)
import { extractFile, type ExtractionConfig } from "@kreuzberg/node";
const config: ExtractionConfig = {
  ocr: { backend: "tesseract", language: "eng" },
  pdfOptions: { passwords: ["secret123"] },
  chunking: { maxChars: 1000, maxOverlap: 200 },
  outputFormat: "markdown",
};
const result = await extractFile("document.pdf", null, config);
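The Python and Node.js configs above carry the same keys and differ only in casing. A small sketch of that mapping, using plain dictionaries rather than Kreuzberg's config classes; `to_camel` and `camelize_keys` are hypothetical helpers written for illustration:

```python
def to_camel(name: str) -> str:
    """Convert one snake_case key to camelCase."""
    head, *rest = name.split("_")
    return head + "".join(part.capitalize() for part in rest)


def camelize_keys(value):
    """Recursively rename dictionary keys; other values pass through unchanged."""
    if isinstance(value, dict):
        return {to_camel(key): camelize_keys(item) for key, item in value.items()}
    if isinstance(value, list):
        return [camelize_keys(item) for item in value]
    return value


# Same settings as the Python example, written as a plain dict.
python_style = {
    "ocr": {"backend": "tesseract", "language": "eng"},
    "pdf_options": {"passwords": ["secret123"]},
    "chunking": {"max_chars": 1000, "max_overlap": 200},
    "output_format": "markdown",
}
node_style = camelize_keys(python_style)
```

The result matches the shape of the Node.js object literal above. The mapping is not universal: the Rust example that follows names its chunking fields `max_characters` and `overlap`.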
Rust (snake_case)
use kreuzberg::{extract_file, ExtractionConfig, OcrConfig, ChunkingConfig, OutputFormat};
let config = ExtractionConfig {
    ocr: Some(OcrConfig {
        backend: "tesseract".into(),
        language: "eng".into(),
        ..Default::default()
    }),
    chunking: Some(ChunkingConfig {
        max_characters: 1000,
        overlap: 200,
        ..Default::default()
    }),
    output_format: OutputFormat::Markdown,
    ..Default::default()
};
let result = extract_file("document.pdf", None, &config).await?;
Config File (TOML)
output_format = "markdown"
[ocr]
backend = "tesseract"
language = "eng"
[chunking]
max_chars = 1000
max_overlap = 200
[pdf_options]
passwords = ["secret123"]
# CLI: auto-discovers kreuzberg.toml in current/parent directories
kreuzberg extract doc.pdf
# or explicit:
kreuzberg extract doc.pdf --config kreuzberg.toml
kreuzberg extract doc.pdf --config-json '{"ocr":{"backend":"tesseract","language":"deu"}}'
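The CLI comment above says `kreuzberg.toml` is auto-discovered in the current or parent directories. A rough sketch of what such an upward search can look like; this is not Kreuzberg's lookup code, and the stop condition (the filesystem root) and the helper name `find_config` are assumptions:

```python
from pathlib import Path
from typing import Optional


def find_config(start: Path, filename: str = "kreuzberg.toml") -> Optional[Path]:
    """Return the nearest config file at or above `start`, or None if none exists."""
    start = start.resolve()
    # Check the starting directory first, then each parent up to the root.
    for directory in (start, *start.parents):
        candidate = directory / filename
        if candidate.is_file():
            return candidate
    return None
```

The nearest file wins in this sketch, so a project-level `kreuzberg.toml` would shadow one higher up the tree. Passing `--config` explicitly avoids depending on the working directory at all.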
Batch Processing
Python
from kreuzberg import batch_extract_files, batch_extract_files_sync
# Async
results = await batch_extract_files(["doc1.pdf", "doc2.docx", "doc3.xlsx"])
# Sync
results = batch_extract_files_sync(["doc1.pdf", "doc2.docx"])
for result in results:
    print(f"{len(result.content)} chars extracted")
Node.js
import { batchExtractFiles } from "@kreuzberg/node";
const results = await batchExtractFiles(["doc1.pdf", "doc2.docx"]);
Rust — requires tokio-runtime feature
use kreuzberg::{batch_extract_file, ExtractionConfig};
let config = ExtractionConfig::default();
let paths = vec!["doc1.pdf", "doc2.docx"];
let results = batch_extract_file(paths, &config).await?;
CLI
kreuzberg batch *.pdf --format json
kreuzberg batch docs/*.docx --output-format markdown
OCR
OCR runs automatically for images and scanned PDFs. Tesseract is the default backend (native binding, no external install required).
Backends
- Tesseract (default): Built-in native binding. All Tesseract languages supported.
- EasyOCR (Python only): `pip install kreuzberg[easyocr]`. Pass `easyocr_kwargs={"gpu": True}`.
- PaddleOCR (Python only): Bundled since 4.8.5, no extra install needed. Pass `paddleocr_kwargs={"use_angle_cls": True}`.
- Guten (Node.js only): Built-in OCR backend via `GutenOcrBackend`.
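The binding restrictions in the list above can be written down as a small lookup, which is handy when one codebase targets several bindings. This is only a transcription of that list: apart from `"tesseract"`, the dictionary keys are illustrative labels and not confirmed backend identifiers, and `backends_for` is a hypothetical helper.

```python
from typing import Dict, List, Optional

# Binding restriction per backend, transcribed from the list above.
# None means the list states no restriction for that backend.
BACKEND_BINDINGS: Dict[str, Optional[str]] = {
    "tesseract": None,      # default backend
    "easyocr": "python",    # Python only
    "paddleocr": "python",  # Python only
    "guten": "node",        # Node.js only
}


def backends_for(binding: str) -> List[str]:
    """List the backends the table allows for one binding, in table order."""
    return [
        name
        for name, only in BACKEND_BINDINGS.items()
        if only is None or only == binding
    ]
```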
Language Codes
config = ExtractionConfig(ocr=OcrConfig(language="eng")) # English
config = ExtractionConfig(ocr=OcrConfig(language="eng+deu")) # Multiple
config = ExtractionConfig(ocr=OcrConfig(language="all")) # All installed
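Multiple languages are joined with `+`, as in `"eng+deu"` above. A minimal sketch of building that string from a list of codes; `ocr_language` is a hypothetical helper, and it does not check that the codes are installed:

```python
def ocr_language(*codes: str) -> str:
    """Join language codes into the 'eng+deu' form used in the examples above."""
    cleaned = [code.strip() for code in codes]
    if not cleaned or any(not code for code in cleaned):
        raise ValueError("at least one non-empty language code is required")
    # dict.fromkeys drops duplicates while keeping the first-seen order.
    return "+".join(dict.fromkeys(cleaned))
```

The returned string can then be passed as `OcrConfig(language=ocr_language("eng", "deu"))`.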
Force OCR
config = ExtractionConfig(force_ocr=True) # OCR even if text is extractable
ExtractionResult Fields
| Field | Python | Node.js | Rust | Description |
|---|---|---|---|---|
| Text content | `result.content` | `result.content` | `result.content` | Extracted text (str/String) |
| MIME type | `result.mime_type` | `result.mimeType` | `result.mime_type` | Input document MIME type |
| Metadata | `result.metadata` | `result.metadata` | `result.metadata` | Document metadata (dict/object/HashMap) |
| Tables | `result.tables` | `result.tables` | `result.tables` | Extracted tables with cells + markdown |
| Languages | `result.detected_languages` | `result.detectedLanguages` | `result.detected_languages` | Detected languages (if enabled) |
| Chunks | `result.chunks` | `result.chunks` | `result.chunks` | Text chunks (if chunking enabled) |
| Images | `result.images` | `result.images` | `result.images` | Extracted images (if enabled) |
| Elements | `result.elements` | `result.elements` | `result.elements` | Semantic elements (if element_based format) |
| Pages | `result.pages` | `result.pages` | `result.pages` | Per-page content (if page extraction enabled) |
| Keywords | `result.keywords` | `result.keywords` | `result.keywords` | Extracted keywords (if enabled) |
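Several of the fields above are only populated when the matching feature is enabled, so code that reads them should tolerate missing values. A defensive sketch using the Python field names; whether a disabled field is `None`, empty, or absent is an assumption here, and the `SimpleNamespace` object merely stands in for a real result:

```python
from types import SimpleNamespace

# Fields the table marks as conditional ("if enabled").
OPTIONAL_FIELDS = (
    "detected_languages", "chunks", "images", "elements", "pages", "keywords",
)


def summarize(result) -> dict:
    """Count what a result contains without assuming optional fields are set."""
    summary = {
        "mime_type": result.mime_type,
        "chars": len(result.content),
        "tables": len(result.tables),
    }
    for name in OPTIONAL_FIELDS:
        value = getattr(result, name, None)  # missing attribute counts as empty
        summary[name] = len(value) if value else 0
    return summary


# Stand-in object for illustration; a real call would pass an extraction result.
fake = SimpleNamespace(
    content="hello", mime_type="application/pdf", tables=[],
    chunks=["he", "llo"], keywords=None,
)
report = summarize(fake)
```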
Error Handling
Python
from kreuzberg import (
    extract_file_sync, KreuzbergError, ParsingError,
    OCRError, ValidationError, MissingDependencyError,
)
try:
    result = extract_file_sync("file.pdf")
except ParsingError as e:
    print(f"Failed to parse: {e}")
except OCRError as e:
    print(f"OCR failed: {e}")
except ValidationError as e:
    print(f"Invalid input: {e}")
except MissingDependencyError as e:
    print(f"Missing dependency: {e}")
except KreuzbergError as e:
    print(f"Extraction failed: {e}")
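When many files are processed one by one, it is often more useful to collect failures than to stop at the first one. A sketch of that pattern with the extractor and the exception types injected as parameters, so it makes no claims about Kreuzberg's internals; `extract_each` is a hypothetical helper:

```python
from typing import Callable, Dict, Iterable, Tuple, Type


def extract_each(
    paths: Iterable[str],
    extract: Callable[[str], object],
    expected: Tuple[Type[BaseException], ...] = (Exception,),
) -> Tuple[Dict[str, object], Dict[str, BaseException]]:
    """Run `extract` on every path, separating results from expected failures."""
    results: Dict[str, object] = {}
    failures: Dict[str, BaseException] = {}
    for path in paths:
        try:
            results[path] = extract(path)
        except expected as exc:
            # Only the listed error types are recorded; anything else propagates.
            failures[path] = exc
    return results, failures
```

With Kreuzberg this would be called as `extract_each(paths, extract_file_sync, (KreuzbergError,))`, assuming, as the ordering of the `except` clauses above suggests, that `KreuzbergError` is the common base class.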
Node.js
import {
  extractFile,
  KreuzbergError,
  ParsingError,
  OcrError,
  ValidationError,
  MissingDependencyError,
} from "@kreuzberg/node";
try {
  const result = await extractFile("file.pdf");
} catch (e) {
  if (e instanceof ParsingError) {
    /* ... */
  } else if (e instanceof OcrError) {
    /* ... */
  } else if (e instanceof ValidationError) {
    /* ... */
  } else if (e instanceof KreuzbergError) {
    /* ... */
  }
}
Rust
use kreuzberg::{extract_file, ExtractionConfig, KreuzbergError};
let config = ExtractionConfig::default();
match extract_file("file.pdf", None, &config).await {
    Ok(result) => println!("{}", result.content),
    Err(KreuzbergError::Parsing(msg)) => eprintln!("Parse error: {msg}"),
    Err(KreuzbergError::Ocr(msg)) => eprintln!("OCR error: {msg}"),
    Err(e) => eprintln!("Error: {e}"),
}
Common Pitfalls
- Python ChunkingConfig fields: Use `max_chars` and `max_overlap`, NOT `max_characters` or `overlap`.
- Rust extract_file signature: The third argument is `&ExtractionConfig` (a reference), not `Option`. Use `&ExtractionConfig::default()` for defaults.
- Rust feature gates: `extract_file_sync`, `batch_extract_file`, and `batch_extract_file_sync` all require `features = ["tokio-runtime"]` in Cargo.toml.
- Rust async context: `extract_file` is async. Use `#[tokio::main]` or call it from an async context.
- CLI --format vs --output-format: `--format` controls CLI output (text/json). `--output-format` controls content format (plain/markdown/djot/html).
- Node.js extractFile signature: `extractFile(path, mimeType?, config?)`; mimeType is the second argument (pass `null` to skip).
- Python detect_mime_type: The function for detecting from bytes is `detect_mime_type(data)`. For paths, use `detect_mime_type_from_path(path)`.
- Config file field names: Use snake_case in TOML/YAML/JSON config files (e.g., `max_chars`, `max_overlap`, `pdf_options`).
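The two CLI flags in the list above are easy to mix up when commands are assembled from code. A sketch of a builder that keeps them apart and checks their values against the options named in that pitfall; `extract_command` is a hypothetical helper, and passing both flags in one invocation is an assumption, since the examples in this document only show them separately:

```python
import shlex
from typing import List, Optional

CLI_FORMATS = ("text", "json")                          # values for --format
CONTENT_FORMATS = ("plain", "markdown", "djot", "html")  # values for --output-format


def extract_command(
    path: str,
    cli_format: Optional[str] = None,
    content_format: Optional[str] = None,
) -> List[str]:
    """Build an argv list for `kreuzberg extract` without touching the shell."""
    command = ["kreuzberg", "extract", path]
    if cli_format is not None:
        if cli_format not in CLI_FORMATS:
            raise ValueError(f"--format must be one of {CLI_FORMATS}")
        command += ["--format", cli_format]
    if content_format is not None:
        if content_format not in CONTENT_FORMATS:
            raise ValueError(f"--output-format must be one of {CONTENT_FORMATS}")
        command += ["--output-format", content_format]
    return command


argv = extract_command("document.pdf", cli_format="json", content_format="markdown")
printable = shlex.join(argv)  # shell-quoted form, for logging only
```

The list form can be handed to `subprocess.run(argv, check=True)` directly, which avoids shell quoting problems with unusual file names.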
Supported Formats (Summary)
| Category | Extensions |
|---|---|
| PDF | .pdf |
| Word | .docx, .odt |
| Spreadsheets | .xlsx, .xlsm, .xlsb, .xls, .xla, .xlam, .xltm, .ods |
| Presentations | .pptx, .ppt, .ppsx |
| eBooks | .epub, .fb2 |
| Images | .png, .jpg, .jpeg, .gif, .webp, .bmp, .tiff, .tif, .jp2, .jpx, .jpm, .mj2, .jbig2, .jb2, .pnm, .pbm, .pgm, .ppm, .svg |
| Markup | .html, .htm, .xhtml, .xml |
| Data | .json, .yaml, .yml, .toml, .csv, .tsv |
| Text | .txt, .md, .markdown, .djot, .rst, .org, .rtf |
| Email | .eml, .msg |
| Archives | .zip, .tar, .tgz, .gz, .7z |
| Academic | .bib, .biblatex, .ris, .nbib, .enw, .csl, .tex, .latex, .typ, .jats, .ipynb, .docbook, .opml, .pod, .mdoc, .troff |
See references/supported-formats.md for the complete format reference with MIME types.
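For routing files before extraction, the summary above can be turned into a lookup by extension. The sketch below covers only a handful of entries from that summary, not the full format list, and it is not how Kreuzberg detects formats (the library offers MIME detection for that). The category labels for `.pdf` and `.eml` are inferred, and `category_for` is a hypothetical helper.

```python
from pathlib import Path
from typing import Optional

# A small subset of the summary above, keyed by lowercase extension.
CATEGORY_BY_EXTENSION = {
    ".pdf": "PDF",
    ".docx": "Word",
    ".odt": "Word",
    ".xlsx": "Spreadsheets",
    ".pptx": "Presentations",
    ".epub": "eBooks",
    ".png": "Images",
    ".jpg": "Images",
    ".html": "Markup",
    ".json": "Data",
    ".txt": "Text",
    ".md": "Text",
    ".eml": "Email",
    ".zip": "Archives",
    ".tex": "Academic",
}


def category_for(path: str) -> Optional[str]:
    """Return the summary category for a file name, or None if it is not listed."""
    return CATEGORY_BY_EXTENSION.get(Path(path).suffix.lower())
```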
Additional Resources
Detailed reference files for specific topics:
- Python API Reference — All functions, config classes, plugin protocols, exact signatures
- Node.js API Reference — All functions, TypeScript interfaces, worker pool APIs
- Rust API Reference — All functions with feature gates, structs, Cargo.toml examples
- CLI Reference — All commands, flags, config precedence, exit codes
- Configuration Reference — TOML/YAML/JSON formats, auto-discovery, env vars, full schema
- Supported Formats — All 85+ formats with file extensions and MIME types
- Advanced Features — Plugins, embeddings, MCP server, API server, security limits
- Other Language Bindings — Go, Ruby, Java, C#, PHP, Elixir, WASM, Docker
Full documentation: https://docs.kreuzberg.dev