document-processing

Process, extract, and manipulate PDF, Excel, Word, and PowerPoint documents programmatically. Supports four major office formats (PDF, XLSX, DOCX, PPTX) with format-specific tools: pypdf and pdfplumber for PDFs, openpyxl and pandas for Excel, python-docx for Word, and python-pptx for PowerPoint. Core operations include text and table extraction, document merging and splitting, format conversion, and OCR for scanned PDFs. Excel-specific guidance emphasizes writing formulas rather than static values for dynamic calculations, plus financial modeling conventions (color-coded text and fills). Word documents support tracked changes via XML editing for professional redlining. PowerPoint covers slide structure, speaker notes, and design principles for consistent layouts.

INSTALLATION
npx skills add https://github.com/eyadsibai/ltk --skill document-processing
Run in your project or agent environment. Adjust flags if your CLI version differs.

SKILL.md

Document Processing Guide

Work with office documents: PDF, Excel, Word, and PowerPoint.

Format Overview

| Format | Extension | Structure | Best For |
|---|---|---|---|
| PDF | .pdf | Binary/text | Reports, forms, archives |
| Excel | .xlsx | XML in ZIP | Data, calculations, models |
| Word | .docx | XML in ZIP | Text documents, contracts |
| PowerPoint | .pptx | XML in ZIP | Presentations, slides |

Key concept: XLSX, DOCX, and PPTX are all ZIP archives containing XML files. You can unzip them to access raw content.
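
As a minimal sketch of that idea, Python's standard `zipfile` module can open any of the three formats without unpacking it to disk. The helper names `list_package_parts` and `read_part` are illustrative, not part of any library.

```python
import zipfile


def list_package_parts(path):
    """Return the sorted names of the files stored inside an XLSX, DOCX, or PPTX."""
    with zipfile.ZipFile(path) as package:
        return sorted(package.namelist())


def read_part(path, part_name):
    """Return the raw bytes of one part, for example 'word/document.xml'."""
    with zipfile.ZipFile(path) as package:
        return package.read(part_name)
```

Listing the parts first is a cheap way to confirm which XML files a given document actually contains before editing any of them.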

PDF Processing

PDF Tools

| Task | Best Tool |
|---|---|
| Basic read/write | pypdf |
| Text extraction | pdfplumber |
| Table extraction | pdfplumber |
| Create PDFs | reportlab |
| OCR scanned PDFs | pytesseract + pdf2image |
| Command line | qpdf, pdftotext |

Common Operations

| Operation | Approach |
|---|---|
| Merge | Loop through files, add pages to writer |
| Split | Create new writer per page |
| Extract tables | Use pdfplumber, convert to DataFrame |
| Rotate | Call .rotate(degrees) on page |
| Encrypt | Use writer's .encrypt() method |
| OCR | Convert to images, run pytesseract |

Excel Processing

Excel Tools

| Task | Best Tool |
|---|---|
| Data analysis | pandas |
| Formulas & formatting | openpyxl |
| Simple CSV | pandas |
| Financial models | openpyxl |

Critical Rule: Use Formulas

| Approach | Result |
|---|---|
| Wrong: Calculate in Python, write value | Static number, breaks when data changes |
| Right: Write Excel formula | Dynamic, recalculates automatically |
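
A small sketch of the "right" approach with openpyxl, assuming the library is installed. The sheet layout and the function name `build_sales_sheet` are hypothetical. The point is that the total cell stores formula text rather than a number computed in Python.

```python
from openpyxl import Workbook


def build_sales_sheet(rows, output_path):
    """Write (label, amount) rows plus a SUM formula; return the total's cell address."""
    if not rows:
        raise ValueError("at least one data row is required")
    workbook = Workbook()
    sheet = workbook.active
    sheet.append(["Item", "Amount"])
    for label, amount in rows:
        sheet.append([label, amount])
    first_row, last_row = 2, len(rows) + 1
    total_row = last_row + 1
    sheet.cell(row=total_row, column=1, value="Total")
    # Store the formula text, not sum(amounts), so Excel recalculates on edit.
    sheet.cell(row=total_row, column=2, value=f"=SUM(B{first_row}:B{last_row})")
    workbook.save(output_path)
    return f"B{total_row}"
```

openpyxl writes the formula but does not evaluate it. The computed value appears once the file is opened in a spreadsheet application that recalculates.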

Financial Model Standards

| Convention | Meaning |
|---|---|
| Blue text | Hardcoded inputs |
| Black text | Formulas |
| Green text | Links to other sheets |
| Yellow fill | Needs attention |
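
One way these conventions might be applied with openpyxl is sketched below. The exact hex colors and the rule "a formula containing `!` links to another sheet" are assumptions made for illustration, and the helper names are hypothetical.

```python
from openpyxl import Workbook
from openpyxl.styles import Font, PatternFill

INPUT_FONT = Font(color="0000FF")    # blue: hardcoded inputs
FORMULA_FONT = Font(color="000000")  # black: formulas
LINK_FONT = Font(color="008000")     # green: links to other sheets
ATTENTION_FILL = PatternFill(fill_type="solid", fgColor="FFFF00")  # yellow


def font_for(value):
    """Pick a font for a cell value, or None to leave the cell unchanged."""
    if isinstance(value, str) and value.startswith("="):
        # Heuristic: "Sheet!A1" style references point at another sheet.
        return LINK_FONT if "!" in value else FORMULA_FONT
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return INPUT_FONT
    return None


def apply_model_colors(sheet):
    """Color every populated cell in the sheet according to font_for()."""
    for row in sheet.iter_rows():
        for cell in row:
            font = font_for(cell.value)
            if font is not None:
                cell.font = font


def flag_for_review(cell):
    """Mark a cell that needs attention with a yellow fill."""
    cell.fill = ATTENTION_FILL
```

Text labels are left untouched here, since the convention table only covers inputs, formulas, and cross-sheet links.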

Common Formula Errors

| Error | Cause |
|---|---|
| #REF! | Invalid cell reference |
| #DIV/0! | Division by zero |
| #VALUE! | Wrong data type |
| #NAME? | Unknown function name |
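
A minimal, library-free sketch of checking for these errors is shown below. It assumes the cell values have already been read as cached results, for instance via openpyxl's `load_workbook(path, data_only=True)`, which only returns values if the file was last saved by an application that calculated them. The function name is hypothetical.

```python
ERROR_CAUSES = {
    "#REF!": "Invalid cell reference",
    "#DIV/0!": "Division by zero",
    "#VALUE!": "Wrong data type",
    "#NAME?": "Unknown function name",
}


def find_formula_errors(cells):
    """Given (coordinate, value) pairs, return (coordinate, error, cause) tuples."""
    return [
        (coordinate, value, ERROR_CAUSES[value])
        for coordinate, value in cells
        if isinstance(value, str) and value in ERROR_CAUSES
    ]
```

For example, `find_formula_errors([("A1", 10), ("B2", "#DIV/0!")])` reports only `B2`, together with its likely cause from the table.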

Word Processing

Word Tools

| Task | Best Tool |
|---|---|
| Text extraction | pandoc |
| Create new | python-docx or docx-js |
| Simple edits | python-docx |
| Tracked changes | Direct XML editing |

Document Structure

| File | Contains |
|---|---|
| word/document.xml | Main content |
| word/comments.xml | Comments |
| word/media/ | Images |
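
To illustrate how `word/document.xml` holds the main content, here is a standard-library sketch that pulls paragraph text straight from the package. It is a rough reader, not a replacement for pandoc or python-docx, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET
import zipfile

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(path):
    """Return the visible text of each w:p paragraph in word/document.xml."""
    with zipfile.ZipFile(path) as package:
        root = ET.fromstring(package.read("word/document.xml"))
    paragraphs = []
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        # w:t holds live text; deleted text lives in w:delText and is skipped.
        text = "".join(node.text or "" for node in paragraph.iter(f"{{{W_NS}}}t"))
        paragraphs.append(text)
    return paragraphs
```

Because it walks every `w:p` element, paragraphs inside tables are included too. Comments are not, since they live in `word/comments.xml`.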

Tracked Changes (Redlining)

| Element | XML Tag |
|---|---|
| Deletion | `<w:del><w:r><w:delText>...</w:delText></w:r></w:del>` |
| Insertion | `<w:ins><w:r><w:t>...</w:t></w:r></w:ins>` |

Key concept: For professional/legal documents, use tracked changes XML rather than replacing text directly.

PowerPoint Processing

PowerPoint Tools

| Task | Best Tool |
|---|---|
| Text extraction | markitdown |
| Create new | pptxgenjs (JS) or python-pptx |
| Edit existing | Direct XML or python-pptx |

Slide Structure

| Path | Contains |
|---|---|
| ppt/slides/slide{N}.xml | Slide content |
| ppt/notesSlides/ | Speaker notes |
| ppt/slideMasters/ | Master templates |
| ppt/media/ | Images |
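
As a sketch of navigating this layout, the following standard-library helper lists the slide parts of a PPTX in numeric order. The function name is hypothetical. Note that the `{N}` in the file name is not guaranteed to match the on-screen slide order, which is defined in `ppt/presentation.xml`.

```python
import re
import zipfile

SLIDE_PART = re.compile(r"^ppt/slides/slide(\d+)\.xml$")


def slide_parts(path):
    """Return slide part names sorted by their numeric suffix (slide2 before slide10)."""
    found = []
    with zipfile.ZipFile(path) as package:
        for name in package.namelist():
            match = SLIDE_PART.match(name)
            if match:
                found.append((int(match.group(1)), name))
    return [name for _, name in sorted(found)]
```

Sorting on the parsed integer avoids the usual string-sort surprise where `slide10.xml` lands before `slide2.xml`.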

Design Principles

| Principle | Guideline |
|---|---|
| Fonts | Use web-safe: Arial, Helvetica, Georgia |
| Layout | Two-column preferred, avoid vertical stacking |
| Hierarchy | Size, weight, color for emphasis |
| Consistency | Repeat patterns across slides |

Converting Between Formats

| Conversion | Tool |
|---|---|
| Any → PDF | LibreOffice headless |
| PDF → Images | pdftoppm |
| DOCX → Markdown | pandoc |
| Any → Text | Appropriate extractor |
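
The conversions above are usually driven from the command line. The sketch below builds the argument lists in Python so they can be inspected before running. It assumes the LibreOffice binary is on the PATH as `soffice` (some systems name it `libreoffice`), and the helper names and the default DPI are illustrative.

```python
import subprocess


def to_pdf_command(source, out_dir):
    """Any -> PDF with LibreOffice in headless mode."""
    return ["soffice", "--headless", "--convert-to", "pdf", "--outdir", out_dir, source]


def pdf_to_png_command(pdf_path, output_prefix, dpi=150):
    """PDF -> one PNG per page with pdftoppm."""
    return ["pdftoppm", "-png", "-r", str(dpi), pdf_path, output_prefix]


def docx_to_markdown_command(docx_path, markdown_path):
    """DOCX -> Markdown with pandoc."""
    return ["pandoc", docx_path, "-o", markdown_path]


def run(command):
    """Execute a prepared command and raise if the tool exits with an error."""
    subprocess.run(command, check=True)
```

For example, `run(to_pdf_command("report.docx", "out"))` would write `out/report.pdf` if LibreOffice is installed. Passing arguments as a list avoids shell quoting problems with file names that contain spaces.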

Best Practices

| Practice | Why |
|---|---|
| Use formulas in Excel | Dynamic calculations |
| Preserve formatting on edit | Don't lose styles |
| Test output opens correctly | Catch corruption early |
| Use tracked changes for contracts | Audit trail |
| Extract to markdown for analysis | Easier to process |

Common Packages

| Language | Packages |
|---|---|
| Python | pypdf, pdfplumber, openpyxl, python-docx, python-pptx |
| JavaScript | docx, pptxgenjs |
| CLI | pandoc, qpdf, pdftotext, libreoffice |
