SKILL.md
Document Processing Guide
Work with office documents: PDF, Excel, Word, and PowerPoint.
Format Overview
Format
Extension
Structure
Best For
Binary/text
Reports, forms, archives
Excel
.xlsx
XML in ZIP
Data, calculations, models
Word
.docx
XML in ZIP
Text documents, contracts
PowerPoint
.pptx
XML in ZIP
Presentations, slides
Key concept: XLSX, DOCX, and PPTX are all ZIP archives containing XML files. You can unzip them to access raw content.
PDF Processing
PDF Tools
Task
Best Tool
Basic read/write
pypdf
Text extraction
pdfplumber
Table extraction
pdfplumber
Create PDFs
reportlab
OCR scanned PDFs
pytesseract + pdf2image
Command line
qpdf, pdftotext
Common Operations
Operation
Approach
Merge
Loop through files, add pages to writer
Split
Create new writer per page
Extract tables
Use pdfplumber, convert to DataFrame
Rotate
Call .rotate(degrees) on page
Encrypt
Use writer's .encrypt() method
OCR
Convert to images, run pytesseract
Excel Processing
Excel Tools
Task
Best Tool
Data analysis
pandas
Formulas & formatting
openpyxl
Simple CSV
pandas
Financial models
openpyxl
Critical Rule: Use Formulas
Approach
Result
Wrong: Calculate in Python, write value
Static number, breaks when data changes
Right: Write Excel formula
Dynamic, recalculates automatically
Financial Model Standards
Convention
Meaning
Blue text
Hardcoded inputs
Black text
Formulas
Green text
Links to other sheets
Yellow fill
Needs attention
Common Formula Errors
Error
Cause
#REF!
Invalid cell reference
#DIV/0!
Division by zero
#VALUE!
Wrong data type
#NAME?
Unknown function name
Word Processing
Word Tools
Task
Best Tool
Text extraction
pandoc
Create new
python-docx or docx-js
Simple edits
python-docx
Tracked changes
Direct XML editing
Document Structure
File
Contains
word/document.xml
Main content
word/comments.xml
Comments
word/media/
Images
Tracked Changes (Redlining)
Element
XML Tag
Deletion
<w:del><w:delText>...</w:delText></w:del>
Insertion
<w:ins><w:t>...</w:t></w:ins>
Key concept: For professional/legal documents, use tracked changes XML rather than replacing text directly.
PowerPoint Processing
PowerPoint Tools
Task
Best Tool
Text extraction
markitdown
Create new
pptxgenjs (JS) or python-pptx
Edit existing
Direct XML or python-pptx
Slide Structure
Path
Contains
ppt/slides/slide{N}.xml
Slide content
ppt/notesSlides/
Speaker notes
ppt/slideMasters/
Master templates
ppt/media/
Images
Design Principles
Principle
Guideline
Fonts
Use web-safe: Arial, Helvetica, Georgia
Layout
Two-column preferred, avoid vertical stacking
Hierarchy
Size, weight, color for emphasis
Consistency
Repeat patterns across slides
Converting Between Formats
Conversion
Tool
Any → PDF
LibreOffice headless
PDF → Images
pdftoppm
DOCX → Markdown
pandoc
Any → Text
Appropriate extractor
Best Practices
Practice
Why
Use formulas in Excel
Dynamic calculations
Preserve formatting on edit
Don't lose styles
Test output opens correctly
Catch corruption early
Use tracked changes for contracts
Audit trail
Extract to markdown for analysis
Easier to process
Common Packages
Language
Packages
Python
pypdf, pdfplumber, openpyxl, python-docx, python-pptx
JavaScript
docx, pptxgenjs
CLI
pandoc, qpdf, pdftotext, libreoffice