SKILL.md

Content-Hash File Cache Pattern

Cache expensive file processing results (PDF parsing, text extraction, image analysis) using SHA-256 content hashes as cache keys. Unlike path-based caching, this approach survives file moves/renames and auto-invalidates when content changes.

When to Activate

Building file processing pipelines (PDF, images, text extraction)

Processing cost is high and same files are processed repeatedly

Need a --cache/--no-cache CLI option

Want to add caching to existing pure functions without modifying them

Core Pattern

1. Content-Hash Based Cache Key

Use file content (not path) as the cache key:

import hashlib

from pathlib import Path

_HASH_CHUNK_SIZE = 65536  # 64KB chunks for large files

def compute_file_hash(path: Path) -> str:

    """SHA-256 of file contents (chunked for large files)."""

    if not path.is_file():

        raise FileNotFoundError(f"File not found: {path}")

    sha256 = hashlib.sha256()

    with open(path, "rb") as f:

        while True:

            chunk = f.read(_HASH_CHUNK_SIZE)

            if not chunk:

                break

            sha256.update(chunk)

    return sha256.hexdigest()

Why content hash? File rename/move = cache hit. Content change = automatic invalidation. No index file needed.

2. Frozen Dataclass for Cache Entry

from dataclasses import dataclass

@dataclass(frozen=True, slots=True)

class CacheEntry:

    file_hash: str

    source_path: str

    document: ExtractedDocument  # The cached result

3. File-Based Cache Storage

Each cache entry is stored as {hash}.json — O(1) lookup by hash, no index file required.

import json

from typing import Any

def write_cache(cache_dir: Path, entry: CacheEntry) -> None:

    cache_dir.mkdir(parents=True, exist_ok=True)

    cache_file = cache_dir / f"{entry.file_hash}.json"

    data = serialize_entry(entry)

    cache_file.write_text(json.dumps(data, ensure_ascii=False), encoding="utf-8")

def read_cache(cache_dir: Path, file_hash: str) -> CacheEntry | None:

    cache_file = cache_dir / f"{file_hash}.json"

    if not cache_file.is_file():

        return None

    try:

        raw = cache_file.read_text(encoding="utf-8")

        data = json.loads(raw)

        return deserialize_entry(data)

    except (json.JSONDecodeError, ValueError, KeyError):

        return None  # Treat corruption as cache miss

4. Service Layer Wrapper (SRP)

Keep the processing function pure. Add caching as a separate service layer.

def extract_with_cache(

    file_path: Path,

    *,

    cache_enabled: bool = True,

    cache_dir: Path = Path(".cache"),

) -> ExtractedDocument:

    """Service layer: cache check -> extraction -> cache write."""

    if not cache_enabled:

        return extract_text(file_path)  # Pure function, no cache knowledge

    file_hash = compute_file_hash(file_path)

    # Check cache

    cached = read_cache(cache_dir, file_hash)

    if cached is not None:

        logger.info("Cache hit: %s (hash=%s)", file_path.name, file_hash[:12])

        return cached.document

    # Cache miss -> extract -> store

    logger.info("Cache miss: %s (hash=%s)", file_path.name, file_hash[:12])

    doc = extract_text(file_path)

    entry = CacheEntry(file_hash=file_hash, source_path=str(file_path), document=doc)

    write_cache(cache_dir, entry)

    return doc

Key Design Decisions

Decision

Rationale

SHA-256 content hash

Path-independent, auto-invalidates on content change

{hash}.json file naming

O(1) lookup, no index file needed

Service layer wrapper

SRP: extraction stays pure, cache is a separate concern

Manual JSON serialization

Full control over frozen dataclass serialization

Corruption returns None

Graceful degradation, re-processes on next run

cache_dir.mkdir(parents=True)

Lazy directory creation on first write

Best Practices

Hash content, not paths — paths change, content identity doesn't

Chunk large files when hashing — avoid loading entire files into memory

Keep processing functions pure — they should know nothing about caching

Log cache hit/miss with truncated hashes for debugging

Handle corruption gracefully — treat invalid cache entries as misses, never crash

Anti-Patterns to Avoid

# BAD: Path-based caching (breaks on file move/rename)

cache = {"/path/to/file.pdf": result}

# BAD: Adding cache logic inside the processing function (SRP violation)

def extract_text(path, *, cache_enabled=False, cache_dir=None):

    if cache_enabled:  # Now this function has two responsibilities

        ...

# BAD: Using dataclasses.asdict() with nested frozen dataclasses

# (can cause issues with complex nested types)

data = dataclasses.asdict(entry)  # Use manual serialization instead

When to Use

File processing pipelines (PDF parsing, OCR, text extraction, image analysis)

CLI tools that benefit from --cache/--no-cache options

Batch processing where the same files appear across runs

Adding caching to existing pure functions without modifying them

When NOT to Use

Data that must always be fresh (real-time feeds)

Cache entries that would be extremely large (consider streaming instead)

Results that depend on parameters beyond file content (e.g., different extraction configs)

content-hash-cache-pattern

SKILL.md

Content-Hash File Cache Pattern

When to Activate

Core Pattern

1. Content-Hash Based Cache Key

2. Frozen Dataclass for Cache Entry

3. File-Based Cache Storage

4. Service Layer Wrapper (SRP)

Key Design Decisions

Best Practices

Anti-Patterns to Avoid

When to Use

When NOT to Use

Stop writing automation&scrapers

content-hash-cache-pattern

SKILL.md

Content-Hash File Cache Pattern

When to Activate

Core Pattern

1. Content-Hash Based Cache Key

2. Frozen Dataclass for Cache Entry

3. File-Based Cache Storage

4. Service Layer Wrapper (SRP)

Key Design Decisions

Best Practices

Anti-Patterns to Avoid

When to Use

When NOT to Use

Let your agent run on any real-world website

Related skills

Stop writing automation&scrapers