content-hash-cache-pattern

Cache expensive file processing results using SHA-256 content hashes instead of file paths. Content-hash keys survive file moves and auto-invalidate when content changes, eliminating path-based cache brittleness Store cache entries as individual {hash}.json files for O(1) lookup without requiring a separate index Implement caching as a service layer wrapper around pure processing functions, keeping extraction logic separate from cache concerns Handle cache corruption gracefully by treating invalid entries as misses and re-processing on the next run

INSTALLATION
npx skills add https://github.com/affaan-m/everything-claude-code --skill content-hash-cache-pattern
Run in your project or agent environment. Adjust flags if your CLI version differs.

SKILL.md

Content-Hash File Cache Pattern

Cache expensive file processing results (PDF parsing, text extraction, image analysis) using SHA-256 content hashes as cache keys. Unlike path-based caching, this approach survives file moves/renames and auto-invalidates when content changes.

When to Activate

  • Building file processing pipelines (PDF, images, text extraction)
  • Processing cost is high and same files are processed repeatedly
  • Need a --cache/--no-cache CLI option
  • Want to add caching to existing pure functions without modifying them

Core Pattern

1. Content-Hash Based Cache Key

Use file content (not path) as the cache key:

import hashlib

from pathlib import Path

_HASH_CHUNK_SIZE = 65536  # 64KB chunks for large files

def compute_file_hash(path: Path) -> str:

    """SHA-256 of file contents (chunked for large files)."""

    if not path.is_file():

        raise FileNotFoundError(f"File not found: {path}")

    sha256 = hashlib.sha256()

    with open(path, "rb") as f:

        while True:

            chunk = f.read(_HASH_CHUNK_SIZE)

            if not chunk:

                break

            sha256.update(chunk)

    return sha256.hexdigest()

Why content hash? File rename/move = cache hit. Content change = automatic invalidation. No index file needed.

2. Frozen Dataclass for Cache Entry

from dataclasses import dataclass

@dataclass(frozen=True, slots=True)

class CacheEntry:

    file_hash: str

    source_path: str

    document: ExtractedDocument  # The cached result

3. File-Based Cache Storage

Each cache entry is stored as {hash}.json — O(1) lookup by hash, no index file required.

import json

from typing import Any

def write_cache(cache_dir: Path, entry: CacheEntry) -> None:

    cache_dir.mkdir(parents=True, exist_ok=True)

    cache_file = cache_dir / f"{entry.file_hash}.json"

    data = serialize_entry(entry)

    cache_file.write_text(json.dumps(data, ensure_ascii=False), encoding="utf-8")

def read_cache(cache_dir: Path, file_hash: str) -> CacheEntry | None:

    cache_file = cache_dir / f"{file_hash}.json"

    if not cache_file.is_file():

        return None

    try:

        raw = cache_file.read_text(encoding="utf-8")

        data = json.loads(raw)

        return deserialize_entry(data)

    except (json.JSONDecodeError, ValueError, KeyError):

        return None  # Treat corruption as cache miss

4. Service Layer Wrapper (SRP)

Keep the processing function pure. Add caching as a separate service layer.

def extract_with_cache(

    file_path: Path,

    *,

    cache_enabled: bool = True,

    cache_dir: Path = Path(".cache"),

) -> ExtractedDocument:

    """Service layer: cache check -> extraction -> cache write."""

    if not cache_enabled:

        return extract_text(file_path)  # Pure function, no cache knowledge

    file_hash = compute_file_hash(file_path)

    # Check cache

    cached = read_cache(cache_dir, file_hash)

    if cached is not None:

        logger.info("Cache hit: %s (hash=%s)", file_path.name, file_hash[:12])

        return cached.document

    # Cache miss -> extract -> store

    logger.info("Cache miss: %s (hash=%s)", file_path.name, file_hash[:12])

    doc = extract_text(file_path)

    entry = CacheEntry(file_hash=file_hash, source_path=str(file_path), document=doc)

    write_cache(cache_dir, entry)

    return doc

Key Design Decisions

Decision

Rationale

SHA-256 content hash

Path-independent, auto-invalidates on content change

{hash}.json file naming

O(1) lookup, no index file needed

Service layer wrapper

SRP: extraction stays pure, cache is a separate concern

Manual JSON serialization

Full control over frozen dataclass serialization

Corruption returns None

Graceful degradation, re-processes on next run

cache_dir.mkdir(parents=True)

Lazy directory creation on first write

Best Practices

  • Hash content, not paths — paths change, content identity doesn't
  • Chunk large files when hashing — avoid loading entire files into memory
  • Keep processing functions pure — they should know nothing about caching
  • Log cache hit/miss with truncated hashes for debugging
  • Handle corruption gracefully — treat invalid cache entries as misses, never crash

Anti-Patterns to Avoid

# BAD: Path-based caching (breaks on file move/rename)

cache = {"/path/to/file.pdf": result}

# BAD: Adding cache logic inside the processing function (SRP violation)

def extract_text(path, *, cache_enabled=False, cache_dir=None):

    if cache_enabled:  # Now this function has two responsibilities

        ...

# BAD: Using dataclasses.asdict() with nested frozen dataclasses

# (can cause issues with complex nested types)

data = dataclasses.asdict(entry)  # Use manual serialization instead

When to Use

  • File processing pipelines (PDF parsing, OCR, text extraction, image analysis)
  • CLI tools that benefit from --cache/--no-cache options
  • Batch processing where the same files appear across runs
  • Adding caching to existing pure functions without modifying them

When NOT to Use

  • Data that must always be fresh (real-time feeds)
  • Cache entries that would be extremely large (consider streaming instead)
  • Results that depend on parameters beyond file content (e.g., different extraction configs)
BrowserAct

Let your agent run on any real-world website

Bypass CAPTCHA & anti-bot for free. Start local, scale to cloud.

Explore BrowserAct Skills →

Stop writing automation&scrapers

Install the CLI. Run your first Skill in 30 seconds. Scale when you're ready.

Start free
free · no credit card