SKILL.md
$27
Feature
SQL (Relational)
Distributed NoSQL (Cassandra/DynamoDB)
Data modeling
Model Entities + Relationships
Model Queries (Access Patterns)
Joins
CPU-intensive, at read time
Pre-computed (Denormalized) at write time
Storage cost
Expensive (minimize duplication)
Cheap (duplicate data for read speed)
Consistency
ACID (Strong)
BASE (Eventual) / Tunable
Scalability
Vertical (Bigger machine)
Horizontal (More nodes/shards)
The Golden Rule: In SQL, you design the data model to answer any query. In NoSQL, you design the data model to answer specific queries efficiently.
Core Design Patterns
1. Query-First Modeling (Access Patterns)
You typically cannot "add a query later" without migration or creating a new table/index.
Process:
- List all Entities (User, Order, Product).
- List all Access Patterns ("Get User by Email", "Get Orders by User sorted by Date").
- Design Table(s) specifically to serve those patterns with a single lookup.
2. The Partition Key is King
Data is distributed across physical nodes based on the Partition Key (PK).
- Goal: Even distribution of data and traffic.
- Anti-Pattern: Using a low-cardinality PK (e.g.,
status="active"orgender="m") creates Hot Partitions, limiting throughput to a single node's capacity.
- Best Practice: Use high-cardinality keys (User IDs, Device IDs, Composite Keys).
3. Clustering / Sort Keys
Within a partition, data is sorted on disk by the Clustering Key (Cassandra) or Sort Key (DynamoDB).
- This allows for efficient Range Queries (e.g.,
WHERE user_id=X AND date > Y).
- It effectively pre-sorts your data for specific retrieval requirements.
4. Single-Table Design (Adjacency Lists)
Primary use: DynamoDB (but concepts apply elsewhere)
Storing multiple entity types in one table to enable pre-joined reads.
PK (Partition)
SK (Sort)
Data Fields...
USER#123
PROFILE
{ name: "Ian", email: "..." }
USER#123
ORDER#998
{ total: 50.00, status: "shipped" }
USER#123
ORDER#999
{ total: 12.00, status: "pending" }
- Query:
PK="USER#123"
- Result: Fetches User Profile AND all Orders in one network request.
5. Denormalization & Duplication
Don't be afraid to store the same data in multiple tables to serve different query patterns.
- Table A:
users_by_id(PK: uuid)
- Table B:
users_by_email(PK: email)
Trade-off: You must manage data consistency across tables (often using eventual consistency or batch writes).
Specific Guidance
Apache Cassandra / ScyllaDB
- Primary Key Structure:
((Partition Key), Clustering Columns)
- No Joins, No Aggregates: Do not try to
JOINorGROUP BY. Pre-calculate aggregates in a separate counter table.
- **Avoid
ALLOW FILTERING:** If you see this in production, your data model is wrong. It implies a full cluster scan.
- Writes are Cheap: Inserts and Updates are just appends to the LSM tree. Don't worry about write volume as much as read efficiency.
- Tombstones: Deletes are expensive markers. Avoid high-velocity delete patterns (like queues) in standard tables.
AWS DynamoDB
- GSI (Global Secondary Index): Use GSIs to create alternative views of your data (e.g., "Search Orders by Date" instead of by User).
- Note: GSIs are eventually consistent.
- LSI (Local Secondary Index): Sorts data differently within the same partition. Must be created at table creation time.
- WCU / RCU: Understand capacity modes. Single-table design helps optimize consumed capacity units.
- TTL: Use Time-To-Live attributes to automatically expire old data (free delete) without creating tombstones.
Expert Checklist
Before finalizing your NoSQL schema:
- Access Pattern Coverage: Does every query pattern map to a specific table or index?
- Cardinality Check: Does the Partition Key have enough unique values to spread traffic evenly?
- Split Partition Risk: For any single partition (e.g., a single user's orders), will it grow indefinitely? (If > 10GB, you need to "shard" the partition, e.g.,
USER#123#2024-01).
- Consistency Requirement: Can the application tolerate eventual consistency for this read pattern?
Common Anti-Patterns
❌ Scatter-Gather: Querying all partitions to find one item (Scan).
❌ Hot Keys: Putting all "Monday" data into one partition.
❌ Relational Modeling: Creating Author and Book tables and trying to join them in code. (Instead, embed Book summaries in Author, or duplicate Author info in Books).
Limitations
- Use this skill only when the task clearly matches the scope described above.
- Do not treat the output as a substitute for environment-specific validation, testing, or expert review.
- Stop and ask for clarification if required inputs, permissions, safety boundaries, or success criteria are missing.