warehouse-init

Initialize warehouse schema discovery. Generates .astro/warehouse.md with all table metadata for instant lookups. Run once per project, refresh when schema…

INSTALLATION
npx skills add https://github.com/astronomer/agents --skill warehouse-init
Run in your project or agent environment. Adjust flags if your CLI version differs.

SKILL.md

Initialize Warehouse Schema

Generate a comprehensive, user-editable schema reference file for the data warehouse.

Scripts: ../analyzing-data/scripts/ — All CLI commands below are relative to the analyzing-data skill's directory. Before running any scripts/cli.py command, cd to ../analyzing-data/ relative to this file.

What This Does

  • Discovers all databases, schemas, tables, and columns from the warehouse
  • Enriches with codebase context (dbt models, gusty SQL, schema docs)
  • Records row counts and identifies large tables
  • Generates .astro/warehouse.md - a version-controllable, team-shareable reference
  • Enables instant concept→table lookups without warehouse queries

Process

Step 1: Read Warehouse Configuration

cat ~/.astro/agents/warehouse.yml

Get the list of databases to discover (e.g., databases: [HQ, ANALYTICS, RAW]).

Step 2: Search Codebase for Context (Parallel)

Launch a subagent to find business context in code:

Task(

    subagent_type="Explore",

    prompt="""

    Search for data model documentation in the codebase:

    1. dbt models: **/models/**/*.yml, **/schema.yml

       - Extract table descriptions, column descriptions

       - Note primary keys and tests

    2. Gusty/declarative SQL: **/dags/**/*.sql with YAML frontmatter

       - Parse frontmatter for: description, primary_key, tests

       - Note schema mappings

    3. AGENTS.md or CLAUDE.md files with data layer documentation

    Return a mapping of:

      table_name -> {description, primary_key, important_columns, layer}

    """

)

Step 3: Parallel Warehouse Discovery

Launch one subagent per database using the Task tool:

For each database in configured_databases:

    Task(

        subagent_type="general-purpose",

        prompt="""

        Discover all metadata for database {DATABASE}.

        Use the CLI to run SQL queries:

        # Scripts are relative to ../analyzing-data/

        uv run scripts/cli.py exec "df = run_sql('...')"

        uv run scripts/cli.py exec "print(df)"

        1. Query schemas:

           SELECT SCHEMA_NAME FROM {DATABASE}.INFORMATION_SCHEMA.SCHEMATA

        2. Query tables with row counts:

           SELECT TABLE_SCHEMA, TABLE_NAME, ROW_COUNT, COMMENT

           FROM {DATABASE}.INFORMATION_SCHEMA.TABLES

           ORDER BY TABLE_SCHEMA, TABLE_NAME

        3. For important schemas (MODEL_*, METRICS_*, MART_*), query columns:

           SELECT TABLE_NAME, COLUMN_NAME, DATA_TYPE, COMMENT

           FROM {DATABASE}.INFORMATION_SCHEMA.COLUMNS

           WHERE TABLE_SCHEMA = 'X'

        Return a structured summary:

        - Database name

        - List of schemas with table counts

        - For each table: name, row_count, key columns

        - Flag any tables with >100M rows as "large"

        """

    )

Run all subagents in parallel (single message with multiple Task calls).

Step 4: Discover Categorical Value Families

For key categorical columns (like OPERATOR, STATUS, TYPE, FEATURE), discover value families:

uv run cli.py exec "df = run_sql('''

SELECT DISTINCT column_name, COUNT(*) as occurrences

FROM table

WHERE column_name IS NOT NULL

GROUP BY column_name

ORDER BY occurrences DESC

LIMIT 50

''')"

uv run cli.py exec "print(df)"

Group related values into families by common prefix/suffix (e.g., Export* for ExportCSV, ExportJSON, ExportParquet).

Step 5: Merge Results

Combine warehouse metadata + codebase context:

  • Quick Reference table - concept → table mappings (pre-populated from code if found)
  • Categorical Columns - value families for key filter columns
  • Database sections - one per database
  • Schema subsections - tables grouped by schema
  • Table details - columns, row counts, descriptions from code, warnings

Step 6: Generate warehouse.md

Write the file to:

  • .astro/warehouse.md (default - project-specific, version-controllable)
  • ~/.astro/agents/warehouse.md (if --global flag)

Output Format

# Warehouse Schema

> Generated by `/astronomer-data:warehouse-init` on {DATE}. Edit freely to add business context.

## Quick Reference

| Concept | Table | Key Column | Date Column |

|---------|-------|------------|-------------|

| customers | HQ.MODEL_ASTRO.ORGANIZATIONS | ORG_ID | CREATED_AT |

<!-- Add your concept mappings here -->

## Categorical Columns

When filtering on these columns, explore value families first (values often have variants):

| Table | Column | Value Families |

|-------|--------|----------------|

| {TABLE} | {COLUMN} | `{PREFIX}*` ({VALUE1}, {VALUE2}, ...) |

<!-- Populated by /astronomer-data:warehouse-init from actual warehouse data -->

## Data Layer Hierarchy

Query downstream first: `reporting` > `mart_*` > `metric_*` > `model_*` > `IN_*`

| Layer | Prefix | Purpose |

|-------|--------|---------|

| Reporting | `reporting.*` | Dashboard-optimized |

| Mart | `mart_*` | Combined analytics |

| Metric | `metric_*` | KPIs at various grains |

| Model | `model_*` | Cleansed sources of truth |

| Raw | `IN_*` | Source data - avoid |

## {DATABASE} Database

### {SCHEMA} Schema

#### {TABLE_NAME}

{DESCRIPTION from code if found}

| Column | Type | Description |

|--------|------|-------------|

| COL1 | VARCHAR | {from code or inferred} |

- **Rows:** {ROW_COUNT}

- **Key column:** {PRIMARY_KEY from code or inferred}

{IF ROW_COUNT > 100M: - **⚠️ WARNING:** Large table - always add date filters}

## Relationships

{Inferred relationships based on column names like *_ID}

Command Options

Option

Effect

/astronomer-data:warehouse-init

Generate .astro/warehouse.md

/astronomer-data:warehouse-init --refresh

Regenerate, preserving user edits

/astronomer-data:warehouse-init --database HQ

Only discover specific database

/astronomer-data:warehouse-init --global

Write to ~/.astro/agents/ instead

Step 7: Pre-populate Cache

After generating warehouse.md, populate the concept cache:

# Scripts are relative to ../analyzing-data/

uv run cli.py concept import -p .astro/warehouse.md

uv run cli.py concept learn customers HQ.MART_CUST.CURRENT_ASTRO_CUSTS -k ACCT_ID

Step 8: Offer CLAUDE.md Integration (Ask User)

Ask the user:

Would you like to add the Quick Reference table to your CLAUDE.md file?

This ensures the schema mappings are always in context for data queries, improving accuracy from ~25% to ~100% for complex queries.

Options:

  • Yes, add to CLAUDE.md (Recommended) - Append Quick Reference section
  • No, skip - Use warehouse.md and cache only

If user chooses Yes:

  • Check if .claude/CLAUDE.md or CLAUDE.md exists
  • If exists, append the Quick Reference section (avoid duplicates)
  • If not exists, create .claude/CLAUDE.md with just the Quick Reference

Quick Reference section to add:

## Data Warehouse Quick Reference

When querying the warehouse, use these table mappings:

| Concept | Table | Key Column | Date Column |

|---------|-------|------------|-------------|

{rows from warehouse.md Quick Reference}

**Large tables (always filter by date):** {list tables with >100M rows}

> Auto-generated by `/astronomer-data:warehouse-init`. Run `/astronomer-data:warehouse-init --refresh` to update.

If yes: Append the Quick Reference section to .claude/CLAUDE.md or CLAUDE.md.

After Generation

Tell the user:

Generated .astro/warehouse.md

Summary:

  - {N} databases, {N} schemas, {N} tables

  - {N} tables enriched with code descriptions

  - {N} concepts cached for instant lookup

Next steps:

  1. Edit .astro/warehouse.md to add business context

  2. Commit to version control

  3. Run /astronomer-data:warehouse-init --refresh when schema changes

Refresh Behavior

When --refresh is specified:

  • Read existing warehouse.md
  • Preserve all HTML comments (<!-- ... -->)
  • Preserve Quick Reference table entries (user-added)
  • Preserve user-added descriptions
  • Update row counts and add new tables
  • Mark removed tables with <!-- REMOVED --> comment

Cache Staleness &#x26; Schema Drift

The runtime cache has a 7-day TTL by default. After 7 days, cached entries expire and will be re-discovered on next use.

When to Refresh

Run /astronomer-data:warehouse-init --refresh when:

  • Schema changes: Tables added, renamed, or removed
  • Column changes: New columns added or types changed
  • After deployments: If your data pipeline deploys schema migrations
  • Weekly: As a good practice, even if no known changes

Signs of Stale Cache

Watch for these indicators:

  • Queries fail with "table not found" errors
  • Results seem wrong or outdated
  • New tables aren't being discovered

Manual Cache Reset

If you suspect cache issues:

# Scripts are relative to ../analyzing-data/

uv run scripts/cli.py cache status

uv run scripts/cli.py cache clear --stale-only

uv run scripts/cli.py cache clear

Codebase Patterns Recognized

Pattern

Source

What We Extract

**/models/**/*.yml

dbt

table/column descriptions, tests

**/dags/**/*.sql

gusty

YAML frontmatter (description, primary_key)

AGENTS.md, CLAUDE.md

docs

data layer hierarchy, conventions

**/docs/**/*.md

docs

business context

Example Session

User: /astronomer-data:warehouse-init

Agent:

→ Reading warehouse configuration...

→ Found 1 warehouse with databases: HQ, PRODUCT

→ Searching codebase for data documentation...

  Found: AGENTS.md with data layer hierarchy

  Found: 45 SQL files with YAML frontmatter in dags/declarative/

→ Launching parallel warehouse discovery...

  [Database: HQ] Discovering schemas...

  [Database: PRODUCT] Discovering schemas...

→ HQ: Found 29 schemas, 401 tables

→ PRODUCT: Found 1 schema, 0 tables

→ Merging warehouse metadata with code context...

  Enriched 45 tables with descriptions from code

→ Generated .astro/warehouse.md

Summary:

  - 2 databases

  - 30 schemas

  - 401 tables

  - 45 tables enriched with code descriptions

  - 8 large tables flagged (>100M rows)

Next steps:

  1. Review .astro/warehouse.md

  2. Add concept mappings to Quick Reference

  3. Commit to version control

  4. Run /astronomer-data:warehouse-init --refresh when schema changes
BrowserAct

Let your agent run on any real-world website

Bypass CAPTCHA & anti-bot for free. Start local, scale to cloud.

Explore BrowserAct Skills →

Stop writing automation&scrapers

Install the CLI. Run your first Skill in 30 seconds. Scale when you're ready.

Start free
free · no credit card