SKILL.md

$27

Monthly aggregation with growth:

WITH monthly AS (

    SELECT

        date_trunc('month', created_at) AS month,

        COUNT(*)                        AS total_orders,

        COUNT(DISTINCT customer_id)     AS unique_customers,

        SUM(amount)                     AS revenue

    FROM orders

    WHERE created_at >= '2024-01-01'

    GROUP BY 1

),

growth AS (

    SELECT month, revenue,

        LAG(revenue) OVER (ORDER BY month) AS prev_revenue

    FROM monthly

)

SELECT month, revenue,

    ROUND((revenue - prev_revenue) / prev_revenue * 100, 1) AS growth_pct

FROM growth

ORDER BY month;

Cohort retention:

WITH first_orders AS (

    SELECT customer_id,

        date_trunc('month', MIN(created_at)) AS cohort_month

    FROM orders GROUP BY 1

),

cohort_data AS (

    SELECT f.cohort_month,

        date_trunc('month', o.created_at) AS order_month,

        COUNT(DISTINCT o.customer_id)     AS customers

    FROM orders o

    JOIN first_orders f ON o.customer_id = f.customer_id

    GROUP BY 1, 2

)

SELECT cohort_month, order_month,

    EXTRACT(MONTH FROM AGE(order_month, cohort_month)) AS months_since,

    customers

FROM cohort_data ORDER BY 1, 2;

Window functions (running total + previous order):

SELECT customer_id, order_date, amount,

    SUM(amount) OVER (PARTITION BY customer_id ORDER BY order_date) AS running_total,

    LAG(amount) OVER (PARTITION BY customer_id ORDER BY order_date) AS prev_amount

FROM orders;

Chart Selection Matrix

Data question

Best chart

Alternative

Trend over time

Line

Area

Part of whole

Donut

Stacked bar

Comparison

Bar

Column

Distribution

Histogram

Box plot

Correlation

Scatter

Heatmap

Geographic

Choropleth

Bubble map

Design rules: Start Y-axis at zero for bar charts. Use <= 7 colors. Label axes. Include benchmarks or targets for context. Avoid 3D charts and pie charts with > 5 slices.

Dashboard Layout

+------------------------------------------------------------+

| KPI CARDS: Revenue | Customers | Conversion | NPS           |

+------------------------------------------------------------+

| TREND (line chart)            | BREAKDOWN (bar chart)       |

+-------------------------------+-----------------------------+

| COMPARISON vs target/LY      | DETAIL TABLE (top N)        |

+-------------------------------+-----------------------------+

Statistical Methods

Hypothesis testing (t-test):

from scipy import stats

import numpy as np

def compare_groups(a: np.ndarray, b: np.ndarray, alpha: float = 0.05) -> dict:

    """Compare two groups; return t-stat, p-value, Cohen's d, and significance."""

    stat, p = stats.ttest_ind(a, b)

    d = (a.mean() - b.mean()) / np.sqrt((a.std()**2 + b.std()**2) / 2)

    return {"t_statistic": stat, "p_value": p, "cohens_d": d, "significant": p < alpha}

Chi-square test for independence:

def test_independence(table, alpha=0.05):

    chi2, p, dof, _ = stats.chi2_contingency(table)

    return {"chi2": chi2, "p_value": p, "dof": dof, "significant": p < alpha}

Key Business Metrics

Insight Delivery Template

## [Headline: action-oriented finding]

**What:** One-sentence description of the observation.

**So What:** Why this matters to the business (revenue, retention, cost).

**Now What:** Recommended action with expected impact.

**Evidence:** [Chart or table supporting the finding]

**Confidence:** High / Medium / Low

Analysis Framework

# Analysis: [Topic]

## Business Question -- What are we trying to answer?

## Hypothesis -- What do we expect to find?

## Data Sources -- [Source]: [Description]

## Methodology -- Numbered steps

## Findings -- Finding 1, Finding 2 (with supporting data)

## Recommendations -- [Action]: [Expected impact]

## Limitations -- Known caveats

## Next Steps -- Follow-up actions

Reference Materials

references/sql_patterns.md -- Advanced SQL queries

references/visualization.md -- Chart selection guide

references/statistics.md -- Statistical methods

references/storytelling.md -- Presentation best practices

Scripts

python scripts/query_optimizer.py --file query.sql

python scripts/query_optimizer.py --sql "SELECT * FROM orders" --json

python scripts/data_profiler.py --file sales.csv

python scripts/data_profiler.py --file data.json --top 10 --json

python scripts/report_generator.py --file sales.csv --title "Monthly Sales Report"

python scripts/report_generator.py --file data.csv --group-by region --format markdown --json

Tool Reference

Tool

Purpose

Key Flags

query_optimizer.py

Analyze SQL for anti-patterns: SELECT *, missing WHERE, cartesian joins, deep nesting, function-on-column in WHERE

--file <sql> or --sql "<query>", --json

data_profiler.py

Profile CSV/JSON datasets with per-column stats, null rates, outlier detection (IQR), and quality flags

--file <csv/json>, --top <n>, --json

report_generator.py

Generate summary reports with numeric aggregations, group-by breakdowns, and highlights

--file <csv/json>, --title, --group-by <col>, --format text/markdown, --json

Troubleshooting

Problem

Likely Cause

Resolution

SQL query runs for minutes on a table with indexes

Query uses functions on indexed columns in WHERE clause (e.g., WHERE UPPER(name) = ...)

Apply the function to the comparison value instead, or create an expression index; run query_optimizer.py to detect this pattern

data_profiler.py flags HIGH_NULL_RATE on expected optional fields

The tool flags any column with > 50% nulls regardless of business intent

Review flagged columns; suppress false positives by filtering the output or documenting expected null rates

Cohort retention query returns duplicate customers

JOIN logic counts the same customer multiple times across order items

Ensure COUNT(DISTINCT customer_id) is used and the cohort grain is correct

Bar chart Y-axis exaggerates differences

Y-axis does not start at zero

Always start bar-chart Y-axis at zero; use line charts when the baseline is not meaningful

Stakeholders challenge statistical significance

Sample size is too small or alpha threshold is unclear

Pre-register the hypothesis, calculate required sample size before analysis, and report confidence intervals alongside p-values

report_generator.py shows unexpected column as numeric

Column contains mostly numbers but includes some text codes

Clean the data upstream or pre-filter; the tool treats a column as numeric when > 80% of values parse as floats

EXPLAIN ANALYZE shows sequential scan despite index existence

Query predicates do not match the index columns or the table is too small for the planner to prefer an index

Verify index column order matches query predicates; for small tables, sequential scan may actually be faster

Success Criteria

Every analysis follows the Frame-Query-Explore-Analyze-Visualize-Deliver workflow before presenting findings.

SQL queries pass query_optimizer.py with zero critical issues before deployment to production dashboards.

Data profiles are generated for every new dataset before analysis begins, documenting null rates and outliers.

Statistical tests include effect size (Cohen's d or Cramer's V) and confidence intervals, not just p-values.

Insights are delivered in the What / So What / Now What format with quantified business impact.

Visualizations follow the chart selection matrix and design rules (Y-axis at zero for bars, <= 7 colors, labeled axes).

Reports generated by report_generator.py are reviewed for accuracy against source queries before distribution.

Scope & Limitations

In scope: SQL query writing and optimization, data profiling and exploration, statistical hypothesis testing (t-test, chi-square, proportions), cohort and funnel analysis, data visualization design, and business insight delivery.

Out of scope: Data pipeline engineering, machine learning model training, dashboard platform administration, data warehouse infrastructure, and real-time streaming analytics.

Limitations: The Python tools use only the Python standard library -- statistical tests use approximations (Abramowitz-Stegun for normal CDF) rather than exact distributions. For production-grade statistics, use scipy or statsmodels. query_optimizer.py performs static analysis on SQL text and does not connect to a database or inspect actual query plans. data_profiler.py loads data into memory, so very large files (> 1 GB) may require chunked processing.

Integration Points

Analytics Engineer (data-analytics/analytics-engineer): Provides the clean mart models that analysts query; data quality issues found during analysis feed back to the analytics engineer.

Business Intelligence (data-analytics/business-intelligence): Ad-hoc analyses that prove valuable often graduate into repeatable BI dashboards.

Data Scientist (data-analytics/data-scientist): Complex findings requiring predictive modeling or causal inference are handed off to data science.

Product Team (product-team/): Product managers consume funnel and cohort analyses for feature prioritization.

Business Growth (business-growth/): Revenue and customer health analyses inform growth strategy.

data-analyst

SKILL.md

Chart Selection Matrix

Dashboard Layout

Statistical Methods

Key Business Metrics

Insight Delivery Template

Analysis Framework

Reference Materials

Scripts

Tool Reference

Troubleshooting

Success Criteria

Scope & Limitations

Integration Points

Stop writing automation&scrapers

data-analyst

SKILL.md

Chart Selection Matrix

Dashboard Layout

Statistical Methods

Key Business Metrics

Insight Delivery Template

Analysis Framework

Reference Materials

Scripts

Tool Reference

Troubleshooting

Success Criteria

Scope &#x26; Limitations

Integration Points

Let your agent run on any real-world website

Related skills

Stop writing automation&scrapers

Scope & Limitations