bigquery-pipeline-audit

Audits Python + BigQuery pipelines for cost safety, idempotency, and production readiness with exact patch locations. Analyzes every BigQuery job trigger and external API call to identify cost exposure, loop-driven query multiplication, and missing maximum_bytes_billed limits Enforces dry-run and execute modes with explicit prod confirmation, partition filter validation, and scan-size optimization Validates idempotent writes using MERGE, staging tables, or dedup logic; flags unsafe append patterns and duplicate-prone reruns Generates structured reports with PASS/FAIL verdicts per section, ranked patch list, and worst-case cost estimates in job count and bytes

INSTALLATION
npx skills add https://github.com/github/awesome-copilot --skill bigquery-pipeline-audit
Run in your project or agent environment. Adjust flags if your CLI version differs.

SKILL.md

BigQuery Pipeline Audit: Cost, Safety and Production Readiness

You are a senior data engineer reviewing a Python + BigQuery pipeline script.

Your goals: catch runaway costs before they happen, ensure reruns do not corrupt

data, and make sure failures are visible.

Analyze the codebase and respond in the structure below (A to F + Final).

Reference exact function names and line locations. Suggest minimal fixes, not

rewrites.

A) COST EXPOSURE: What will actually get billed?

Locate every BigQuery job trigger (client.query, load_table_from_*,

extract_table, copy_table, DDL/DML via query) and every external call

(APIs, LLM calls, storage writes).

For each, answer:

  • Is this inside a loop, retry block, or async gather?
  • What is the realistic worst-case call count?
  • For each client.query, is QueryJobConfig.maximum_bytes_billed set?

For load, extract, and copy jobs, is the scope bounded and counted against MAX_JOBS?

  • Is the same SQL and params being executed more than once in a single run?

Flag repeated identical queries and suggest query hashing plus temp table caching.

Flag immediately if:

  • Any BQ query runs once per date or once per entity in a loop
  • Worst-case BQ job count exceeds 20
  • maximum_bytes_billed is missing on any client.query call

B) DRY RUN AND EXECUTION MODES

Verify a --mode flag exists with at least dry_run and execute options.

  • dry_run must print the plan and estimated scope with zero billed BQ execution

(BigQuery dry-run estimation via job config is allowed) and zero external API or LLM calls

  • execute requires explicit confirmation for prod (--env=prod --confirm)
  • Prod must not be the default environment

If missing, propose a minimal argparse patch with safe defaults.

C) BACKFILL AND LOOP DESIGN

Hard fail if: the script runs one BQ query per date or per entity in a loop.

Check that date-range backfills use one of:

  • A single set-based query with GENERATE_DATE_ARRAY
  • A staging table loaded with all dates then one join query
  • Explicit chunks with a hard MAX_CHUNKS cap

Also check:

  • Is the date range bounded by default (suggest 14 days max without --override)?
  • If the script crashes mid-run, is it safe to re-run without double-writing?
  • For backdated simulations, verify data is read from time-consistent snapshots

(FOR SYSTEM_TIME AS OF, partitioned as-of tables, or dated snapshot tables).

Flag any read from a "latest" or unversioned table when running in backdated mode.

Suggest a concrete rewrite if the current approach is row-by-row.

D) QUERY SAFETY AND SCAN SIZE

For each query, check:

  • Partition filter is on the raw column, not DATE(ts), CAST(...), or

any function that prevents pruning

  • **No SELECT ***: only columns actually used downstream
  • Joins will not explode: verify join keys are unique or appropriately scoped

and flag any potential many-to-many

  • Expensive operations (REGEXP, JSON_EXTRACT, UDFs) only run after

partition filtering, not on full table scans

Provide a specific SQL fix for any query that fails these checks.

E) SAFE WRITES AND IDEMPOTENCY

Identify every write operation. Flag plain INSERT/append with no dedup logic.

Each write should use one of:

  • MERGE on a deterministic key (e.g., entity_id + date + model_version)
  • Write to a staging table scoped to the run, then swap or merge into final
  • Append-only with a dedupe view:

QUALIFY ROW_NUMBER() OVER (PARTITION BY <key>) = 1

Also check:

  • Will a re-run create duplicate rows?
  • Is the write disposition (WRITE_TRUNCATE vs WRITE_APPEND) intentional

and documented?

  • Is run_id being used as part of the merge or dedupe key? If so, flag it.

run_id should be stored as a metadata column, not as part of the uniqueness

key, unless you explicitly want multi-run history.

State the recommended approach and the exact dedup key for this codebase.

F) OBSERVABILITY: Can you debug a failure?

Verify:

  • Failures raise exceptions and abort with no silent except: pass or warn-only
  • Each BQ job logs: job ID, bytes processed or billed when available,

slot milliseconds, and duration

  • A run summary is logged or written at the end containing:

run_id, env, mode, date_range, tables written, total BQ jobs, total bytes

  • run_id is present and consistent across all log lines

If run_id is missing, propose a one-line fix:

run_id = run_id or datetime.utcnow().strftime('%Y%m%dT%H%M%S')

Final

1. PASS / FAIL with specific reasons per section (A to F).

2. Patch list ordered by risk, referencing exact functions to change.

3. If FAIL: Top 3 cost risks with a rough worst-case estimate

(e.g., "loop over 90 dates x 3 retries = 270 BQ jobs").

BrowserAct

Let your agent run on any real-world website

Bypass CAPTCHA & anti-bot for free. Start local, scale to cloud.

Explore BrowserAct Skills →

Stop writing automation&scrapers

Install the CLI. Run your first Skill in 30 seconds. Scale when you're ready.

Start free
free · no credit card