Name: querying-data-lake
Author: aws

SKILL.md

$27

Check for required tools and AWS access before running queries.

Constraints:

You MUST verify AWS MCP server tools are available (aws___call_aws) and run queries through them when present; fall back to AWS CLI only if the MCP server is unavailable

You MUST NOT fall back to shell or Bash for query execution — results must be captured via the MCP tool or aws athena CLI so output location and cost are tracked

You MUST confirm credentials with aws sts get-caller-identity and inform the user about any missing tools

2. Resolve Workgroup

Check caller identity, list workgroups, auto-select the best one (see workgroup-selection.md).

Constraints:

You MUST select a workgroup before submitting any query (prevents output-location errors)

You MUST present the selected workgroup and its output location to the user

You MUST NOT auto-escalate to a different workgroup on failure without user confirmation

3. Resolve the Target Asset

If the user refers to a table by name, by business concept ("our quarterly report", "the sales data"), by S3 path, or by catalog without specifying the table, delegate to finding-data-lake-assets to return the concrete database.table (and catalog if non-default).

Constraints:

You MUST NOT attempt to resolve fuzzy asset references with athena list-data-catalogs or by iterating get-tables — those miss federated catalogs and waste tokens

You SHOULD skip this step only when the user provides a fully-qualified reference (exact database.table) or raw SQL they want executed as-is

You MUST state the resolved asset explicitly before building the query: "Found [table] in [catalog]. Using this for the query."

You SHOULD default to the default Glue catalog unless the user mentions "federated", "Redshift", "S3 Tables", or finding-data-lake-assets returns a different catalog

4. Discover Schema

For analytical queries, You SHOULD profile the target table before building the final query. You MUST show sample rows (SELECT ... LIMIT 5) as part of profiling.

5. Build Query

Table addressing depends on catalog type:

Default Glue catalog: database.table (omit the catalog prefix for single-catalog queries). In cross-catalog queries, qualify default-catalog tables with "awsdatacatalog".database.table.

Registered data source: datasource.database.table

Unregistered Glue catalog: "catalog/subcatalog".database.table

6. Classify and Execute

Classify the SQL statement before executing:

Statement

Behavior

SELECT, SHOW, DESCRIBE, EXPLAIN

Safe — execute

INSERT, UPDATE, DELETE, DROP, ALTER, CREATE, TRUNCATE, MERGE

Destructive — warn the user and require explicit confirmation

Unsure

Treat as destructive; confirm

Example tool call (via AWS MCP server):

aws___call_aws(command="aws athena start-query-execution --work-group <WORKGROUP_NAME> --query-string '<sql>' --query-execution-context Database=<db>")

For federated or S3 Tables catalogs, also set Catalog=<CATALOG_PATH> in the execution context (e.g. Catalog=s3tablescatalog/<BUCKET_NAME>).

Constraints:

You MUST warn the user before executing when the target is Redshift-federated ("No partition pruning — every query scans the full table")

You MUST warn the user before executing a cross-catalog join ("Cross-catalog joins incur network overhead and may be slow")

You MUST confirm the output S3 location before executing

You MUST explain which tool is being called before executing

You MUST respect the user's decision to abort

7. Present and Recover

Present results with cost, data scanned, duration, and actionable insights. On failure, list available workgroups and let the user choose which to retry with.

Argument Routing

Resolve in this order; stop at the first match:

Contains SQL keywords (SELECT, SHOW, DESCRIBE, INSERT, etc.) — SQL text, execute directly

profile TABLE_NAME — run comprehensive table profiling (see query-patterns.md)

Matches a known named query — look up and execute

Matches a known workgroup — show workgroup status and recent queries

Matches a known catalog — delegate to exploring-data-catalog to enumerate databases and tables

No args — show recent query activity and available tables

Principles

Always select workgroup before executing (prevents output-location errors)

Profile unfamiliar tables before running analytical queries

Present cost alongside results so users build cost awareness

Suggest LIMIT for exploratory queries on large tables

Never ask domain questions with obvious answers, but always confirm security-relevant actions (workgroup switches, output location changes, non-SELECT statements)

Troubleshooting

Error

Cause

Fix

Redshift identifier error with mixed case

Redshift-federated names are lowercase only

Lowercase the identifier

CatalogId validation failure

ARN passed instead of catalog name

Pass the catalog name, not the ARN

Cross-catalog information_schema returns nothing

Missing catalog qualifier

Use catalog-qualified path: "catalog".information_schema.tables

Query fails with output-location error

Workgroup has no output location configured

Select a different workgroup with an output location, or configure one

Destructive statement executed without confirmation

Statement classification skipped

Always classify INSERT/UPDATE/DELETE/DROP/ALTER/CREATE/TRUNCATE/MERGE and confirm with the user

Additional Resources

Workgroup selection logic

Common query patterns

Athena best practices

Athena federated query

querying-data-lake

SKILL.md

2. Resolve Workgroup

3. Resolve the Target Asset

4. Discover Schema

5. Build Query

6. Classify and Execute

7. Present and Recover

Argument Routing

Principles

Troubleshooting

Additional Resources

Let your agent run on any real-world website

Related skills

Stop writing automation&scrapers