ingesting-into-data-lake

INSTALLATION
npx skills add https://github.com/aws/agent-toolkit-for-aws --skill ingesting-into-data-lake
Run in your project or agent environment. Adjust flags if your CLI version differs.

SKILL.md

2. Classify the Source

| User says... | Source type | Reference |
|---|---|---|
| "upload my file", "local CSV", "move to S3" | Local file | local-upload.md |
| "load from S3", "import CSV/JSON/Parquet from s3://" | S3 files | s3-files.md |
| "import from Oracle/Postgres/MySQL/SQL Server/Redshift/RDS/Aurora" | JDBC | jdbc-ingest.md |
| "pull from Snowflake", "Snowflake table to S3" | Snowflake | snowflake-ingest.md |
| "import from BigQuery", "GCP analytics to S3" | BigQuery | bigquery-ingest.md |
| "export DynamoDB", "DynamoDB to data lake" | DynamoDB | dynamodb-ingest.md |
| "migrate Glue table", "convert Hive to Iceberg" | Catalog migration | catalog-migration.md |

If the user names Salesforce, ServiceNow, SAP, MongoDB, Kafka, or another SaaS/streaming source, decline -- these are not supported in this release.

If the source table is referenced by a fuzzy or business name ("migrate our orders table", "pull from the sales warehouse"), delegate to finding-data-lake-assets to resolve before proceeding.
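The classification rules above can be sketched as a first-match keyword router. This is a minimal illustration, not part of the skill: the function name `classify_source`, the regex patterns, and the match order are all assumptions chosen to mirror the phrases in the table.

```python
import re

# Hypothetical patterns derived from the example phrases; first match wins.
UNSUPPORTED = [r"\bsalesforce\b", r"\bservicenow\b", r"\bsap\b",
               r"\bmongodb\b", r"\bkafka\b"]

ROUTES = [
    ([r"\bsnowflake\b"], "Snowflake", "snowflake-ingest.md"),
    ([r"\bbigquery\b", r"\bgcp\b"], "BigQuery", "bigquery-ingest.md"),
    ([r"\bdynamodb\b"], "DynamoDB", "dynamodb-ingest.md"),
    ([r"\bglue table\b", r"\bhive to iceberg\b"],
     "Catalog migration", "catalog-migration.md"),
    ([r"\boracle\b", r"\bpostgres", r"\bmysql\b", r"\bsql server\b",
      r"\bredshift\b", r"\brds\b", r"\baurora\b"], "JDBC", "jdbc-ingest.md"),
    ([r"\bupload\b", r"\blocal\b", r"\bmove to s3\b"],
     "Local file", "local-upload.md"),
    ([r"s3://", r"\bfrom s3\b"], "S3 files", "s3-files.md"),
]


def classify_source(request: str):
    """Return (source_type, reference) for a free-text request."""
    text = request.lower()
    # Unsupported SaaS/streaming sources are declined before any routing.
    if any(re.search(p, text) for p in UNSUPPORTED):
        return ("unsupported", None)
    for patterns, source_type, reference in ROUTES:
        if any(re.search(p, text) for p in patterns):
            return (source_type, reference)
    # Nothing matched: the agent should ask, or resolve a fuzzy name first.
    return ("unknown", None)
```

Word boundaries keep short tokens such as "rds" or "sap" from matching inside unrelated words. An "unknown" result is where a fuzzy or business name would be handed to finding-data-lake-assets.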

3. Confirm Connection Exists (if applicable)

For JDBC, Snowflake, and BigQuery sources, a Glue connection is required. Check:

aws glue get-connection --name <CONNECTION_NAME> --region <REGION>

If the connection does not exist, stop and delegate to connecting-to-data-source to create and test it. Do not proceed with ingest until the connection is verified.

Local files, S3 files, DynamoDB, and catalog migration do not need a Glue connection.
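The same check can be scripted. The sketch below assumes a boto3 Glue client is passed in by the caller; `connection_ready` and `NEEDS_GLUE_CONNECTION` are hypothetical names used only for illustration.

```python
# Source types that require a Glue connection, per the step above.
NEEDS_GLUE_CONNECTION = {"JDBC", "Snowflake", "BigQuery"}


def connection_ready(glue_client, source_type, connection_name):
    """Return True if no connection is needed or the named one exists.

    glue_client is expected to behave like boto3.client("glue").
    """
    if source_type not in NEEDS_GLUE_CONNECTION:
        return True
    try:
        glue_client.get_connection(Name=connection_name)
    except glue_client.exceptions.EntityNotFoundException:
        # Missing connection: stop and delegate to connecting-to-data-source.
        return False
    return True


# Example wiring (requires boto3 and AWS credentials):
#   import boto3
#   glue = boto3.client("glue", region_name="<REGION>")
#   connection_ready(glue, "JDBC", "<CONNECTION_NAME>")
```

This only confirms that the connection exists. Testing that it actually works remains the job of connecting-to-data-source.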

4. Clarify the Target

You MUST ask the user (or suggest based on catalog inventory) before creating or writing to any table:

  • Database/namespace: Does a specific target database exist? Or should one be created?
  • Table: Existing table (append/merge) or new table (delegate to creating-data-lake-table)?
  • Format: S3 Tables (default), standard Iceberg, or raw Parquet?

Inventory-aware defaults:

If you have already run exploring-data-catalog or can quickly check, use what exists:

  • Account has an s3tablescatalog federated catalog and active table buckets: recommend S3 Tables
  • Account has general-purpose buckets with Iceberg tables and no S3 Tables usage: recommend standard Iceberg on their existing bucket
  • Account uses Parquet/ORC on S3 without Iceberg metadata: ask whether to adopt Iceberg now (recommend yes) or continue with raw files

Do not force S3 Tables on customers who haven't adopted it. See iceberg-catalog-config-and-usage.md.
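The inventory-aware defaults can be read as a small decision function. This is a sketch under stated assumptions: the `Inventory` fields and `recommend_format` are hypothetical, and the result is only a suggestion to put in front of the user, never a substitute for asking.

```python
from dataclasses import dataclass


@dataclass
class Inventory:
    """Hypothetical summary of what a catalog exploration found."""
    has_s3tables_catalog: bool = False
    has_active_table_buckets: bool = False
    has_iceberg_on_general_buckets: bool = False
    has_raw_parquet_or_orc: bool = False


def recommend_format(inv: Inventory) -> str:
    """Map the inventory to the recommendation described above."""
    if inv.has_s3tables_catalog and inv.has_active_table_buckets:
        return "S3 Tables"
    if inv.has_iceberg_on_general_buckets:
        return "standard Iceberg on the existing bucket"
    if inv.has_raw_parquet_or_orc:
        return "ask: adopt Iceberg now (recommended) or continue with raw files"
    # Nothing found: fall back to the default format, still subject to confirmation.
    return "S3 Tables (default)"
```

Checking S3 Tables first and existing Iceberg second keeps the suggestion aligned with what the account already uses.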

Delegations from this step:

  • Target table doesn't exist -> creating-data-lake-table
  • Target database named by fuzzy term -> finding-data-lake-assets
  • User doesn't know what exists -> exploring-data-catalog

5. Execute Source Workflow

Read the source-specific reference and follow its phases. Each is self-contained with job templates, gotchas, and troubleshooting:

  • Local / S3 / JDBC / Snowflake / BigQuery / DynamoDB / catalog migration -- one reference per source

Job configuration and PySpark templates common to all sources (Glue 5.1 or higher) are in glue-job-config.md and glue-job-scripts.md.

6. Validate

Run all three checks; do not skip any:

  • Row count matches expected (source vs target)
  • Null check on critical columns
  • Spot-check 3-5 sample rows

See data-quality-validation.md.
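As an illustration of the three checks, here is a minimal sketch over rows already fetched into Python dictionaries. The function `validate_load` and its report keys are assumptions; data-quality-validation.md remains the reference for the real procedure.

```python
import random


def validate_load(source_count, target_rows, critical_columns,
                  sample_size=5, seed=0):
    """Run the three checks on target_rows (a list of dicts)."""
    report = {}
    # 1. Row count: source vs target.
    report["row_count_ok"] = source_count == len(target_rows)
    # 2. Null check on each critical column (a missing key counts as null).
    report["null_counts"] = {
        col: sum(1 for row in target_rows if row.get(col) is None)
        for col in critical_columns
    }
    # 3. Sample rows for a manual spot-check (3-5 is the suggested range).
    k = min(sample_size, len(target_rows))
    report["samples"] = random.Random(seed).sample(target_rows, k)
    # "passed" covers only the automated checks; samples need a human look.
    report["passed"] = report["row_count_ok"] and not any(
        report["null_counts"].values()
    )
    return report
```

For large tables the counts would come from queries against the source and target, not from in-memory rows; the structure of the report stays the same.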

7. Schedule (if recurring)

For recurring pipelines, create a Glue Trigger with a cron schedule. See testing-and-scheduling.md. Single-step pipelines use Glue Triggers; multi-step pipelines with branching use MWAA (Amazon Managed Workflows for Apache Airflow).
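A scheduled trigger can be sketched as a request dictionary for the Glue `create_trigger` API. The helper name, the default trigger name, and the example job name are hypothetical; the cron expression uses the six-field form that Glue schedules expect.

```python
def scheduled_trigger_request(job_name, cron_expression, trigger_name=None):
    """Build keyword arguments for glue.create_trigger (boto3)."""
    # Glue cron has six fields: minutes hours day-of-month month day-of-week year.
    if len(cron_expression.split()) != 6:
        raise ValueError("Glue cron expressions need six fields")
    return {
        "Name": trigger_name or f"{job_name}-schedule",  # assumed naming scheme
        "Type": "SCHEDULED",
        "Schedule": f"cron({cron_expression})",
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


# Example wiring (requires boto3 and AWS credentials):
#   import boto3
#   request = scheduled_trigger_request("orders-ingest", "0 2 * * ? *")
#   boto3.client("glue").create_trigger(**request)
```

The example expression `0 2 * * ? *` runs daily at 02:00 UTC.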

Argument Routing

  • S3 path only: Infer one-time load, start Step 2 with S3 files
  • Connection name: Start Step 3 with the named connection
  • Table name: Start Step 4, ask whether this is source or target
  • --target flag: Pre-fill the target format in Step 4
  • No args: Walk through interactively
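The routing rules above can be sketched as a function that picks the starting step from whichever arguments were supplied. The parameter names, the precedence when several arguments are given, and the shape of the returned plan are assumptions.

```python
def route_arguments(s3_path=None, connection=None, table=None, target=None):
    """Return a plan dict describing where the workflow should start."""
    plan = {
        "start_step": "interactive",  # no args: walk through interactively
        "target_format": target,      # --target pre-fills the Step 4 format
        "questions": [],
    }
    if s3_path:
        # S3 path only: infer a one-time load from S3 files.
        plan["start_step"] = 2
        plan["source_type"] = "S3 files"
        plan["load_mode"] = "one-time"
    elif connection:
        plan["start_step"] = 3
        plan["connection"] = connection
    elif table:
        plan["start_step"] = 4
        plan["questions"].append(f"Is '{table}' the source or the target?")
    return plan
```

For example, `route_arguments(table="orders", target="iceberg")` starts at Step 4 with the format pre-filled and one open question for the user.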

Gotchas

  • S3 Tables requires Glue 5.1 or higher and the --datalake-formats iceberg job argument
  • All spark.sql.catalog.* config MUST go in --conf job arguments, never in spark.conf.set(). Glue 5.x throws AnalysisException: Cannot modify the value of a static config otherwise. See iceberg-catalog-config-and-usage.md for correct catalog configs.
  • The warehouse parameter is required in S3 Tables catalog config. Without it Spark fails with "Cannot derive default warehouse location".
  • Table and column names in S3 Tables MUST be all lowercase
  • overwritePartitions() only replaces partitions present in the DataFrame -- for full refresh with deletes, use createOrReplace()
  • Standard Iceberg targets MUST include a LOCATION clause; S3 Tables MUST NOT
  • DynamoDB does not need a Glue connection -- do not attempt to create one
  • Connection failures during ingest delegate back to connecting-to-data-source; do not debug network/credentials in this skill
  • For target tables in SageMaker Unified Studio projects, ensure the project role has write access to the target namespace before the Glue job runs
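Several of these gotchas concern how job arguments are assembled, so a sketch may help. Everything specific here is an assumption: the helper names, the catalog property values (taken from common S3 Tables Iceberg setups), and the practice of chaining multiple settings inside one `--conf` value. iceberg-catalog-config-and-usage.md is the authority for the actual configs.

```python
def s3_tables_job_arguments(catalog_name, table_bucket_arn):
    """Build Glue job arguments with catalog config passed via --conf."""
    prefix = f"spark.sql.catalog.{catalog_name}"
    confs = [
        # Assumed property values; verify against the referenced config doc.
        f"{prefix}=org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl=software.amazon.s3tables.iceberg.S3TablesCatalog",
        # warehouse is required, otherwise Spark cannot derive a default location.
        f"{prefix}.warehouse={table_bucket_arn}",
    ]
    return {
        "--datalake-formats": "iceberg",
        # Static catalog settings go here, never in spark.conf.set().
        "--conf": " --conf ".join(confs),
    }


def require_lowercase(name):
    """Reject S3 Tables table or column names that are not all lowercase."""
    if name != name.lower():
        raise ValueError(f"S3 Tables names must be lowercase: {name!r}")
    return name
```

Running `require_lowercase` on every table and column name before the job starts surfaces naming problems early instead of at write time.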

Troubleshooting

| Error | Likely cause | Action |
|---|---|---|
| Access Denied on S3 | Missing IAM permissions | Check Glue role has s3:GetObject, s3:PutObject |
| Access Denied on S3 Tables | Missing s3tables:* permissions | Add S3 Tables inline policy to Glue role |
| CTAS timeout | Dataset too large for Athena | Switch to Glue ETL or batch with WHERE filters |
| JDBC connection timeout/auth failure | Connection-level issue | Delegate to connecting-to-data-source |
| Throughput exceeded (DynamoDB) | Read percent too high | Lower read.percent or use native export |

See error-handling.md for the full catalog.

References

Source-specific

  • s3-files.md -- S3 files (CSV, JSON, Parquet, Avro, ORC)
  • jdbc-ingest.md -- Oracle, SQL Server, PostgreSQL, MySQL, RDS, Aurora, Redshift

Cross-cutting

  • error-handling.md -- Ingest errors (connection errors delegate to connecting-to-data-source)

Migration-specific

JDBC-specific
