AWS Glue and Data Lake Patterns

Common patterns for building curated data lakes with AWS Glue, S3, and the Glue Data Catalog.

Intermediate · 60 min · 3 lessons

Medallion Architecture on AWS

Lake Zones and Glue Jobs

The medallion architecture organizes a data lake into progressive zones of quality and refinement: bronze (raw), silver (curated), and gold (consumption-ready). On AWS, S3 holds data across zones, Glue jobs transform between them, and the Glue Data Catalog provides a Hive-compatible metastore for Athena, Spark, and Redshift Spectrum.

Bronze: Raw Landing

Bronze stores immutable copies of source data as landed: JSON from APIs, CSV extracts, CDC streams, or database snapshots. Preserve the original schema and add ingestion metadata such as _ingested_at and _source_file.

Partition bronze by ingestion date and source system:

s3://company-lake/bronze/shopify/orders/year=2024/month=01/day=15/

Avoid transforming bronze; it is your replay and audit layer.
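
As a minimal sketch of the partition layout above, the helper below builds a bronze prefix from a source system, an entity, and an ingestion date. The function name bronze_prefix and its parameters are illustrative assumptions, not part of any AWS API.

```python
from datetime import date


def bronze_prefix(bucket: str, source: str, entity: str, ingested_on: date) -> str:
    """Build a Hive-style bronze prefix partitioned by ingestion date."""
    return (
        f"s3://{bucket}/bronze/{source}/{entity}/"
        f"year={ingested_on:%Y}/month={ingested_on:%m}/day={ingested_on:%d}/"
    )


# Reproduces the example path shown above.
print(bronze_prefix("company-lake", "shopify", "orders", date(2024, 1, 15)))
```

Zero-padded month and day values keep prefixes in date order when listed lexicographically.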

Silver: Curated and Conformed

Silver applies schema enforcement, deduplication, type casting, and business rules. Records that fail validation are routed to quarantine paths for investigation rather than being silently dropped.
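
One way to picture validation with a quarantine route is the plain-Python sketch below. It casts each record against an expected contract and keeps failing rows, with a reason attached, instead of discarding them. The contract, the column names, and the _quarantine_reason field are hypothetical; a real Glue job would apply the same idea to DataFrames.

```python
from __future__ import annotations

from datetime import datetime

# Hypothetical contract: column name -> function that casts the raw value.
ORDER_CONTRACT = {
    "order_id": str,
    "amount": float,
    "updated_at": datetime.fromisoformat,
}


def validate(record: dict, contract: dict) -> tuple[dict | None, str | None]:
    """Return (clean_record, None) on success or (None, reason) on failure."""
    clean = {}
    for column, cast in contract.items():
        if record.get(column) is None:
            return None, f"missing {column}"
        try:
            clean[column] = cast(record[column])
        except (TypeError, ValueError):
            return None, f"bad {column}: {record[column]!r}"
    return clean, None


def split_batch(records: list[dict], contract: dict) -> tuple[list[dict], list[dict]]:
    """Route valid rows to silver and failing rows, with a reason, to quarantine."""
    silver, quarantine = [], []
    for record in records:
        clean, reason = validate(record, contract)
        if clean is not None:
            silver.append(clean)
        else:
            quarantine.append({**record, "_quarantine_reason": reason})
    return silver, quarantine
```

Keeping the original record alongside the failure reason makes quarantined rows easier to investigate and replay once the source issue is fixed.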

Silver tables are the foundation for multiple gold marts. Invest in data quality tests here.

Gold: Consumption-Ready

Gold models serve specific analytics domains: customer 360, finance ledger, product catalog. They are often shaped as star schemas or wide feature tables optimized for known query patterns.

Gold can live in S3 (Parquet/Iceberg) queried via Athena, or sync to Redshift/Snowflake for BI workloads.

Glue Job Design

Glue jobs are PySpark or Scala scripts managed by AWS Glue. Standard structure:

Read from catalog table or S3 path
Validate schema against expected contract
Transform (cleanse, join, aggregate)
Write to target zone with partition strategy
Update catalog partitions

Use job parameters for environment-specific paths rather than hardcoding bucket names.
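
A small sketch of parameter-driven paths follows. It assumes the job receives its parameters as a dict, which is what awsglue.utils.getResolvedOptions returns; the parameter names LAKE_BUCKET, SOURCE_ZONE, TARGET_ZONE, and DATASET are hypothetical.

```python
def resolve_paths(args: dict) -> dict:
    """Derive source and target prefixes from job parameters.

    `args` mirrors the dict returned by awsglue.utils.getResolvedOptions.
    The parameter names used here are hypothetical.
    """
    bucket = args["LAKE_BUCKET"]
    dataset = args["DATASET"]
    return {
        "source": f"s3://{bucket}/{args['SOURCE_ZONE']}/{dataset}/",
        "target": f"s3://{bucket}/{args['TARGET_ZONE']}/{dataset}/",
    }


# In a Glue job this dict would come from something like:
#   getResolvedOptions(sys.argv, ["JOB_NAME", "LAKE_BUCKET", "SOURCE_ZONE", "TARGET_ZONE", "DATASET"])
dev_args = {
    "LAKE_BUCKET": "company-lake-dev",
    "SOURCE_ZONE": "bronze",
    "TARGET_ZONE": "silver",
    "DATASET": "shopify/orders",
}
print(resolve_paths(dev_args))
```

The same script can then be deployed to another environment by changing only the parameter values.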

Key Takeaways

Organize S3 into bronze/silver/gold zones with clear transformation boundaries.
Preserve raw data immutably in bronze for replay and audit.
Enforce quality in silver; optimize shape in gold.
Parameterize Glue jobs for portable deployment across environments.

Reflection

Where would you add schema enforcement in your lake today—bronze, silver, or both? What is the cost of enforcing too early vs too late?

Incremental Processing and Job Bookmarks

Processing Only What Changed

Full reloads of terabyte tables are expensive and slow. Incremental processing loads only new or changed records since the last successful run. AWS Glue job bookmarks track processed files or rows to enable incrementals without manual watermark management.

How Job Bookmarks Work

When enabled, Glue records state about processed S3 objects or JDBC offsets. On subsequent runs, the job reads only unprocessed data. Bookmarks persist in the Glue service—reset them carefully when logic changes require reprocessing.
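
The behavior described above can be pictured with a toy model. This is not how Glue stores bookmark state internally; it is a simplified stand-in, with a hypothetical FileBookmark class, that shows why committed state lets later runs skip files and why a reset forces reprocessing.

```python
class FileBookmark:
    """Toy stand-in for bookmark state: remembers which object keys were processed."""

    def __init__(self):
        self._committed = set()

    def unprocessed(self, keys):
        """Return the keys that no successful run has committed yet."""
        return sorted(k for k in keys if k not in self._committed)

    def commit(self, keys):
        """Record keys as processed; call only after the run succeeds."""
        self._committed.update(keys)

    def reset(self):
        """Forget all state so the next run reprocesses everything."""
        self._committed.clear()


bookmark = FileBookmark()

# Run 1: two files have landed; both are new.
run1 = bookmark.unprocessed(["day=15/a.json", "day=15/b.json"])
bookmark.commit(run1)

# Run 2: one more file has landed; only that file is returned.
run2 = bookmark.unprocessed(["day=15/a.json", "day=15/b.json", "day=16/c.json"])
print(run2)
```

If a run fails before commit is called, the same files are returned on the next run, which is the retry behavior an incremental job needs.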

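Enable bookmarks in the script by initializing the job, and commit at the end of the run so the bookmark state is saved:

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
job = Job(glueContext)  # Job comes from awsglue.job
job.init(args["JOB_NAME"], args)  # Required for bookmarks
job.commit()  # Last line of the script; persists bookmark state

Then set the job parameter in the job details: --job-bookmark-option job-bookmark-enable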

Incremental Patterns

Append-only sources: Filter by ingestion timestamp or file landing time. Bookmarks track files.

CDC streams: Apply change records (insert/update/delete) using merge logic keyed on primary key. Track high-water mark on change_sequence.

Partition overwrite: For daily batch sources, overwrite the partition for execution_date only. Idempotent and simple; works well when partitions are self-contained.
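
The CDC pattern above can be sketched in plain Python as follows. The record layout (op, order_id, row) is a hypothetical example; only change_sequence comes from the description above. On Iceberg or Delta tables the same logic would normally be expressed through the table format's merge support rather than an in-memory dict.

```python
def apply_changes(target: dict, changes: list, high_water_mark: int) -> int:
    """Apply CDC records to `target` (primary key -> row) in sequence order.

    Records at or below `high_water_mark` are skipped, so replaying a batch
    is harmless. Returns the new high-water mark.
    """
    for change in sorted(changes, key=lambda c: c["change_sequence"]):
        if change["change_sequence"] <= high_water_mark:
            continue  # already applied by an earlier run
        key = change["order_id"]
        if change["op"] == "delete":
            target.pop(key, None)
        else:
            # Inserts and updates are both treated as upserts.
            target[key] = change["row"]
        high_water_mark = change["change_sequence"]
    return high_water_mark


orders = {}
batch = [
    {"op": "insert", "order_id": "A", "change_sequence": 1, "row": {"status": "new"}},
    {"op": "update", "order_id": "A", "change_sequence": 3, "row": {"status": "shipped"}},
    {"op": "insert", "order_id": "B", "change_sequence": 2, "row": {"status": "new"}},
    {"op": "delete", "order_id": "B", "change_sequence": 4, "row": None},
]
mark = apply_changes(orders, batch, high_water_mark=0)
print(orders, mark)
```

Sorting by change_sequence before applying matters because change records can arrive out of order within a batch.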

Merge and Deduplication

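Silver incrementals typically deduplicate the incoming batch before merging it into the target:

from awsglue.dynamicframe import DynamicFrame
from pyspark.sql import Window
from pyspark.sql import functions as F

# Dedupe by primary key, keeping the latest row by updated_at.
# orderBy() followed by dropDuplicates() does not guarantee which row
# survives, so rank the rows within each key explicitly instead.
latest_first = Window.partitionBy("order_id").orderBy(F.col("updated_at").desc())
deduped_df = (
    df.withColumn("_rn", F.row_number().over(latest_first))
    .filter(F.col("_rn") == 1)
    .drop("_rn")
)
deduped = DynamicFrame.fromDF(deduped_df, glueContext, "deduped")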

For Iceberg or Delta Lake tables on S3, use MERGE INTO semantics supported by the table format for ACID upserts.

Handling Late-Arriving Data

Define a lookback window: reprocess the last N days of partitions to catch late records. Balance completeness against compute cost.
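
A lookback window reduces to a list of partitions to reprocess on each run. The sketch below assumes the year=/month=/day= layout used for bronze; the function name lookback_partitions is illustrative.

```python
from datetime import date, timedelta


def lookback_partitions(run_date: date, lookback_days: int) -> list:
    """Partition suffixes for run_date and the lookback_days before it, oldest first."""
    days = [run_date - timedelta(days=offset) for offset in range(lookback_days, -1, -1)]
    return [f"year={d:%Y}/month={d:%m}/day={d:%d}/" for d in days]


# A 2-day lookback on 2024-01-02 reaches back across the year boundary.
print(lookback_partitions(date(2024, 1, 2), 2))
```

Each extra day of lookback adds one partition to every run, which is the cost side of the completeness trade-off.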

Document late-arrival SLAs with source system owners.

Key Takeaways

Use Glue job bookmarks for file and JDBC incrementals.
Match incremental strategy to source change pattern (append, CDC, partition).
Deduplicate in silver with explicit key and recency logic.
Plan lookback windows for late-arriving data.

Reflection

Which tables in your lake still full-refresh daily? Could incrementals reduce cost and runtime?

Catalog, Governance, and Lake Formation

Making the Lake Discoverable and Governed

A lake without catalog and governance is a swamp—teams cannot find datasets, access is all-or-nothing, and sensitive data proliferates unchecked.

Glue Data Catalog

The Glue Data Catalog stores table definitions, schemas, and partition metadata. Crawlers can infer schemas from S3 paths, but explicit table definitions via Infrastructure as Code (CloudFormation, Terraform) are more reliable for production.

A Glue job can also register schema changes and new partitions in the catalog as it writes:

glueContext.write_dynamic_frame.from_catalog(
    frame=result,
    database="silver",
    table_name="orders",
    additional_options={"enableUpdateCatalog": True, "updateBehavior": "UPDATE_IN_DATABASE"},
)

AWS Lake Formation

Lake Formation adds fine-grained access control on catalog resources. Define data lake administrators, grant permissions by table/column, and integrate with IAM and SSO.

Use LF-Tags for classification-based access: tag columns PII and grant analysts access to non-PII columns only.
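
The effect of tag-based access can be illustrated with a toy model. This sketch does not call the Lake Formation API, and the tag key classification, its values, and the column names are hypothetical; it only shows how a grant expressed over tags determines which columns a principal can see.

```python
# Hypothetical LF-Tag assignments: column -> {tag key: tag value}.
COLUMN_TAGS = {
    "order_id": {"classification": "non-pii"},
    "email": {"classification": "pii"},
    "total": {"classification": "non-pii"},
}

# Hypothetical grant for analysts: tag key -> set of allowed values.
ANALYST_GRANT = {"classification": {"non-pii"}}


def visible_columns(column_tags: dict, grant: dict) -> list:
    """Columns whose tags satisfy every key in the grant expression."""
    return [
        column
        for column, tags in column_tags.items()
        if all(tags.get(key) in allowed for key, allowed in grant.items())
    ]


print(visible_columns(COLUMN_TAGS, ANALYST_GRANT))
```

Because access follows the tag rather than the column name, a newly added column tagged as PII is excluded from analyst access without changing the grant.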

Cross-account sharing uses Lake Formation permissions without copying data—similar in spirit to Snowflake sharing.

Integration with Downstream Tools

Athena provides serverless SQL on catalog tables—ideal for ad hoc exploration and lightweight marts.

Redshift Spectrum queries external tables registered in Glue—useful for federated queries without full loads.

Sync gold tables to warehouses (Redshift COPY, Snowflake external tables) when BI tools need lower latency or richer SQL features.

Operational Hygiene

Version Glue scripts in Git; deploy via CI/CD
Log job metrics (duration, records processed, quarantine counts) to CloudWatch
Alert on job failures and quarantine volume spikes
Review IAM roles quarterly—Glue execution roles often accumulate excessive S3 permissions
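
As one possible shape for the quarantine alert above, the sketch below compares a run's quarantine rate with a baseline. The function names, the 3x factor, and the 1% floor are assumptions for illustration; in practice the rate would be published as a CloudWatch metric and the threshold set in an alarm.

```python
def quarantine_rate(records_read: int, records_quarantined: int) -> float:
    """Share of the records read in a run that ended up in quarantine."""
    if records_read == 0:
        return 0.0
    return records_quarantined / records_read


def should_alert(rate: float, baseline: float, factor: float = 3.0, floor: float = 0.01) -> bool:
    """Alert when the rate is above the floor and more than `factor` times the baseline."""
    return rate >= floor and rate > baseline * factor


rate = quarantine_rate(records_read=1000, records_quarantined=50)
print(rate, should_alert(rate, baseline=0.01))
```

The floor keeps very small runs or near-zero baselines from triggering alerts on a handful of bad rows.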

Key Takeaways

Manage catalog tables as code; avoid schema drift from uncontrolled crawlers.
Use Lake Formation for column-level governance and cross-account sharing.
Integrate Athena, Spectrum, or warehouse sync based on consumer needs.
Monitor jobs and audit access permissions regularly.

Reflection

Can an analyst on your team find and query a curated dataset without asking an engineer for the S3 path? What catalog or documentation gap blocks them?