AWS Glue and Data Lake Patterns
Common patterns for building curated data lakes with AWS Glue, S3, and the Glue Data Catalog.
Medallion Architecture on AWS
Lake Zones and Glue Jobs
The medallion architecture organizes a data lake into progressive zones of quality and refinement: bronze (raw), silver (curated), and gold (consumption-ready). On AWS, S3 holds data across zones, Glue jobs transform between them, and the Glue Data Catalog provides a Hive-compatible metastore for Athena, Spark, and Redshift Spectrum.
Bronze: Raw Landing
Bronze stores immutable copies of source data as landed—JSON from APIs, CSV extracts, CDC streams, or database snapshots. Preserve original schema and ingestion metadata (_ingested_at, _source_file).
Partition bronze by ingestion date and source system:
s3://company-lake/bronze/shopify/orders/year=2024/month=01/day=15/
Avoid transforming bronze; it is your replay and audit layer.
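A minimal sketch of how such a prefix could be built consistently across ingestion jobs. The function name and argument layout are illustrative assumptions, not part of any AWS API; only the resulting path shape comes from the example above.

```python
from datetime import date


def bronze_prefix(bucket: str, source: str, entity: str, ingested_on: date) -> str:
    """Build a Hive-style bronze prefix partitioned by ingestion date.

    Hypothetical helper: the zone/source/entity ordering mirrors the example path.
    """
    return (
        f"s3://{bucket}/bronze/{source}/{entity}/"
        f"year={ingested_on:%Y}/month={ingested_on:%m}/day={ingested_on:%d}/"
    )


prefix = bronze_prefix("company-lake", "shopify", "orders", date(2024, 1, 15))
print(prefix)
```

Zero-padded month and day values keep prefixes lexically sortable, which makes listing a date range by prefix straightforward.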
Silver: Curated and Conformed
Silver applies schema enforcement, deduplication, type casting, and business rules. Records that fail validation are routed to quarantine paths for investigation rather than being silently dropped.
Silver tables are the foundation for multiple gold marts. Invest in data quality tests here.
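The quarantine routing described above can be sketched in plain Python. This is an illustration of the pattern under stated assumptions, not a Glue API: the contract, column names, and function name are hypothetical, and a real job would apply the same split to a Spark DataFrame.

```python
from datetime import datetime
from typing import Any, Callable

# Hypothetical contract for an orders table: column name -> casting function.
ORDER_CONTRACT: dict[str, Callable[[Any], Any]] = {
    "order_id": str,
    "amount": float,
    "updated_at": datetime.fromisoformat,
}


def split_valid_and_quarantined(records, contract):
    """Cast each record to the contract; route failures to quarantine with a reason."""
    valid, quarantined = [], []
    for record in records:
        try:
            valid.append({col: cast(record[col]) for col, cast in contract.items()})
        except (KeyError, TypeError, ValueError) as exc:
            # Keep the original record untouched so it can be investigated later.
            quarantined.append({"record": record, "error": f"{type(exc).__name__}: {exc}"})
    return valid, quarantined


raw = [
    {"order_id": 1001, "amount": "19.99", "updated_at": "2024-01-15T10:00:00"},
    {"order_id": 1002, "amount": "not-a-number", "updated_at": "2024-01-15T10:05:00"},
    {"order_id": 1003, "updated_at": "2024-01-15T10:06:00"},  # missing amount
]
valid, quarantined = split_valid_and_quarantined(raw, ORDER_CONTRACT)
print(len(valid), "valid,", len(quarantined), "quarantined")
```

Storing the failure reason next to the rejected record makes quarantine counts by error type easy to report.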
Gold: Consumption-Ready
Gold models serve specific analytics domains: customer 360, finance ledger, product catalog. They are often shaped as star schemas or wide feature tables optimized for known query patterns.
Gold can live in S3 (Parquet/Iceberg) queried via Athena, or sync to Redshift/Snowflake for BI workloads.
Glue Job Design
Glue ETL jobs are typically PySpark or Scala scripts that AWS Glue runs on managed Spark. A standard job follows these steps:
- Read from catalog table or S3 path
- Validate schema against expected contract
- Transform (cleanse, join, aggregate)
- Write to target zone with partition strategy
- Update catalog partitions
Use job parameters for environment-specific paths rather than hardcoding bucket names.
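As a rough sketch of parameterized paths: inside a Glue job the arguments would normally be read with getResolvedOptions, but the version below uses argparse from the standard library so it runs anywhere. The parameter names (--lake_bucket, --source, --entity) and the silver and quarantine layouts are assumptions made for illustration.

```python
import argparse


def resolve_job_paths(argv: list[str]) -> dict[str, str]:
    """Derive zone paths from job arguments instead of hardcoding bucket names."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--lake_bucket", required=True)
    parser.add_argument("--source", required=True)
    parser.add_argument("--entity", required=True)
    # Other arguments (for example --JOB_NAME) are left alone rather than rejected.
    args, _unknown = parser.parse_known_args(argv)
    base = f"s3://{args.lake_bucket}"
    return {
        "bronze": f"{base}/bronze/{args.source}/{args.entity}/",
        "silver": f"{base}/silver/{args.entity}/",
        "quarantine": f"{base}/quarantine/{args.source}/{args.entity}/",
    }


paths = resolve_job_paths(
    [
        "--JOB_NAME", "orders_bronze_to_silver",
        "--lake_bucket", "company-lake-dev",
        "--source", "shopify",
        "--entity", "orders",
    ]
)
print(paths["bronze"])
```

With this shape, promoting a job from dev to prod only changes the value passed for the bucket parameter, not the script.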
Key Takeaways
- Organize S3 into bronze/silver/gold zones with clear transformation boundaries.
- Preserve raw data immutably in bronze for replay and audit.
- Enforce quality in silver; optimize shape in gold.
- Parameterize Glue jobs for portable deployment across environments.
Reflection
Where would you add schema enforcement in your lake today—bronze, silver, or both? What is the cost of enforcing too early vs too late?
Incremental Processing and Job Bookmarks
Processing Only What Changed
Full reloads of terabyte tables are expensive and slow. Incremental processing loads only new or changed records since the last successful run. AWS Glue job bookmarks track processed files or rows to enable incrementals without manual watermark management.
How Job Bookmarks Work
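When bookmarks are enabled, Glue records state about what a job has already processed: S3 objects by last-modified time and, for JDBC sources, the highest values seen in the bookmark key columns. On subsequent runs, the job reads only unprocessed data. Each source read needs a transformation_ctx so Glue can associate state with it. Bookmarks persist in the Glue service, so reset them carefully when logic changes require reprocessing.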
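Enable in the job script (assuming sys, getResolvedOptions from awsglue.utils, and Job from awsglue.job are imported, and glueContext already exists):
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
job = Job(glueContext)
job.init(args["JOB_NAME"], args)  # Required for bookmarks
job.commit()  # At the end of the script; persists bookmark state
And in the job parameters: --job-bookmark-option job-bookmark-enable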
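To make the bookmark idea concrete, here is a toy model of the state it keeps. It is a conceptual stand-in only: Glue's real bookmark storage and matching rules are internal to the service, and the class and method names below are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class FileBookmark:
    """Toy stand-in for bookmark state: remembers which object keys were processed."""

    processed: set[str] = field(default_factory=set)

    def unprocessed(self, listing: list[str]) -> list[str]:
        """Return only the keys not seen by a previously committed run."""
        return sorted(key for key in listing if key not in self.processed)

    def commit(self, keys: list[str]) -> None:
        # State advances only after a successful run commits.
        self.processed.update(keys)

    def reset(self) -> None:
        # Forces the next run to reprocess everything.
        self.processed.clear()


bookmark = FileBookmark()
run1 = bookmark.unprocessed(["orders/a.json", "orders/b.json"])
bookmark.commit(run1)
run2 = bookmark.unprocessed(["orders/a.json", "orders/b.json", "orders/c.json"])
print(run1, run2)
```

If a run fails before commit, the next run sees the same files again, which is why downstream writes should be idempotent.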
Incremental Patterns
Append-only sources: Filter by ingestion timestamp or file landing time. Bookmarks track files.
CDC streams: Apply change records (insert/update/delete) using merge logic keyed on primary key. Track high-water mark on change_sequence.
Partition overwrite: For daily batch sources, overwrite the partition for execution_date only. Idempotent and simple; works well when partitions are self-contained.
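The CDC pattern above can be sketched with an in-memory target keyed on the primary key. The record shape (op, order_id, row, change_sequence) is an assumption for illustration; real CDC feeds differ by tool.

```python
def apply_cdc(target: dict, changes: list[dict], high_water_mark: int) -> int:
    """Apply changes newer than high_water_mark in sequence order.

    Mutates target (primary key -> row) and returns the new high-water mark.
    """
    for change in sorted(changes, key=lambda c: c["change_sequence"]):
        if change["change_sequence"] <= high_water_mark:
            continue  # already applied in an earlier run
        if change["op"] == "delete":
            target.pop(change["order_id"], None)
        else:
            # Inserts and updates are both treated as upserts.
            target[change["order_id"]] = change["row"]
        high_water_mark = change["change_sequence"]
    return high_water_mark


target = {"A": {"status": "new"}}
changes = [
    {"change_sequence": 3, "op": "delete", "order_id": "B"},
    {"change_sequence": 1, "op": "update", "order_id": "A", "row": {"status": "paid"}},
    {"change_sequence": 2, "op": "insert", "order_id": "B", "row": {"status": "new"}},
]
mark = apply_cdc(target, changes, high_water_mark=0)
replayed_mark = apply_cdc(target, changes, high_water_mark=mark)  # no-op replay
print(target, mark)
```

Sorting by change_sequence before applying matters because change files can arrive out of order, and skipping anything at or below the mark makes a replay harmless.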
Merge and Deduplication
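Silver incrementals typically deduplicate the incoming batch before merging it into the target:
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql import Window
from pyspark.sql import functions as F
# Dedupe by primary key, keeping the latest row by updated_at.
# orderBy(...).dropDuplicates(...) does not guarantee which row survives,
# so rank rows within each key explicitly.
latest_first = Window.partitionBy("order_id").orderBy(F.col("updated_at").desc())
latest = (
    df.withColumn("_rn", F.row_number().over(latest_first))
    .filter(F.col("_rn") == 1)
    .drop("_rn")
)
deduped = DynamicFrame.fromDF(latest, glueContext, "deduped")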
For Iceberg or Delta Lake tables on S3, use MERGE INTO semantics supported by the table format for ACID upserts.
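A hedged sketch of what such a statement might look like, built as a string so it can be inspected before being passed to spark.sql(...) in a session configured for the table format. The table and view names are hypothetical, and the exact clauses supported depend on the format and engine version, so treat this as a starting point rather than a verified statement for your environment.

```python
def build_merge_sql(target: str, source: str, key: str, recency_col: str) -> str:
    """Build an upsert statement: update newer matches, insert unseen keys."""
    return (
        f"MERGE INTO {target} t\n"
        f"USING {source} s\n"
        f"ON t.{key} = s.{key}\n"
        f"WHEN MATCHED AND s.{recency_col} > t.{recency_col} THEN UPDATE SET *\n"
        f"WHEN NOT MATCHED THEN INSERT *"
    )


# Hypothetical names: an Iceberg table in the Glue catalog and a temp view of updates.
merge_sql = build_merge_sql(
    target="glue_catalog.silver.orders",
    source="orders_updates",
    key="order_id",
    recency_col="updated_at",
)
print(merge_sql)
```

The source side should contain at most one row per key, which is why the batch is deduplicated before the merge.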
Handling Late-Arriving Data
Define a lookback window: reprocess the last N days of partitions to catch late records. Balance completeness against compute cost.
Document late-arrival SLAs with source system owners.
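One way to turn a lookback window into a list of partitions to reprocess. The helper name is hypothetical and the year/month/day layout is assumed to match the bronze example.

```python
from datetime import date, timedelta


def lookback_partitions(execution_date: date, lookback_days: int) -> list[str]:
    """Partition specs for execution_date and the lookback_days before it, oldest first."""
    days = [execution_date - timedelta(days=n) for n in range(lookback_days, -1, -1)]
    return [f"year={d:%Y}/month={d:%m}/day={d:%d}" for d in days]


parts = lookback_partitions(date(2024, 1, 2), lookback_days=3)
for part in parts:
    print(part)
```

Using date arithmetic rather than string manipulation handles month and year boundaries correctly. Each extra lookback day adds one more partition to rewrite on every run, which is the compute cost to weigh against completeness.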
Key Takeaways
- Use Glue job bookmarks for file and JDBC incrementals.
- Match incremental strategy to source change pattern (append, CDC, partition).
- Deduplicate in silver with explicit key and recency logic.
- Plan lookback windows for late-arriving data.
Reflection
Which tables in your lake still full-refresh daily? Could incrementals reduce cost and runtime?
Catalog, Governance, and Lake Formation
Making the Lake Discoverable and Governed
A lake without catalog and governance is a swamp—teams cannot find datasets, access is all-or-nothing, and sensitive data proliferates unchecked.
Glue Data Catalog
The Glue Data Catalog stores table definitions, schemas, and partition metadata. Crawlers can infer schemas from S3 paths, but explicit table definitions via Infrastructure as Code (CloudFormation, Terraform) are more reliable for production.
Register partitions after each job write so Athena and Spectrum queries prune correctly:
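glueContext.write_dynamic_frame.from_catalog(
    frame=result,
    database="silver",
    table_name="orders",
    additional_options={
        "enableUpdateCatalog": True,
        "updateBehavior": "UPDATE_IN_DATABASE",
        # Assumed partition columns; needed for new partitions to be registered
        "partitionKeys": ["year", "month", "day"],
    },
)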
AWS Lake Formation
Lake Formation adds fine-grained access control on catalog resources. Define data lake administrators, grant permissions by table/column, and integrate with IAM and SSO.
Use LF-Tags for classification-based access: tag sensitive columns as PII and grant analysts access only to columns that do not carry that tag.
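The effect of tag-based grants can be illustrated with a toy model. This is not the Lake Formation API: the tag key, tag values, and column names are hypothetical, and the real service evaluates tag expressions itself.

```python
# Hypothetical column tags for a silver customers table.
COLUMN_TAGS = {
    "customer_id": {"classification": "internal"},
    "email": {"classification": "pii"},
    "country": {"classification": "public"},
    "lifetime_value": {"classification": "internal"},
}


def visible_columns(column_tags: dict, granted: dict) -> list[str]:
    """Columns whose tag values all fall within the granted values for that tag key."""
    return [
        column
        for column, tags in column_tags.items()
        if all(value in granted.get(key, set()) for key, value in tags.items())
    ]


# Analysts are granted every classification except pii.
analyst_grant = {"classification": {"public", "internal"}}
analyst_view = visible_columns(COLUMN_TAGS, analyst_grant)
print(analyst_view)
```

Granting by tag rather than by column name means a newly added PII column is hidden as soon as it is tagged, with no change to the grant.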
Cross-account sharing uses Lake Formation permissions without copying data—similar in spirit to Snowflake sharing.
Integration with Downstream Tools
Athena provides serverless SQL on catalog tables—ideal for ad hoc exploration and lightweight marts.
Redshift Spectrum queries external tables registered in Glue—useful for federated queries without full loads.
Sync gold tables to warehouses (Redshift COPY, Snowflake external tables) when BI tools need lower latency or richer SQL features.
Operational Hygiene
- Version Glue scripts in Git; deploy via CI/CD
- Log job metrics (duration, records processed, quarantine counts) to CloudWatch
- Alert on job failures and quarantine volume spikes
- Review IAM roles quarterly—Glue execution roles often accumulate excessive S3 permissions
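The metrics and alerting items above might be sketched as follows. The metric names, the spike factor, and the idea of emitting one JSON line per run are assumptions; publishing to CloudWatch and wiring up alarms are not shown.

```python
import json


def job_metrics(job_name: str, duration_s: float, records_in: int, records_quarantined: int) -> dict:
    """Summarize one job run as a flat dict suitable for a structured log line."""
    rate = records_quarantined / records_in if records_in else 0.0
    return {
        "job_name": job_name,
        "duration_s": duration_s,
        "records_in": records_in,
        "records_quarantined": records_quarantined,
        "quarantine_rate": rate,
    }


def quarantine_spiked(metrics: dict, baseline_rate: float, factor: float = 3.0) -> bool:
    """Flag a run whose quarantine rate exceeds factor times the usual baseline."""
    return metrics["quarantine_rate"] > baseline_rate * factor


metrics = job_metrics("orders_bronze_to_silver", 412.5, 200_000, 9_000)
print(json.dumps(metrics))
print("alert" if quarantine_spiked(metrics, baseline_rate=0.01) else "ok")
```

Alerting on the rate rather than the raw count keeps the threshold meaningful as volumes grow.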
Key Takeaways
- Manage catalog tables as code; avoid schema drift from uncontrolled crawlers.
- Use Lake Formation for column-level governance and cross-account sharing.
- Integrate Athena, Spectrum, or warehouse sync based on consumer needs.
- Monitor jobs and audit access permissions regularly.
Reflection
Can an analyst on your team find and query a curated dataset without asking an engineer for the S3 path? What catalog or documentation gap blocks them?