Building Enterprise Data Platforms

Architecture patterns for scalable, governed enterprise data platforms.

Advanced · 70 min · 4 lessons

Platform Capabilities and Maturity

What an Enterprise Data Platform Provides

An enterprise data platform is not a single tool—it is an integrated capability set that enables organizations to ingest, transform, govern, and consume data at scale. Mature platforms reduce time-to-insight while managing cost, risk, and compliance.

Core Capability Areas

Ingestion and orchestration: Connectors, streaming and batch pipelines, workflow scheduling (Airflow, Dagster, Step Functions), and dependency management.

Transformation engine: Warehouse-native SQL, Spark/Glue for lake processing, dbt for analytics engineering patterns.

Storage layers: Data lake (S3/ADLS), warehouse (Snowflake/BigQuery/Redshift), and optionally real-time stores (Kafka, Pinot).

Semantic and metrics layer: Canonical definitions for KPIs, dbt Semantic Layer, LookML, or dedicated metrics stores.

Governance and access: Catalog (Alation, Collibra, DataHub), classification, RBAC/ABAC, audit logging, data contracts.

Self-serve interfaces: BI tools, SQL workbenches, notebooks, and increasingly natural language interfaces grounded in governed metadata.

Observability and reliability: Freshness monitoring, incident management, SLAs, and platform health dashboards.

Maturity Stages

Organizations typically evolve through stages:

Ad hoc: Siloed databases, manual extracts, hero-driven pipelines
Foundational: Central warehouse, basic ETL, some documentation
Scaled: Lake + warehouse, orchestration standards, CI/CD for analytics code
Governed: Catalog adoption, access policies, data contracts, FinOps discipline
Productized: Internal data products, platform team with SLAs, self-serve with guardrails

Assess honestly where you sit: buying enterprise catalog software at the Foundational stage (stage 2) yields shelfware.
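One way to make the assessment concrete is to score each capability area against the five stages and rank the gaps. The sketch below is illustrative only: the function name rank_gaps, the sample scores, and the target stage are assumptions, not part of any standard framework.

```python
# Maturity stages from the list above, ordered lowest to highest.
STAGES = ["Ad hoc", "Foundational", "Scaled", "Governed", "Productized"]


def rank_gaps(current: dict[str, str], target: str) -> list[tuple[str, int]]:
    """Return capability areas that trail the target stage, largest gap first."""
    target_idx = STAGES.index(target)
    gaps = [
        (area, target_idx - STAGES.index(stage))
        for area, stage in current.items()
    ]
    # Keep only areas that are behind the target.
    behind = [(area, gap) for area, gap in gaps if gap > 0]
    # Sort by gap size (descending), then by name for a stable order.
    return sorted(behind, key=lambda item: (-item[1], item[0]))


# Hypothetical self-assessment; replace with your own scoring.
assessment = {
    "Ingestion and orchestration": "Scaled",
    "Transformation engine": "Scaled",
    "Governance and access": "Ad hoc",
    "Observability and reliability": "Foundational",
}

for area, gap in rank_gaps(assessment, target="Scaled"):
    print(f"{area}: {gap} stage(s) behind")
```

Areas with the largest gap are candidates for the next roadmap cycle; areas already at or above the target drop out of the list.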

Key Takeaways

Enterprise platforms integrate ingestion, transformation, storage, semantics, governance, and self-serve.
Maturity progresses from ad hoc to productized—match investments to stage.
Gap analysis by capability area clarifies platform roadmap priorities.

Reflection

Which capability is your platform's biggest gap? What stage best describes your organization today?

Reference Architecture Patterns

Patterns That Scale

Enterprise platforms combine patterns rather than inventing from scratch. Understanding reference architectures accelerates design decisions and stakeholder communication.

Lakehouse Pattern

Combines lake storage economics with warehouse-like ACID tables (Iceberg, Delta Lake, Hudi). Transformation engines (Spark, Trino) and warehouses (Snowflake external tables) query the same curated layers.

Best when: diverse workloads, ML on lake data, cost-sensitive storage at scale.

Hub-and-Spoke Warehouse

Central enterprise warehouse (Snowflake/BigQuery) as system of record for analytics with departmental marts as spokes. Ingestion hubs land data centrally; spokes enable domain autonomy within guardrails.

Best when: strong central analytics team, consistent dimensional models, BI-heavy consumption.

Data Mesh Influence

Domain teams own data products with federated governance. Central platform provides infrastructure, standards, and observability tooling; domains publish certified datasets with SLAs.

Best when: large organization, mature domains, platform team capacity for enablement rather than centralization.

Requires genuine domain engineering capacity—not just reorganizational labeling.

Event-Driven Real-Time Layer

Kafka or Kinesis streams feed real-time aggregates alongside batch marts. A Lambda architecture merges a speed layer and a batch layer at serving time; a Kappa architecture simplifies this by treating all processing as a stream and dropping the separate batch layer.

Best when: fraud detection, operational dashboards, personalization with low latency requirements.

Adds operational complexity—justify with clear latency SLAs.
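The serving-side merge in a Lambda-style design can be sketched as follows. This assumes an additive metric, a batch view that is complete up to a cutoff timestamp, and a speed layer holding events after that cutoff. The function name and sample data are hypothetical.

```python
from collections import Counter


def merge_views(
    batch_view: dict[str, int],
    speed_events: list[tuple[str, int, int]],
    batch_cutoff: int,
) -> dict[str, int]:
    """Combine precomputed batch totals with events newer than the batch cutoff."""
    merged = Counter(batch_view)
    for key, amount, event_ts in speed_events:
        # Events at or before the cutoff are already counted in the batch view.
        if event_ts > batch_cutoff:
            merged[key] += amount
    return dict(merged)


# Hypothetical data: totals per account and (key, amount, timestamp) events.
batch = {"acct_1": 100, "acct_2": 40}
events = [("acct_1", 5, 1001), ("acct_2", 7, 999), ("acct_3", 3, 1002)]

print(merge_views(batch, events, batch_cutoff=1000))
```

The cutoff comparison is where much of the operational complexity lives: if the batch and speed layers disagree about it, events are double counted or dropped.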

Hybrid Multi-Cloud

Regulatory or acquisition-driven multi-cloud deployments federate queries (Presto/Trino) or replicate gold datasets across clouds. Governance and cost attribution become harder, so standardize on open formats (Parquet, Iceberg) and portable orchestration.

Architecture Decision Records

Document major choices in ADRs: context, options considered, decision, consequences. Future teams understand why Snowflake was chosen over Redshift, or why Iceberg was adopted.
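ADRs stay consistent when they are generated from a fixed structure. The following is a minimal sketch of the four fields named above rendered as plain text; the class name DecisionRecord, the numbering format, and the example content are all assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DecisionRecord:
    number: int
    title: str
    context: str
    options: list[str]
    decision: str
    consequences: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Render the record as a plain-text document."""
        lines = [f"ADR-{self.number:03d}: {self.title}", ""]
        lines += ["Context", self.context, ""]
        lines += ["Options considered"]
        lines += [f"- {option}" for option in self.options]
        lines += ["", "Decision", self.decision, ""]
        lines += ["Consequences"]
        lines += [f"- {item}" for item in self.consequences]
        return "\n".join(lines)


# Hypothetical example record.
record = DecisionRecord(
    number=7,
    title="Adopt Iceberg for curated lake tables",
    context="Multiple engines need to query the same curated layer.",
    options=["Iceberg", "Delta Lake", "Hudi"],
    decision="Use Iceberg for all new curated tables.",
    consequences=["Existing tables need a migration plan."],
)

print(record.render())
```

Storing records like this next to the platform code keeps the decision history under the same review process as the code it explains.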

Key Takeaways

Lakehouse, hub-and-spoke, mesh-influenced, and event-driven patterns address different needs.
Match pattern to organizational maturity and latency requirements.
Document decisions in ADRs for long-lived platforms.

Reflection

Which reference pattern most closely matches your current architecture? Where are you forcing a pattern that does not fit your maturity?

Multi-Tenant Design and Scale Considerations

Scaling Platform Services Across the Enterprise

As adoption grows, platforms must serve hundreds of engineers and thousands of datasets without becoming bottlenecks. Multi-tenant design isolates workloads while sharing infrastructure efficiently.

Compute Isolation

Separate warehouses/clusters by tenant tier or workload:

Production vs development (non-negotiable)
Interactive vs batch
Optional: dedicated resources for large business units that have the budget for them

Resource monitors and quotas prevent one tenant's runaway query from exhausting shared capacity.
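Many warehouses offer quota enforcement natively, so the following only illustrates the threshold logic behind it. The function name, the action labels, and the 80% notification threshold are assumptions, not any vendor's API.

```python
def quota_action(
    used_credits: float, quota_credits: float, notify_at: float = 0.8
) -> str:
    """Decide what to do for a tenant based on the share of quota consumed."""
    if quota_credits <= 0:
        raise ValueError("quota_credits must be positive")
    ratio = used_credits / quota_credits
    if ratio >= 1.0:
        return "suspend"  # Quota exhausted: stop new work for this tenant.
    if ratio >= notify_at:
        return "notify"  # Approaching the quota: warn the tenant's owners.
    return "ok"


# Hypothetical monthly usage per tenant, as (used, quota) in credits.
usage = {"marketing": (45.0, 100.0), "finance": (85.0, 100.0), "ml": (120.0, 100.0)}

for tenant, (used, quota) in usage.items():
    print(tenant, quota_action(used, quota))
```

A notification threshold below 100% gives tenants time to react before their workloads are suspended.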

Namespace and Schema Strategy

Use consistent naming: {env}_{domain}_{layer} schemas or databases. Automate provisioning via Terraform or platform APIs rather than manual clicks.

Self-serve schema creation with guardrails: templates, automatic RBAC assignment, mandatory catalog registration.
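A provisioning pipeline can enforce the {env}_{domain}_{layer} convention before anything is created. The sketch below assumes a fixed set of environments and layers and a domain name without underscores (so names parse unambiguously); those specific values are illustrative choices, not requirements from the convention itself.

```python
import re

# Assumed allowed values; adjust to your platform's standards.
ALLOWED_ENVS = {"dev", "test", "prod"}
ALLOWED_LAYERS = {"raw", "curated", "marts"}
_DOMAIN_RE = re.compile(r"[a-z][a-z0-9]*")


def schema_name(env: str, domain: str, layer: str) -> str:
    """Build a schema name, rejecting values outside the convention."""
    if env not in ALLOWED_ENVS:
        raise ValueError(f"unknown environment: {env!r}")
    if layer not in ALLOWED_LAYERS:
        raise ValueError(f"unknown layer: {layer!r}")
    if not _DOMAIN_RE.fullmatch(domain):
        raise ValueError(f"invalid domain: {domain!r}")
    return f"{env}_{domain}_{layer}"


def parse_schema_name(name: str) -> tuple[str, str, str]:
    """Split a schema name into (env, domain, layer), validating each part."""
    parts = name.split("_")
    if len(parts) != 3:
        raise ValueError(f"expected env_domain_layer, got {name!r}")
    env, domain, layer = parts
    schema_name(env, domain, layer)  # Reuse the validation above.
    return env, domain, layer


print(schema_name("prod", "finance", "marts"))
print(parse_schema_name("dev_sales_raw"))
```

The same validation can run in CI against Terraform plans, so non-conforming names are rejected before they reach the warehouse.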

Shared Services vs Embedded Logic

Platform teams provide paved roads—dbt project templates, CI pipelines, observability baselines, ingestion SDKs. Domain teams own business logic within those guardrails.

Anti-pattern: the platform team becomes a bottleneck by reviewing every model. Enable self-serve with automated policy checks instead.
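An automated policy check can be as simple as a function that runs in CI over model metadata. The sketch below assumes a hypothetical metadata shape (owner, description, and per-column pii and masking_policy fields); real projects would read this from their own manifest or catalog.

```python
def check_model(model: dict) -> list[str]:
    """Return a list of policy violations for one model's metadata."""
    violations = []
    if not model.get("owner"):
        violations.append("missing owner")
    if not model.get("description"):
        violations.append("missing description")
    for column in model.get("columns", []):
        # PII columns must declare how they are masked.
        if column.get("pii") and not column.get("masking_policy"):
            violations.append(f"PII column {column['name']} has no masking policy")
    return violations


# Hypothetical model metadata.
model = {
    "name": "customer_orders",
    "owner": "sales-analytics",
    "description": "",
    "columns": [
        {"name": "order_id"},
        {"name": "email", "pii": True},
    ],
}

for violation in check_model(model):
    print(f"{model['name']}: {violation}")
```

Failing the build on a non-empty list replaces a manual review step with a rule that every team can see and test locally.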

Performance at Scale

Plan for:

Partition pruning and clustering on large tables
Incremental models over full refreshes
Query queue management and warehouse auto-scaling
Metadata service limits (Glue catalog API rate limits, Hive metastore load)

Load test orchestration during peak schedule windows: at 6 AM, when every DAG starts at once, runs compete for the same workers.
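Before load testing, a quick count of scheduled starts per hour shows where contention is likely. The sketch below makes a simplifying assumption that each DAG needs one worker slot at its start hour; the function name and the schedule data are hypothetical.

```python
from collections import Counter


def overloaded_hours(start_hours: list[int], worker_slots: int) -> dict[int, int]:
    """Return hours where scheduled DAG starts exceed available worker slots."""
    if worker_slots < 1:
        raise ValueError("worker_slots must be at least 1")
    counts = Counter(start_hours)
    return {
        hour: starts
        for hour, starts in sorted(counts.items())
        if starts > worker_slots
    }


# Hypothetical schedule: the start hour (0-23) of each DAG.
schedule = [6, 6, 6, 6, 6, 7, 7, 12]

print(overloaded_hours(schedule, worker_slots=4))  # {6: 5}
```

Hours that appear in the result are candidates for staggered schedules or extra worker capacity.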

Disaster Recovery and Business Continuity

Define recovery point and recovery time objectives (RPO/RTO) for Tier 1 datasets, and back them with cross-region replication, regular restore drills, and documented failover procedures for both warehouse and lake.

Clone and time travel features enable quick recovery from logical errors—practice using them before incidents.
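An RPO is only meaningful if replication lag is checked against it. The following sketch flags datasets whose last successful replication is older than their RPO; the dataset names, timestamps, and function name are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone


def rpo_breaches(
    last_replicated: dict[str, datetime],
    rpo: dict[str, timedelta],
    now: datetime,
) -> list[str]:
    """Return datasets whose replication lag exceeds their RPO."""
    return sorted(
        name
        for name, replicated_at in last_replicated.items()
        if now - replicated_at > rpo[name]
    )


# Hypothetical Tier 1 datasets.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
last_replicated = {
    "orders": datetime(2024, 1, 1, 11, 50, tzinfo=timezone.utc),
    "ledger": datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc),
}
rpo = {"orders": timedelta(minutes=15), "ledger": timedelta(hours=1)}

print(rpo_breaches(last_replicated, rpo, now))  # ['ledger']
```

Running a check like this on a schedule turns the RPO from a document into an alert.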

Key Takeaways

Isolate compute by environment and workload; enforce quotas.
Automate namespace provisioning with standardized conventions.
Provide paved roads for self-serve without platform bottlenecks.
Plan metadata limits, peak load, and disaster recovery before crisis.

Reflection

What breaks first if your user count doubles—compute, orchestration, metadata, or people/process?

Platform Operating Model

Running the Platform as a Product

Technology alone does not make a platform successful. The operating model defines how teams request access, ship datasets, get support, and prioritize roadmap work.

Platform Team Responsibilities

A typical platform team owns:

Shared infrastructure (orchestration, warehouse admin, lake zones)
Standards and templates (dbt project skeleton, CI pipelines)
Observability baselines and incident escalation for platform services
Onboarding documentation and office hours
FinOps reporting and optimization recommendations

Domain teams own business logic, domain marts, and dataset SLAs within standards.

Service Catalog and SLAs

Publish internal service catalog entries:

How to request a new source ingestion (lead time, required info)
How to get a production schema (approval flow)
Platform incident response SLAs
Supported tools and versions

Transparency reduces Slack DMs and ad hoc exceptions.

Intake and Prioritization

Central intake for platform requests (Jira, ServiceNow, GitHub issues). Prioritize by:

Business impact and risk reduction
Cost savings (FinOps)
Unblocking multiple teams (multiplier)
Strategic alignment

Avoid "squeaky wheel" prioritization without visible criteria.
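Visible criteria can be expressed as a weighted score. In the sketch below, the four criteria mirror the list above, while the weights, the 1 to 5 rating scale, and the sample requests are assumptions that each organization would set for itself.

```python
# Assumed weights in percent; they sum to 100.
WEIGHTS = {
    "business_impact": 40,
    "cost_savings": 20,
    "teams_unblocked": 25,
    "strategic_alignment": 15,
}


def priority_score(ratings: dict[str, int]) -> int:
    """Weighted sum of 1-5 ratings across all criteria."""
    for criterion in WEIGHTS:
        if not 1 <= ratings[criterion] <= 5:
            raise ValueError(f"{criterion} rating must be between 1 and 5")
    return sum(WEIGHTS[c] * ratings[c] for c in WEIGHTS)


# Hypothetical intake requests with ratings per criterion.
requests = {
    "New CRM ingestion": {
        "business_impact": 5, "cost_savings": 1,
        "teams_unblocked": 2, "strategic_alignment": 3,
    },
    "Shared CI template": {
        "business_impact": 2, "cost_savings": 3,
        "teams_unblocked": 5, "strategic_alignment": 2,
    },
}

ranked = sorted(requests, key=lambda name: priority_score(requests[name]), reverse=True)
for name in ranked:
    print(name, priority_score(requests[name]))
```

Publishing the weights alongside the ranked backlog lets requesters see why their item sits where it does.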

Enablement Over Gatekeeping

Mature platforms measure success by domains shipping independently. Training, office hours, and an internal community of practice (such as an analytics engineering guild) scale better than review boards for every change.

Gate only where risk warrants: production access, PII exposure, cross-domain sharing.

Metrics for Platform Success

Track:

Time-to-first-query for new hires
Time-to-ingest new source (request to production)
Platform incident frequency and MTTR
Developer satisfaction surveys
Adoption of standards (CI usage, catalog coverage)

Report quarterly to leadership connecting platform work to business outcomes.
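As one example of computing these metrics, MTTR is the mean of resolution time minus open time across incidents. The sketch below assumes incidents are available as (opened, resolved) timestamp pairs; the function name and sample data are hypothetical.

```python
from datetime import datetime
from statistics import mean


def mttr_hours(incidents: list[tuple[datetime, datetime]]) -> float:
    """Mean time to resolve, in hours, over (opened, resolved) pairs."""
    if not incidents:
        raise ValueError("at least one incident is required")
    durations = [
        (resolved - opened).total_seconds() / 3600
        for opened, resolved in incidents
    ]
    return mean(durations)


# Hypothetical incidents for one quarter.
incidents = [
    (datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 11, 30)),  # 1.5 hours
    (datetime(2024, 2, 9, 14, 0), datetime(2024, 2, 9, 14, 30)),  # 0.5 hours
]

print(mttr_hours(incidents))  # 1.0
```

The same pattern applies to lead-time metrics such as time-to-ingest: record the request and production timestamps, then report the mean or median per quarter.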

Key Takeaways

Define clear boundaries between platform and domain ownership.
Publish service catalog, SLAs, and transparent prioritization criteria.
Invest in enablement to scale beyond central review bottlenecks.
Measure platform success with adoption, lead times, and satisfaction—not just uptime.

Reflection

Is your data platform operated as a product with published SLAs, or as a shared infrastructure afterthought? What one operating change would most improve domain team velocity?