Building Enterprise Data Platforms
Architecture patterns for scalable, governed enterprise data platforms.
Platform Capabilities and Maturity
What an Enterprise Data Platform Provides
An enterprise data platform is not a single tool—it is an integrated capability set that enables organizations to ingest, transform, govern, and consume data at scale. Mature platforms reduce time-to-insight while managing cost, risk, and compliance.
Core Capability Areas
Ingestion and orchestration: Connectors, streaming and batch pipelines, workflow scheduling (Airflow, Dagster, Step Functions), and dependency management.
Transformation engine: Warehouse-native SQL, Spark/Glue for lake processing, dbt for analytics engineering patterns.
Storage layers: Data lake (S3/ADLS), warehouse (Snowflake/BigQuery/Redshift), and optionally real-time systems (Kafka for event streams, Pinot for low-latency serving).
Semantic and metrics layer: Canonical definitions for KPIs, dbt Semantic Layer, LookML, or dedicated metrics stores.
Governance and access: Catalog (Alation, Collibra, DataHub), classification, RBAC/ABAC, audit logging, data contracts.
Self-serve interfaces: BI tools, SQL workbenches, notebooks, and increasingly natural language interfaces grounded in governed metadata.
Observability and reliability: Freshness monitoring, incident management, SLAs, and platform health dashboards.
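As an illustration of the observability capability, freshness monitoring comes down to comparing each dataset's last load time with an agreed threshold. The following is a minimal sketch; the dataset names, thresholds, and the `stale_datasets` helper are hypothetical, and a real platform would read load times from its orchestrator or warehouse metadata.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical freshness thresholds per dataset (names and values are illustrative).
FRESHNESS_SLA = {
    "sales.orders": timedelta(hours=1),
    "finance.ledger": timedelta(hours=24),
}

def stale_datasets(last_loaded, now):
    """Return the datasets whose last load is older than their threshold allows."""
    stale = []
    for name, max_age in FRESHNESS_SLA.items():
        loaded_at = last_loaded.get(name)
        # A dataset with no recorded load is treated as stale.
        if loaded_at is None or now - loaded_at > max_age:
            stale.append(name)
    return sorted(stale)

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
last_loaded = {
    "sales.orders": now - timedelta(hours=3),    # older than its 1 hour threshold
    "finance.ledger": now - timedelta(hours=2),  # well within 24 hours
}
print(stale_datasets(last_loaded, now))  # ['sales.orders']
```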
Maturity Stages
Organizations typically evolve through stages:
- Ad hoc: Siloed databases, manual extracts, hero-driven pipelines
- Foundational: Central warehouse, basic ETL, some documentation
- Scaled: Lake + warehouse, orchestration standards, CI/CD for analytics code
- Governed: Catalog adoption, access policies, data contracts, FinOps discipline
- Productized: Internal data products, platform team with SLAs, self-serve with guardrails
Assess honestly where you sit. Buying enterprise catalog software at the Foundational stage typically yields shelfware.
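One way to make that assessment concrete is to rate each capability area against the stage you are aiming for and rank the shortfalls. The sketch below assumes a hypothetical self-assessment (`scores`) and a hypothetical helper `biggest_gaps`; the ratings are illustrative, not a benchmark.

```python
# Maturity stages in the order listed above.
STAGES = ["ad hoc", "foundational", "scaled", "governed", "productized"]

def biggest_gaps(scores, target):
    """Rank capability areas by how many stages they sit below the target stage."""
    target_idx = STAGES.index(target)
    gaps = {area: target_idx - STAGES.index(stage) for area, stage in scores.items()}
    # Keep only areas below target; largest gap first, then alphabetical.
    return sorted(
        ((area, gap) for area, gap in gaps.items() if gap > 0),
        key=lambda item: (-item[1], item[0]),
    )

# Hypothetical self-assessment by capability area.
scores = {
    "ingestion": "scaled",
    "governance": "ad hoc",
    "semantic layer": "foundational",
    "observability": "scaled",
}
print(biggest_gaps(scores, target="scaled"))
# [('governance', 2), ('semantic layer', 1)]
```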
Key Takeaways
- Enterprise platforms integrate ingestion, transformation, storage, semantics, governance, and self-serve.
- Maturity progresses from ad hoc to productized—match investments to stage.
- Gap analysis by capability area clarifies platform roadmap priorities.
Reflection
Which capability is your platform's biggest gap? What stage best describes your organization today?
Reference Architecture Patterns
Patterns That Scale
Enterprise platforms combine patterns rather than inventing from scratch. Understanding reference architectures accelerates design decisions and stakeholder communication.
Lakehouse Pattern
Combines lake storage economics with warehouse-like ACID tables (Iceberg, Delta Lake, Hudi). Transformation engines (Spark, Trino) and warehouses (Snowflake external tables) query the same curated layers.
Best when: diverse workloads, ML on lake data, cost-sensitive storage at scale.
Hub-and-Spoke Warehouse
Central enterprise warehouse (Snowflake/BigQuery) as system of record for analytics with departmental marts as spokes. Ingestion hubs land data centrally; spokes enable domain autonomy within guardrails.
Best when: strong central analytics team, consistent dimensional models, BI-heavy consumption.
Data Mesh Influence
Domain teams own data products with federated governance. Central platform provides infrastructure, standards, and observability tooling; domains publish certified datasets with SLAs.
Best when: large organization, mature domains, and a platform team with capacity for enablement rather than centralization.
This pattern requires genuine domain engineering capacity, not just reorganizational relabeling.
Event-Driven Real-Time Layer
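Kafka or Kinesis streams feed real-time aggregates alongside batch marts. A Lambda architecture merges a speed layer and a batch layer behind a unified serving layer; a Kappa architecture simplifies this by dropping the separate batch layer and treating all processing, including reprocessing, as stream processing.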
Best when: fraud detection, operational dashboards, personalization with low latency requirements.
Adds operational complexity—justify with clear latency SLAs.
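The serving step of a Lambda-style design can be sketched as a merge of two views: counts precomputed by the last batch run, plus counts from events that arrived since. This is a simplified illustration with hypothetical keys and a hypothetical `serve` function; it ignores late data and deduplication, which real systems must handle.

```python
from collections import Counter

def serve(batch_view, speed_view):
    """Merge precomputed batch counts with counts from events since the last batch run."""
    merged = Counter(batch_view)
    merged.update(speed_view)  # Counter.update adds counts rather than replacing them
    return dict(merged)

# Hypothetical transaction counts per card.
batch_view = {"card_123": 40, "card_456": 7}   # up to the last batch cutoff
speed_view = {"card_123": 2, "card_789": 1}    # events after the cutoff
print(serve(batch_view, speed_view))
# {'card_123': 42, 'card_456': 7, 'card_789': 1}
```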
Hybrid Multi-Cloud
Regulatory or acquisition-driven multi-cloud deployments either federate queries (Presto/Trino) or replicate gold datasets across clouds. Governance and cost attribution become harder, so standardize on open formats (Parquet, Iceberg) and portable orchestration.
Architecture Decision Records
Document major choices in ADRs: context, options considered, decision, consequences. Future teams can then see why Snowflake was chosen over Redshift, or why Iceberg was adopted.
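A lightweight way to keep ADRs consistent is to generate them from a fixed structure. The sketch below assumes a hypothetical `ADR` class and numbering scheme, and the record's content is invented purely for illustration.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ADR:
    """One architecture decision record with the four sections named above."""
    number: int
    title: str
    context: str
    options: List[str]
    decision: str
    consequences: str

    def render(self):
        """Render the record as plain text, one section per heading."""
        lines = [f"ADR-{self.number:04d}: {self.title}", "", "Context", self.context, ""]
        lines += ["Options considered"] + [f"- {option}" for option in self.options]
        lines += ["", "Decision", self.decision, "", "Consequences", self.consequences]
        return "\n".join(lines)

# Illustrative content only; not a recommendation.
adr = ADR(
    number=7,
    title="Adopt Iceberg as the lake table format",
    context="Several engines need to read the same curated tables.",
    options=["Iceberg", "Delta Lake", "Hudi"],
    decision="Use Iceberg for curated lake tables.",
    consequences="Existing Parquet tables need a migration plan.",
)
print(adr.render())
```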
Key Takeaways
- Lakehouse, hub-and-spoke, mesh-influenced, and event-driven patterns address different needs.
- Match pattern to organizational maturity and latency requirements.
- Document decisions in ADRs for long-lived platforms.
Reflection
Which reference pattern most closely matches your current architecture? Where are you forcing a pattern that does not fit your maturity?
Multi-Tenant Design and Scale Considerations
Scaling Platform Services Across the Enterprise
As adoption grows, platforms must serve hundreds of engineers and thousands of datasets without becoming bottlenecks. Multi-tenant design isolates workloads while sharing infrastructure efficiently.
Compute Isolation
Separate warehouses/clusters by tenant tier or workload:
- Production vs development (non-negotiable)
- Interactive vs batch
- Optional: dedicated resources for large business units that fund them from their own budget
Resource monitors and quotas prevent one tenant's runaway query from exhausting shared capacity.
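The logic behind a quota check is simple: compare each tenant's consumption with its allowance and act at defined thresholds. The sketch below uses hypothetical tenants, credit figures, and an assumed 80% warning threshold; warehouse-native resource monitors implement the same idea with their own configuration.

```python
def quota_status(usage, quotas, warn_ratio=0.8):
    """Classify each tenant as ok, warn, or suspend based on credits used vs quota."""
    status = {}
    for tenant, quota in quotas.items():
        used = usage.get(tenant, 0.0)
        if used >= quota:
            status[tenant] = "suspend"   # at or over the allowance
        elif used >= warn_ratio * quota:
            status[tenant] = "warn"      # approaching the allowance
        else:
            status[tenant] = "ok"
    return status

# Hypothetical monthly credit quotas and usage per tenant.
quotas = {"marketing": 100, "finance": 200, "ml": 500}
usage = {"marketing": 120, "finance": 170, "ml": 50}
print(quota_status(usage, quotas))
# {'marketing': 'suspend', 'finance': 'warn', 'ml': 'ok'}
```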
Namespace and Schema Strategy
Use consistent naming: {env}_{domain}_{layer} schemas or databases. Automate provisioning via Terraform or platform APIs rather than manual clicks.
Self-serve schema creation with guardrails: templates, automatic RBAC assignment, mandatory catalog registration.
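A naming convention is easiest to enforce when provisioning code both builds and validates names. The sketch below assumes hypothetical environment and layer vocabularies and disallows underscores inside the domain so that names parse unambiguously; adjust these to your own standards.

```python
import re

# Hypothetical vocabularies; replace with your platform's own values.
ENVS = {"dev", "test", "prod"}
LAYERS = {"raw", "staging", "marts"}
DOMAIN_PATTERN = re.compile(r"[a-z][a-z0-9]*")  # no underscores, so splitting is unambiguous

def schema_name(env, domain, layer):
    """Build an {env}_{domain}_{layer} name, rejecting values outside the convention."""
    if env not in ENVS:
        raise ValueError(f"unknown environment: {env!r}")
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer!r}")
    if not DOMAIN_PATTERN.fullmatch(domain):
        raise ValueError(f"invalid domain: {domain!r}")
    return f"{env}_{domain}_{layer}"

def parse_schema_name(name):
    """Return the parts of a conforming name, or None if it does not conform."""
    parts = name.split("_")
    if len(parts) != 3:
        return None
    env, domain, layer = parts
    try:
        schema_name(env, domain, layer)
    except ValueError:
        return None
    return {"env": env, "domain": domain, "layer": layer}

print(schema_name("prod", "finance", "marts"))   # prod_finance_marts
print(parse_schema_name("Finance_Marts"))        # None
```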
Shared Services vs Embedded Logic
Platform teams provide paved roads—dbt project templates, CI pipelines, observability baselines, ingestion SDKs. Domain teams own business logic within those guardrails.
Anti-pattern: platform team becomes bottleneck reviewing every model. Enable self-serve with automated policy checks.
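Automated policy checks can run in CI against model metadata instead of waiting for a human reviewer. The following sketch assumes a hypothetical metadata dictionary per model (`owner`, `description`, `tests`, `contains_pii`, `tags`) and illustrative rules; it is not the schema of any particular tool.

```python
def policy_violations(model):
    """Return a list of human-readable policy violations for one model's metadata."""
    violations = []
    if not model.get("owner"):
        violations.append("missing owner")
    if not model.get("description"):
        violations.append("missing description")
    if not model.get("tests"):
        violations.append("no tests defined")
    # PII models must be tagged so access policies can pick them up.
    if model.get("contains_pii") and "pii" not in model.get("tags", []):
        violations.append("PII model lacks 'pii' tag")
    return violations

# Hypothetical model metadata as it might be collected in a CI job.
model = {
    "name": "customer_orders",
    "owner": "sales-analytics",
    "description": "",
    "tests": ["not_null_order_id"],
    "contains_pii": True,
    "tags": [],
}
print(policy_violations(model))
# ['missing description', "PII model lacks 'pii' tag"]
```

A CI job would fail the build when the list is non-empty, which leaves human review for the cases where risk warrants it.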
Performance at Scale
Plan for:
- Partition pruning and clustering on large tables
- Incremental models over full refreshes
- Query queue management and warehouse auto-scaling
- Metadata service limits (Glue catalog API rate, Hive metastore)
Load test orchestration during peak schedule windows. At 6 AM, when every DAG starts at once, tasks compete for the same workers.
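Before load testing, it helps to find the hours in which scheduled starts exceed available workers. The sketch below assumes hypothetical DAG names, start times given as "HH:MM" strings, and a fixed number of worker slots; it counts starts only and ignores task duration.

```python
from collections import Counter

def starts_per_hour(schedules):
    """Count how many DAGs start in each hour of the day."""
    return Counter(int(start.split(":")[0]) for start in schedules.values())

def hot_hours(schedules, worker_slots):
    """Return the hours in which more DAGs start than there are worker slots."""
    counts = starts_per_hour(schedules)
    return {hour: n for hour, n in sorted(counts.items()) if n > worker_slots}

# Hypothetical daily schedules.
schedules = {
    "orders": "06:00",
    "customers": "06:00",
    "finance": "06:15",
    "exports": "06:30",
    "ml_features": "09:30",
}
print(hot_hours(schedules, worker_slots=3))  # {6: 4}
```

Staggering start times or adding workers for the flagged hours are the usual responses.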
Disaster Recovery and Business Continuity
Define RPO/RTO for Tier 1 datasets, and back them with cross-region replication, regular restore drills, and documented failover procedures for both warehouse and lake.
Clone and time travel features enable quick recovery from logical errors—practice using them before incidents.
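RPO and RTO targets are only useful if they are checked against what replication and restore drills actually achieve. The sketch below assumes hypothetical datasets, targets, and observed values (`replica_lag`, `drill_restore_time`); real figures would come from replication metrics and drill reports.

```python
from datetime import timedelta

# Hypothetical Tier 1 targets; real values come from business continuity planning.
TARGETS = {
    "finance.ledger": {"rpo": timedelta(minutes=15), "rto": timedelta(hours=4)},
    "sales.orders": {"rpo": timedelta(hours=1), "rto": timedelta(hours=8)},
}

def dr_gaps(targets, observed):
    """List (dataset, objective) pairs where observed recovery misses the target."""
    gaps = []
    for dataset, target in targets.items():
        seen = observed[dataset]
        # RPO: the newest recoverable copy must be no older than the tolerated data loss.
        if seen["replica_lag"] > target["rpo"]:
            gaps.append((dataset, "RPO"))
        # RTO: the last restore drill must have finished within the tolerated downtime.
        if seen["drill_restore_time"] > target["rto"]:
            gaps.append((dataset, "RTO"))
    return gaps

observed = {
    "finance.ledger": {"replica_lag": timedelta(minutes=40), "drill_restore_time": timedelta(hours=3)},
    "sales.orders": {"replica_lag": timedelta(minutes=20), "drill_restore_time": timedelta(hours=10)},
}
print(dr_gaps(TARGETS, observed))
# [('finance.ledger', 'RPO'), ('sales.orders', 'RTO')]
```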
Key Takeaways
- Isolate compute by environment and workload; enforce quotas.
- Automate namespace provisioning with standardized conventions.
- Provide paved roads for self-serve without platform bottlenecks.
- Plan metadata limits, peak load, and disaster recovery before crisis.
Reflection
What breaks first if your user count doubles—compute, orchestration, metadata, or people/process?
Platform Operating Model
Running the Platform as a Product
Technology alone does not make a platform successful. Operating model defines how teams request access, ship datasets, get support, and prioritize roadmap work.
Platform Team Responsibilities
A typical platform team owns:
- Shared infrastructure (orchestration, warehouse admin, lake zones)
- Standards and templates (dbt project skeleton, CI pipelines)
- Observability baselines and incident escalation for platform services
- Onboarding documentation and office hours
- FinOps reporting and optimization recommendations
Domain teams own business logic, domain marts, and dataset SLAs within standards.
Service Catalog and SLAs
Publish internal service catalog entries:
- How to request a new source ingestion (lead time, required info)
- How to get a production schema (approval flow)
- Platform incident response SLAs
- Supported tools and versions
Transparency reduces Slack DMs and ad hoc exceptions.
Intake and Prioritization
Route platform requests through a central intake (Jira, ServiceNow, GitHub issues). Prioritize by:
- Business impact and risk reduction
- Cost savings (FinOps)
- Unblocking multiple teams (multiplier)
- Strategic alignment
Avoid "squeaky wheel" prioritization without visible criteria.
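Visible criteria can be as simple as a published weighted score. The sketch below assumes hypothetical weights and 1 to 5 ratings for the four criteria above; the requests and numbers are invented for illustration, and the weights themselves are a decision for your organization.

```python
# Hypothetical weights for the four criteria; they should sum to 1 and be published.
WEIGHTS = {
    "business_impact": 0.4,
    "cost_savings": 0.2,
    "teams_unblocked": 0.3,
    "strategic_alignment": 0.1,
}

def priority_score(ratings):
    """Weighted sum of 1-5 ratings, rounded to two decimals."""
    return round(sum(WEIGHTS[criterion] * ratings[criterion] for criterion in WEIGHTS), 2)

def ranked(requests):
    """Return request names ordered from highest to lowest score."""
    return sorted(requests, key=lambda name: priority_score(requests[name]), reverse=True)

# Hypothetical intake requests.
requests = {
    "warehouse right-sizing": {
        "business_impact": 2, "cost_savings": 5, "teams_unblocked": 1, "strategic_alignment": 2,
    },
    "new CRM ingestion": {
        "business_impact": 4, "cost_savings": 1, "teams_unblocked": 5, "strategic_alignment": 3,
    },
}
print(ranked(requests))  # ['new CRM ingestion', 'warehouse right-sizing']
```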
Enablement Over Gatekeeping
Mature platforms measure success by domains shipping independently. Training, office hours, and an internal community of practice (such as an analytics engineering guild) scale better than review boards for every change.
Gate only where risk warrants: production access, PII exposure, cross-domain sharing.
Metrics for Platform Success
Track:
- Time-to-first-query for new hires
- Time-to-ingest new source (request to production)
- Platform incident frequency and MTTR
- Developer satisfaction surveys
- Adoption of standards (CI usage, catalog coverage)
Report quarterly to leadership, connecting platform work to business outcomes.
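Two of these metrics, time-to-ingest and MTTR, can be computed directly from ticket and incident timestamps. The sketch below uses hypothetical records and helper names; in practice the timestamps would be exported from the intake and incident tools.

```python
from datetime import datetime
from statistics import mean, median

def median_lead_time_days(requests):
    """Median days from ingestion request to production, given (requested, delivered) pairs."""
    return median((delivered - requested).days for requested, delivered in requests)

def mttr_hours(incidents):
    """Mean hours from incident open to resolution, given (opened, resolved) pairs."""
    return mean((resolved - opened).total_seconds() / 3600 for opened, resolved in incidents)

# Hypothetical records for one quarter.
ingest_requests = [
    (datetime(2024, 1, 2), datetime(2024, 1, 12)),   # 10 days
    (datetime(2024, 1, 5), datetime(2024, 1, 26)),   # 21 days
    (datetime(2024, 2, 1), datetime(2024, 2, 8)),    # 7 days
]
incidents = [
    (datetime(2024, 1, 10, 6, 0), datetime(2024, 1, 10, 8, 0)),    # 2 hours
    (datetime(2024, 2, 3, 14, 0), datetime(2024, 2, 3, 18, 0)),    # 4 hours
]
print(median_lead_time_days(ingest_requests))  # 10
print(mttr_hours(incidents))                   # 3.0
```

The median is used for lead time so that one unusually slow request does not dominate the figure.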
Key Takeaways
- Define clear boundaries between platform and domain ownership.
- Publish service catalog, SLAs, and transparent prioritization criteria.
- Invest in enablement to scale beyond central review bottlenecks.
- Measure platform success with adoption, lead times, and satisfaction—not just uptime.
Reflection
Is your data platform operated as a product with published SLAs, or as a shared infrastructure afterthought? What one operating change would most improve domain team velocity?