Designing a Scalable AWS-Native ETL Pipeline for One-to-Many Data Consumers

Business Challenge

An advertising agency in the healthcare domain was tasked with delivering deeper, more accurate insights to its clients than those available through proprietary platforms like Google Analytics. To achieve this, we fed a single source of event logs into multiple downstream consumers, including mapping tables, aggregated analytics, dashboard views, and potentially a machine learning feature store. This one-to-many consumption model introduced significant complexity whenever shared transformation logic changed over time.

This scenario created two major pain points:

Propagation of Changes:

A change in one core mapping (e.g. category definitions) would cascade to many outputs, forcing widespread updates. Historical dashboards, ML features, and tables could all become inconsistent until reprocessed. Recomputing everything for each change was untenable due to time and cost. The pipeline needed a way to handle mutating business logic in a timely manner without reprocessing the entire dataset each time.

Managing Scale and Cost:

The platform handled high data volume and velocity, with continuous event streams generating large daily data partitions. Reprocessing all historical data on every rule change would be extremely expensive and slow. At the same time, the business demanded fresh daily data and the ability to re-run history quickly when needed. The solution had to minimize infrastructure cost (avoiding always-on clusters or redundant work) while still delivering reliable, up-to-date data products. It also needed to be flexible for future growth (new data sources, new categorical schemes, more consumers) without a complete redesign.

In short, the goal was to support a dynamic, multi-consumer data pipeline where upstream changes propagate safely. The pipeline had to ensure that new events are processed incrementally and that, when business rules mutate, only the affected historical partitions are recomputed, all without excessive compute waste or manual intervention.
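
One way to meet the selective-recompute requirement is sketched below: tag each processed partition with a fingerprint of the business rules it was built with, and recompute only the partitions whose fingerprint no longer matches. The bucket name, the _rule_version marker convention, and the helper functions are illustrative assumptions, not the agency's exact implementation.

```python
# Sketch of selective backfill: recompute only partitions whose stored
# rule version differs from the current one. Bucket and marker-object
# names are hypothetical.
import hashlib
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "example-processed-bucket"  # hypothetical bucket name

def current_rule_hash(rules: dict) -> str:
    """Fingerprint the mapping/business rules (e.g. category definitions)."""
    return hashlib.sha256(json.dumps(rules, sort_keys=True).encode()).hexdigest()

def partitions_needing_backfill(partitions: list[str], rule_hash: str) -> list[str]:
    """Return partitions whose stored marker does not match the current rules."""
    stale = []
    for prefix in partitions:  # e.g. "events/year=2024/month=05/day=01/"
        try:
            obj = s3.get_object(Bucket=BUCKET, Key=f"{prefix}_rule_version")
            stored = obj["Body"].read().decode()
        except s3.exceptions.NoSuchKey:
            stored = None
        if stored != rule_hash:
            stale.append(prefix)
    return stale
```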

Solution Architecture

To address these challenges, the engineering team designed a modular AWS-native ETL pipeline with smart orchestration to perform selective backfills:

Automated Transfer of Logs:

In AWS, all relevant logs lived in a specific CloudWatch Logs log group. Amazon Kinesis Data Firehose offered a simple way to copy these CloudWatch Logs into S3 and partition the saved files by year, month, and day.
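
A minimal sketch of this wiring, assuming a Direct PUT Firehose delivery stream and a CloudWatch Logs subscription filter; all resource names and ARNs below are placeholders. (In practice the CloudWatch Logs envelope is gzip-compressed, so a small transform step may also be needed before the raw records are directly usable in S3.)

```python
# Wire a CloudWatch Logs group to a Kinesis Data Firehose delivery stream
# that lands files in S3 under year=/month=/day= prefixes.
import boto3

logs = boto3.client("logs")
firehose = boto3.client("firehose")

firehose.create_delivery_stream(
    DeliveryStreamName="event-logs-to-s3",
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
        "BucketARN": "arn:aws:s3:::example-raw-bucket",
        # Hive-style date partitioning of the raw files
        "Prefix": "raw/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/",
        "ErrorOutputPrefix": "raw-errors/!{firehose:error-output-type}/",
        "CompressionFormat": "GZIP",
    },
)

logs.put_subscription_filter(
    logGroupName="/app/event-logs",
    filterName="to-firehose",
    filterPattern="",  # forward every log event
    destinationArn="arn:aws:firehose:us-east-1:123456789012:deliverystream/event-logs-to-s3",
    roleArn="arn:aws:iam::123456789012:role/cwl-to-firehose-role",
)
```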

Automated Orchestration with Change Detection:

A scheduled AWS EventBridge rule invokes an AWS Step Functions state machine (the pipeline orchestrator) on a regular cadence (e.g. daily or hourly).
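
A hedged sketch of such a trigger using boto3; the rule name, cron expression, ARNs, and input payload are assumptions for illustration.

```python
# EventBridge schedule that starts the Step Functions orchestrator daily.
import boto3

events = boto3.client("events")

events.put_rule(
    Name="daily-etl-trigger",
    ScheduleExpression="cron(0 6 * * ? *)",  # 06:00 UTC every day
    State="ENABLED",
)

events.put_targets(
    Rule="daily-etl-trigger",
    Targets=[{
        "Id": "etl-orchestrator",
        "Arn": "arn:aws:states:us-east-1:123456789012:stateMachine:etl-orchestrator",
        "RoleArn": "arn:aws:iam::123456789012:role/eventbridge-invoke-sfn",
        # Input the state machine can use to decide what to (re)process
        "Input": '{"mode": "incremental"}',
    }],
)
```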

Ephemeral EMR Clusters for ETL Jobs:

Data processing is done with Apache Spark on Amazon EMR on EC2. The architecture spins up transient EMR clusters on demand for each run, rather than maintaining a persistent cluster.
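
The sketch below shows one way to launch such a transient cluster with boto3: KeepJobFlowAliveWhenNoSteps=False makes the cluster terminate itself once its Spark step finishes. Instance types, the EMR release, and S3 paths are illustrative assumptions.

```python
# Launch a transient EMR cluster that runs one Spark step and shuts down.
import boto3

emr = boto3.client("emr")

emr.run_job_flow(
    Name="daily-etl",
    ReleaseLabel="emr-6.15.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,  # cluster terminates after the step
        "TerminationProtected": False,
    },
    Steps=[{
        "Name": "spark-etl",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster",
                     "s3://example-artifacts/jobs/daily_etl.py"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    LogUri="s3://example-logs/emr/",
)
```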

Partitioned Data Lake Storage on S3:

All raw and processed data lives in Amazon S3, organized as a partitioned data lake. Raw events flow in via Amazon Kinesis Data Firehose, landing as compressed files in an S3 “raw” bucket, partitioned by date (year=YYYY/month=MM/day=DD).
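
Because the raw files use Hive-style year=/month=/day= prefixes, Spark can prune down to a single day without scanning the rest of the lake. A minimal read sketch, with a placeholder bucket and date:

```python
# Read only one day's raw partition; Spark discovers year/month/day as
# partition columns from the Hive-style layout and prunes accordingly.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-raw-partition").getOrCreate()

raw = (
    spark.read
    .json("s3://example-raw-bucket/raw/")
    .where("year = '2024' AND month = '05' AND day = '01'")  # partition pruning
)
raw.printSchema()
```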

Incremental Processing and Checkpointing:

The Spark ETL jobs are written with built-in awareness of which data to process. Each run applies change data capture principles to avoid duplicating work.
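
A simple form of that awareness is a checkpoint manifest: a small S3 object listing the day-partitions already processed, diffed against the raw bucket on each run. The sketch below is illustrative; bucket names, keys, and prefix depth are assumptions.

```python
# Determine which raw day-partitions are new since the last run.
import json
import boto3

s3 = boto3.client("s3")
RAW_BUCKET = "example-raw-bucket"
CHECKPOINT_BUCKET = "example-etl-state"
CHECKPOINT_KEY = "checkpoints/processed_partitions.json"

def load_processed() -> set[str]:
    try:
        body = s3.get_object(Bucket=CHECKPOINT_BUCKET, Key=CHECKPOINT_KEY)["Body"]
        return set(json.loads(body.read()))
    except s3.exceptions.NoSuchKey:
        return set()

def list_raw_partitions() -> set[str]:
    """Collect day-level prefixes like raw/year=2024/month=05/day=01/."""
    parts = set()
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=RAW_BUCKET, Prefix="raw/"):
        for obj in page.get("Contents", []):
            parts.add("/".join(obj["Key"].split("/")[:4]) + "/")
    return parts

def save_processed(processed: set[str]) -> None:
    s3.put_object(Bucket=CHECKPOINT_BUCKET, Key=CHECKPOINT_KEY,
                  Body=json.dumps(sorted(processed)))

new_partitions = list_raw_partitions() - load_processed()
# ...submit the Spark job for new_partitions only, then update the checkpoint.
```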

Atomic Swap for Data Updates:

When reprocessing historical data, the pipeline uses an atomic overwrite strategy to ensure consistency. The result is idempotent, safe reprocessing. Reruns pick up where they left off, and historical data updates do not corrupt current production data.
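
One common way to approximate this behaviour on S3 is Spark's dynamic partition overwrite: write the recomputed data for the affected days, and only those partitions are replaced in the production prefix while untouched history stays intact. The sketch below assumes Parquet output and placeholder paths; it illustrates the pattern rather than the exact job.

```python
# Replace only the recomputed partitions of the production dataset.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("backfill-swap")
    # Only partitions present in the new DataFrame are overwritten.
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
    .getOrCreate()
)

recomputed = spark.read.parquet("s3://example-staging-bucket/backfill/2024-05-01/")

(
    recomputed.write
    .mode("overwrite")
    .partitionBy("year", "month", "day")
    .parquet("s3://example-processed-bucket/events/")
)
```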

Glue Data Catalog and Athena Integration:

To support multiple consumption patterns, the processed data is made queryable via the AWS Glue Data Catalog. After the daily EMR job finishes writing new partitions, an AWS Glue Crawler (or a direct catalog update) is triggered to update table metadata.
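
Either path can be scripted as a final state in the orchestrator. The sketch below shows both options with boto3; the crawler, database, table, and bucket names are placeholders.

```python
# Refresh the Glue Data Catalog after new partitions are written.
import boto3

glue = boto3.client("glue")
athena = boto3.client("athena")

# Option 1: let the crawler discover the new partitions.
glue.start_crawler(Name="processed-events-crawler")

# Option 2: register today's partition directly via Athena DDL.
athena.start_query_execution(
    QueryString=(
        "ALTER TABLE analytics.events ADD IF NOT EXISTS "
        "PARTITION (year='2024', month='05', day='01') "
        "LOCATION 's3://example-processed-bucket/events/year=2024/month=05/day=01/'"
    ),
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
```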

Technology & Tools in Use

The solution makes heavy use of AWS services and big data tools, configured to work in unison as an efficient ETL pipeline:

  • Amazon S3
  • Amazon Kinesis Data Firehose
  • AWS Step Functions
  • AWS Lambda
  • Amazon EMR on EC2
  • Apache Spark (on EMR)
  • AWS Glue Data Catalog and Crawler
  • Amazon Athena / Redshift Spectrum

Business Outcomes

Performance Insights

Implementing this AWS-native, modular ETL pipeline yielded significant benefits for both the business and the engineering team. The key business outcomes were:

  • Reduced Infrastructure Cost
  • Faster Time-to-Insight
  • Less Developer and Analyst Toil
  • Improved Data Consistency and Trust
  • Scalability and Flexibility for Expansion

Overall, the company was able to achieve timely, accurate analytics with lower cost and effort. Feature updates that used to require reloading entire databases and significant manual SQL work are now handled by the pipeline with a targeted recompute that is often 10–20 times more efficient. The data engineering team can support more use cases and data consumers without linear growth in effort. This case study demonstrates how a thoughtful architecture leveraging AWS services can solve the classic challenge of one dataset feeding many needs, especially when the logic is a moving target.
