Business Challenge
An advertising agency in the healthcare domain was tasked with delivering deeper, more accurate insights to its clients than those available through proprietary platforms such as Google Analytics. To achieve this, the team fed a single source of event logs into multiple downstream consumers, including mapping tables, aggregated analytics, dashboard views, and potentially a machine learning feature store. This one-to-many consumption model introduced significant complexity whenever the shared transformation logic changed over time.
This scenario created two major pain points:
- Propagation of Changes: whenever the shared transformation logic changed, every downstream consumer of the event logs had to be updated and the affected historical data recomputed.
- Managing Scale and Cost: naively reprocessing the entire event history for every rule change wasted compute and drove up infrastructure cost as data volumes grew.
In short, the goal was to support a dynamic, multi-consumer data pipeline in which upstream changes propagate safely: new events are processed incrementally, and when business rules change, only the affected historical partitions are recomputed, all without excessive compute waste or manual intervention.
Solution Architecture
To address these challenges, the engineering team designed a modular AWS-native ETL pipeline with smart orchestration to perform selective backfills:
- Automated Transfer of Logs: raw event logs stream continuously into Amazon S3 via Kinesis Data Firehose, with no manual copying.
- Automated Orchestration with Change Detection: an AWS Step Functions workflow, backed by Lambda functions, detects whether the shared transformation logic has changed and chooses between a routine incremental run and a selective backfill of the affected partitions (a change-detection sketch follows this list).
- Ephemeral EMR Clusters for ETL Jobs: Spark jobs run on transient EMR clusters that are launched per run and terminate once their steps finish, so compute is paid for only while work is being done (see the cluster-launch sketch below).
- Partitioned Data Lake Storage on S3: curated outputs are stored in S3 as partitioned datasets, so individual partitions can be recomputed in isolation.
- Incremental Processing and Checkpointing: each run picks up only the events that arrived after a recorded watermark (see the PySpark sketch below).
- Atomic Swap for Data Updates: recomputed data is written to a staging location and swapped in as a single operation, so consumers never read a partially updated dataset (a catalog-level sketch follows below).
- Glue Data Catalog and Athena Integration: the Glue Data Catalog, kept current by a crawler, exposes the curated datasets to Amazon Athena and Redshift Spectrum for querying.
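To make the change-detection step concrete, below is a minimal sketch of a Lambda handler that a Step Functions workflow could branch on. It hashes the current transformation-rules file and compares it with the hash recorded after the previous run; the bucket, object keys, and rules-file format are illustrative assumptions, not the agency's actual configuration.

```python
# Minimal sketch: Lambda change detection for the Step Functions workflow.
# Bucket names, keys, and the rules-file format are hypothetical.
import hashlib
import json

import boto3

s3 = boto3.client("s3")

BUCKET = "agency-pipeline-metadata"
RULES_KEY = "config/transformation_rules.json"
LAST_HASH_KEY = "state/last_rules_hash.txt"


def handler(event, context):
    rules = s3.get_object(Bucket=BUCKET, Key=RULES_KEY)["Body"].read()
    current_hash = hashlib.sha256(rules).hexdigest()

    try:
        last_hash = (
            s3.get_object(Bucket=BUCKET, Key=LAST_HASH_KEY)["Body"]
            .read()
            .decode()
            .strip()
        )
    except s3.exceptions.NoSuchKey:
        last_hash = ""

    if current_hash == last_hash:
        # No logic change: only newly arrived events need processing.
        return {"mode": "incremental", "backfill_partitions": []}

    # Logic changed: the (assumed) rules file lists the partitions each rule
    # touches, so only those historical partitions are queued for recompute.
    affected = json.loads(rules).get("affected_partitions", [])
    s3.put_object(Bucket=BUCKET, Key=LAST_HASH_KEY, Body=current_hash)
    return {"mode": "selective_backfill", "backfill_partitions": affected}
```

Step Functions can then route to a plain incremental run when `mode` is `incremental`, or fan backfill jobs out over `backfill_partitions` otherwise.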
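The ephemeral-cluster step can be realized with a transient EMR cluster that runs a single Spark step and then shuts itself down. The boto3 sketch below shows one way this might look; the release label, instance types, roles, and job script path are placeholders, not the production values.

```python
# Minimal sketch: launch a transient EMR cluster that runs one Spark step and
# auto-terminates. All names, sizes, and paths are illustrative placeholders.
import boto3

emr = boto3.client("emr")

response = emr.run_job_flow(
    Name="etl-events-incremental",
    ReleaseLabel="emr-6.15.0",
    Applications=[{"Name": "Spark"}],
    LogUri="s3://agency-pipeline-metadata/emr-logs/",
    Instances={
        "InstanceGroups": [
            {"Name": "primary", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "core", "InstanceRole": "CORE", "Market": "SPOT",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        # No long-running cluster: terminate once all steps finish.
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    Steps=[
        {
            "Name": "incremental-etl",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "spark-submit", "--deploy-mode", "cluster",
                    "s3://agency-pipeline-code/jobs/incremental_etl.py",
                ],
            },
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Started cluster:", response["JobFlowId"])
```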
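For incremental processing and checkpointing, one workable approach is a simple watermark stored in S3. The PySpark sketch below assumes Firehose lands JSON logs with an `event_timestamp` field and that curated data is partitioned by event date; bucket names, keys, and field names are hypothetical.

```python
# Minimal sketch: incremental load of new event-log partitions using a simple
# date-watermark checkpoint. Buckets, keys, and fields are illustrative.
import json

import boto3
from pyspark.sql import SparkSession, functions as F

RAW_BUCKET = "agency-raw-event-logs"        # hypothetical Firehose destination
LAKE_BUCKET = "agency-curated-data-lake"    # hypothetical curated bucket
CHECKPOINT_KEY = "checkpoints/events_watermark.json"

s3 = boto3.client("s3")
spark = SparkSession.builder.appName("incremental-events").getOrCreate()
# Idempotent per-partition writes: only the partitions produced by this run
# are replaced; everything else in the dataset is left untouched.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")


def read_watermark() -> str:
    """Return the last processed event date, or an epoch default on first run."""
    try:
        obj = s3.get_object(Bucket=LAKE_BUCKET, Key=CHECKPOINT_KEY)
        return json.loads(obj["Body"].read())["last_event_date"]
    except s3.exceptions.NoSuchKey:
        return "1970-01-01"


def write_watermark(last_date: str) -> None:
    s3.put_object(
        Bucket=LAKE_BUCKET,
        Key=CHECKPOINT_KEY,
        Body=json.dumps({"last_event_date": last_date}),
    )


watermark = read_watermark()

# In production the read would be pruned by prefix; a full read-and-filter
# keeps the sketch short.
events = (
    spark.read.json(f"s3://{RAW_BUCKET}/events/")
    .withColumn("event_date", F.to_date("event_timestamp"))
    .filter(F.col("event_date") > F.to_date(F.lit(watermark)))
)

# Write only the newly arrived daily partitions into the curated zone.
(
    events.write.mode("overwrite")
    .partitionBy("event_date")
    .parquet(f"s3://{LAKE_BUCKET}/curated/events/")
)

latest = events.agg(F.max("event_date")).first()[0]
if latest is not None:
    write_watermark(str(latest))
```

Because the write uses dynamic partition overwrite, re-running a day, or selectively backfilling it after a rule change, replaces only that partition rather than the whole dataset.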
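One way to implement the atomic swap is at the catalog level: write the recomputed dataset under a fresh versioned prefix, then repoint the Glue table at it in a single UpdateTable call, so Athena readers switch from old to new data at once. The sketch below is a simplified illustration with hypothetical database, table, and prefix names; for partitioned tables, partition locations would also need to be refreshed afterwards (for example, by the Glue crawler).

```python
# Minimal sketch of a catalog-level "atomic swap": the recomputed dataset is
# written under a new versioned prefix, then the Glue table is repointed in a
# single UpdateTable call so readers never see partial data.
# Database, table, and prefix names are illustrative assumptions.
import boto3

glue = boto3.client("glue")

DATABASE = "analytics"
TABLE = "events_aggregated"
NEW_LOCATION = "s3://agency-curated-data-lake/aggregates/events/v=2024-06-01/"


def swap_table_location(database: str, table: str, new_location: str) -> None:
    """Repoint a Glue table at a freshly written S3 prefix."""
    current = glue.get_table(DatabaseName=database, Name=table)["Table"]

    # Glue rejects read-only fields on update, so rebuild a minimal TableInput.
    table_input = {
        "Name": current["Name"],
        "StorageDescriptor": {
            **current["StorageDescriptor"],
            "Location": new_location,
        },
        "PartitionKeys": current.get("PartitionKeys", []),
        "TableType": current.get("TableType", "EXTERNAL_TABLE"),
        "Parameters": current.get("Parameters", {}),
    }
    glue.update_table(DatabaseName=database, TableInput=table_input)


swap_table_location(DATABASE, TABLE, NEW_LOCATION)
```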
Technology & Tools in Use
The solution makes heavy use of AWS services and big data tools, configured to work in unison as an efficient ETL pipeline:
- Amazon S3
- Amazon Kinesis Data Firehose
- AWS Step Functions
- AWS Lambda
- Amazon EMR on EC2
- Apache Spark (on EMR)
- AWS Glue Data Catalog and Crawler
- Amazon Athena / Redshift Spectrum
Business Outcomes
Implementing this AWS-native, modular ETL pipeline yielded significant benefits for both the business and the engineering team:
- Reduced Infrastructure Cost: ephemeral EMR clusters and targeted recomputes mean compute is paid for only when and where it is needed.
- Faster Time-to-Insight: incremental processing keeps dashboards and analytics current without waiting for full reloads.
- Less Developer and Analyst Toil: backfills that previously required manual SQL work are now triggered and scoped automatically by the pipeline.
- Improved Data Consistency and Trust: atomic swaps ensure every consumer sees either the old data or the fully recomputed data, never a partial state.
- Scalability and Flexibility for Expansion: new downstream consumers, such as a machine learning feature store, can be added without reworking the pipeline.
Overall, the company achieved timely, accurate analytics at lower cost and effort. Feature updates that used to require reloading entire databases and significant manual SQL work are now handled by the pipeline with a targeted recompute, often 10–20 times more efficient than a full reload. The data engineering team can support more use cases and data consumers without a linear growth in effort. This case study demonstrates how a thoughtful architecture built on AWS services can solve the classic challenge of one dataset feeding many needs, especially when the transformation logic is a moving target.
