Automated End-to-End Machine Learning Pipelines with Kubeflow

Problem Statement

A fraud detection provider faced data residency compliance risks, with clients and partners distributed across varying geo-locations. The business challenge was to create an automated, end-to-end machine learning pipeline, triggered by data uploads, that is reliable, scalable, and cost-effective. The solution needed to decouple completely from Databricks, respect data location regulations, and maintain inference quality and cross-region insight sharing without compromising user anonymity.

Solution

NStarX developed a robust data engineering solution using Kubeflow Pipelines to automate the ETL process, triggered automatically by data uploads. The solution integrated AWS S3 for scalable file storage and a DynamoDB-powered API for rapid retrieval of recently processed unique data points. To ensure scalability and efficiency, Spark on Kubernetes was employed, providing a flexible architecture capable of adapting to changes in data size and observation windows without compromising performance. The solution respected data residency regulations and user anonymity while enabling consistent inference results and reliable reporting. The new architecture ensured seamless data flow, transforming raw data into actionable insights.
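The upload-triggered flow above can be sketched as a small handler that maps an S3 ObjectCreated event to pipeline run parameters. This is a minimal illustration, not the actual NStarX implementation: the bucket key layout (`<geo-region>/<client-id>/<filename>`) and parameter names are assumptions.

```python
# Minimal sketch of deriving pipeline run parameters from an S3 upload event.
# The event shape follows the standard S3 notification format; the key layout
# and parameter names are illustrative assumptions.

def pipeline_params_from_s3_event(event: dict) -> list:
    """Map S3 ObjectCreated records to Kubeflow pipeline run parameters."""
    runs = []
    for record in event.get("Records", []):
        if not record.get("eventName", "").startswith("ObjectCreated"):
            continue  # ignore deletes and other event types
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # Hypothetical layout: <geo-region>/<client-id>/<filename>
        region, client_id, _ = key.split("/", 2)
        runs.append({
            "input_uri": f"s3://{bucket}/{key}",
            "geo_region": region,   # keeps processing in-region for residency
            "client_id": client_id,
        })
    return runs
```

In a setup like this, an S3 event notification bridge (for example, a Lambda or an Argo Events sensor) would pass these parameters to the Kubeflow Pipelines client to start a run.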

Technologies & Tools

NStarX leveraged a range of cutting-edge tools and technologies, including Kubeflow Pipelines (KFP) for workflow automation, Distributed Apache Spark on Kubernetes for scalable data processing, and AWS S3 for file storage. DynamoDB enabled quick access to recent inference results, while Grafana and Prometheus were employed for monitoring machine resource usage and ensuring system reliability.

Business Outcomes

Scalability and Reliability

The new architecture seamlessly scaled with increasing data sizes and observation windows, ensuring consistent performance without downtime.
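One way such scaling can work is a sizing heuristic that grows the Spark executor count with input volume and the observation window. The constants and formula below are illustrative assumptions, not NStarX's actual autoscaling policy.

```python
# Illustrative executor-sizing heuristic for Spark on Kubernetes.
# Constants are assumptions chosen for the sketch.
import math

GB_PER_EXECUTOR = 8    # assumed comfortable data load per executor
MAX_EXECUTORS = 50     # assumed cluster ceiling

def executor_count(input_gb: float, window_days: int) -> int:
    """Scale executors with data volume; longer windows reprocess more history."""
    effective_gb = input_gb * max(1, window_days)
    wanted = math.ceil(effective_gb / GB_PER_EXECUTOR)
    return max(2, min(wanted, MAX_EXECUTORS))
```

A policy like this keeps small daily uploads cheap while letting a wide observation window request proportionally more executors, up to a fixed ceiling.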

Compliance Assurance

Adherence to data residency regulations and user anonymity ensured regulatory compliance across geo-locations.

Automated Inference

Quick access to inference results through the DynamoDB-powered API enabled near-instant insights, while respecting data residency laws and user anonymity.
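A result store like this can be sketched as a DynamoDB item layout keyed for fast "recent results" lookups. The table schema, key names, anonymization scheme, and TTL window below are all assumptions for illustration; a real client would write the item via boto3, which is omitted to keep the sketch self-contained.

```python
# Sketch of a DynamoDB item layout for recent inference results.
# Key names, the hashing scheme, and the retention window are assumptions.
import hashlib
import time

RESULT_TTL_SECONDS = 7 * 24 * 3600  # assumed retention window

def inference_item(geo_region: str, record_id: str, score: float, now=None) -> dict:
    """Build a region-scoped, anonymized item for 'recent results' queries."""
    ts = int(now if now is not None else time.time())
    # Anonymize the record identifier before it leaves the region.
    anon_id = hashlib.sha256(record_id.encode()).hexdigest()[:16]
    return {
        "pk": f"{geo_region}#{anon_id}",  # partition key: region-scoped
        "sk": ts,                         # sort key: supports newest-first queries
        "fraud_score": score,
        "ttl": ts + RESULT_TTL_SECONDS,   # DynamoDB TTL auto-expires old results
    }
```

Scoping the partition key to the geo-region keeps lookups within the region that produced the result, and hashing the record identifier means the raw ID never reaches the shared store.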

Cost Optimization

By decoupling from Databricks and leveraging Spark on Kubernetes, the solution achieved equivalent data processing speed and reliability at a fraction of the platform cost.