Problem Statement
A fraud detection provider faced data residency compliance risks, with clients and partners distributed across multiple geographies. The business challenge was to build an automated, end-to-end machine learning pipeline, triggered by data uploads, that was reliable, scalable, and cost-effective. The solution needed to decouple fully from Databricks, respect data-location regulations, and maintain inference quality while sharing insights across regions, all without compromising user anonymity.
Solution
NStarX developed a robust data engineering solution using Kubeflow Pipelines to automate the ETL process, triggered automatically by data uploads. The solution integrated AWS S3 for scalable file storage and a DynamoDB-powered API for rapid retrieval of recently processed unique data points. To ensure scalability and efficiency, Spark on Kubernetes was employed, providing a flexible architecture capable of adapting to changes in data size and observation windows without compromising performance. The solution respected data residency regulations and user anonymity while enabling consistent inference results and reliable reporting. The resulting architecture ensured seamless data flow, transforming raw data into actionable insights.
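The upload-triggered entry point can be sketched as an event handler that parses the S3 "ObjectCreated" notification and kicks off a pipeline run. This is a minimal illustration, not the production code: the pipeline name, the `input_uri` parameter, and the `launch_pipeline_run` helper are assumptions standing in for the real Kubeflow Pipelines client call.

```python
def parse_s3_upload_event(event: dict) -> list:
    """Extract (bucket, key) pairs from an S3 event notification.

    S3 "ObjectCreated" notifications arrive as a JSON document with a
    "Records" list; each record names the bucket and the object key.
    """
    uploads = []
    for record in event.get("Records", []):
        s3 = record.get("s3", {})
        bucket = s3.get("bucket", {}).get("name")
        key = s3.get("object", {}).get("key")
        if bucket and key:
            uploads.append((bucket, key))
    return uploads


def launch_pipeline_run(pipeline: str, params: dict) -> None:
    """Placeholder for illustration: in production this would wrap the
    Kubeflow Pipelines SDK client (e.g. kfp.Client(...).create_run_...)."""


def handle_upload(event: dict) -> None:
    """Entry point: for each uploaded file, start one ETL pipeline run."""
    for bucket, key in parse_s3_upload_event(event):
        # "fraud-etl" and "input_uri" are assumed names, not the real config.
        launch_pipeline_run(
            pipeline="fraud-etl",
            params={"input_uri": "s3://{}/{}".format(bucket, key)},
        )
```

Parsing and launching are kept separate so the notification-handling logic can be unit-tested without any cluster access.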
Technologies & Tools
NStarX leveraged a range of cutting-edge tools and technologies, including Kubeflow Pipelines (KFP) for workflow automation, Distributed Apache Spark on Kubernetes for scalable data processing, and AWS S3 for file storage. DynamoDB enabled quick access to recent inference results, while Grafana and Prometheus were employed for monitoring machine resource usage and ensuring system reliability.
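The "quick access to recent inference results" pattern can be sketched as a DynamoDB Query against a table keyed by client and inference timestamp, so recency lookups are a single indexed Query rather than a full Scan. The table and attribute names below (`inference_results`, `client_id`, `inferred_at`) are illustrative assumptions, not the provider's actual schema.

```python
def recent_inferences_query(client_id: str, since_epoch: int, limit: int = 100) -> dict:
    """Build the keyword arguments for a boto3 DynamoDB Query that fetches
    the most recent inference results for one client.

    Assumes a table with partition key `client_id` and numeric sort key
    `inferred_at` (epoch seconds), both hypothetical names.
    """
    return {
        "TableName": "inference_results",  # assumed table name
        "KeyConditionExpression": "client_id = :cid AND inferred_at >= :since",
        "ExpressionAttributeValues": {
            ":cid": {"S": client_id},
            ":since": {"N": str(since_epoch)},
        },
        "ScanIndexForward": False,  # sort descending: newest results first
        "Limit": limit,
    }


# In production the dict would be passed straight to the AWS client:
#   boto3.client("dynamodb").query(**recent_inferences_query("acme", 1_700_000_000))
```

Building the query parameters as a plain dict keeps the access pattern testable without AWS credentials or network calls.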