Position Overview:
We are seeking a highly skilled and motivated Data Engineer with a strong background in Machine Learning (ML) projects and extensive experience working with Databricks and Apache Spark. As a key member of our data team, you will play a crucial role in designing, implementing, and optimizing data pipelines, frameworks, and infrastructure to support ML initiatives. The ideal candidate should possess a deep understanding of data engineering concepts, excellent programming skills, and a passion for leveraging cutting-edge technologies to drive innovative ML solutions.
Responsibilities:
- Data Pipeline Development: Design, develop, and maintain robust, scalable, and efficient data pipelines using Databricks and Spark to extract, transform, and load both structured and unstructured data from various sources.
- ML Infrastructure: Collaborate with data scientists and ML engineers to build and maintain the infrastructure required for ML model training, evaluation, and deployment. This includes setting up distributed computing environments, managing clusters, and optimizing resource allocation.
- Data Quality and Integrity: Ensure data integrity and quality throughout the data lifecycle by implementing data validation checks, error handling mechanisms, and data monitoring processes.
- Performance Optimization: Identify performance bottlenecks in data pipelines and Spark jobs and work on optimizations to enhance processing speed and efficiency.
- Data Governance: Implement security protocols, access controls, and data governance policies to maintain data privacy and compliance with industry standards and regulations.
- Collaboration: Work closely with cross-functional teams, including data scientists, software engineers, and business analysts, to understand ML requirements, provide data engineering support, and deliver successful ML projects.
- Automation and Scalability: Develop automated solutions for repetitive tasks, and design data engineering systems that can scale seamlessly to handle large volumes of data.
- Troubleshooting and Support: Monitor data pipelines and Spark jobs, proactively identify issues, and provide timely support to maintain system availability and performance.
- Continuous Learning: Stay up-to-date with the latest advancements in data engineering, ML, and cloud technologies, and actively apply this knowledge to improve existing systems and processes.
Qualifications:
- Bachelor’s or higher degree in Computer Science, Engineering, or a related field.
- Minimum of 4 years of relevant experience as a Data Engineer working on ML projects with a strong focus on Databricks and Spark.
- Proven hands-on experience with Databricks and Apache Spark for data processing and ML workloads in a cloud environment.
- Solid understanding of data engineering principles, data modeling, ETL processes, and data warehousing concepts.
- Proficiency in programming languages like Python, Scala, or Java, and experience with SQL for data manipulation and querying.
- Familiarity with cloud platforms such as AWS, Azure, or Google Cloud Platform, and understanding of cloud-based data storage and computing services.
- Prior experience in building and managing ML infrastructure, including ML frameworks (e.g., TensorFlow, PyTorch), model versioning, and deployment.
- Familiarity with other big data technologies like Hadoop, Hive, or Kafka is a plus.
- Strong analytical and problem-solving abilities with a keen eye for detail and a proactive approach to identifying and resolving issues.
- Excellent communication and teamwork skills to collaborate effectively with diverse teams and convey complex technical concepts to non-technical stakeholders.
To apply for this job, email your details to recruiting@nstarxinc.com