MLOps Engineer (Training Scalability & Workflow Optimization)
We are seeking an MLOps Engineer to lead the scaling of our machine learning training pipelines and to keep our end-to-end ML workflows robust and efficient.
Overview
This role focuses on leveraging Flyte, Kubernetes (GPU optimization), Docker, and distributed training frameworks such as Ray to optimize and streamline our ML infrastructure.
Responsibilities
- Workflow Orchestration: Develop and maintain ML workflows using Flyte to manage complex ML pipelines for training, testing, and deployment.
- Training Scalability: Architect and scale large-scale ML training systems on GPU-backed Kubernetes clusters, including auto-scaling and performance tuning for multi-node/multi-GPU workloads.
- Distributed Computing: Implement distributed model training pipelines using frameworks like Ray for parallelization and resource efficiency.
- Containerization: Design, build, and optimize Docker images for ML workloads with a focus on reproducibility and security.
- Resource Optimization: Debug and optimize GPU utilization, memory, and compute bottlenecks during training and inference phases.
- Monitoring & Maintenance: Integrate monitoring for ML jobs, track resource consumption, and enforce cost-efficient resource utilization.
- Collaboration: Work closely with data scientists and ML engineers to productize and scale ML experiments.
Qualifications
- Strong proficiency with Kubernetes (GPU scheduling, Helm, cluster autoscaling).
- Hands-on experience with Flyte or similar workflow orchestration tools (Airflow, Prefect).
- Deep knowledge of distributed ML training (e.g., PyTorch DDP, Ray, Horovod).
- Expertise in Docker and container lifecycle management.
- Solid understanding of the GPU hardware/software stack (CUDA, NCCL).
- Familiarity with CI/CD for ML (MLOps pipelines using tools like GitHub Actions, ArgoCD).
- Bonus: Familiarity with observability tools for ML systems (Prometheus, Grafana).