Ready to lead innovation at the intersection of platforms and artificial intelligence?
Join a pioneering technology company driving advancements in cloud, AI, and data-driven solutions across global markets. The organization is recognized for fostering innovation, scalability, and collaboration through cutting-edge platforms that empower enterprises to evolve intelligently.
The team is hiring a Head of Platform / AI Cluster Management to oversee the strategic development, integration, and optimization of AI and platform initiatives. The role will focus on leading cross-functional teams, enhancing performance and scalability, and aligning technology strategy with long-term business goals.
Shape the future of intelligent platforms and transformative innovation. Apply now!
Responsibilities
- Own the scheduler / runtime layer (Slurm, Kubernetes, Ray), including multi-tenancy, quotas, and GPU / host fleet management.
- Lead cluster operations across images, CI/CD, repair / health, performance / telemetry, and incident response.
- Deliver platform services that ensure workload SLOs and reliable runtime execution.
- Define and implement namespace / tenancy design, node health automation, golden images, admission controls, on-call runbooks, and go-live gates.
- Collaborate closely with infra, SRE, and network teams to optimize workload placement and cluster efficiency.
- Provide hands-on expertise in NCCL behaviours, placement strategies, and congestion signal management.
Requirements
- Deep expertise in cluster management, scheduling, and runtime environments for large-scale compute.
- Hands-on background with Slurm, Kubernetes, Ray, or similar orchestration platforms.
- Strong understanding of NCCL performance tuning, workload isolation, and congestion management.
- Experience scaling multi-tenant, GPU-heavy clusters with strict SLOs.
- Ability to thrive in a startup environment with full ownership over platform and cluster strategy.
Salary
$500,000 gross per year (Negotiable)