Talent.com
Senior AI and ML HPC Cluster Engineer
NVIDIA • Santa Clara, CA, United States
14 hours ago
Job type
  • Full-time
Job description

NVIDIA has continuously reinvented itself over two decades. Our invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined modern computer graphics, and revolutionized parallel computing. More recently, GPU deep learning ignited modern AI — the next era of computing. NVIDIA is a “learning machine” that constantly evolves by adapting to new opportunities that are hard to solve, that only we can tackle, and that matter to the world. This is our life’s work, to amplify human imagination and intelligence. Make the choice to join us today!

As a member of the GPU AI / HPC Infrastructure team, you will provide leadership in the design and implementation of groundbreaking GPU compute clusters that run demanding deep learning, high performance computing, and computationally intensive workloads. We seek a technical leader to identify architectural changes and / or completely new approaches for our GPU compute clusters. As an expert, you will help us with the strategic challenges we encounter, including: compute, networking, and storage design for large scale, high performance workloads; effective resource utilization in a heterogeneous compute environment; evolving our private / public cloud strategy; capacity modeling; and growth planning across our global computing environment.

What you'll be doing:

Provide leadership and strategic guidance on the management of large-scale HPC systems including the deployment of compute, networking, and storage.

Develop and improve our ecosystem around GPU-accelerated computing, including scalable automation solutions

Build and maintain AI and ML heterogeneous clusters on-premises and in the cloud

Create and cultivate customer and cross-team relationships to reliably sustain the clusters and meet evolving user needs

Support our researchers to run their workloads including performance analysis and optimizations

Conduct root cause analysis and suggest corrective action

Proactively find and fix issues before they occur

What we need to see:

Bachelor’s degree in Computer Science, Electrical Engineering or related field or equivalent experience

5+ years of experience designing and operating large scale compute infrastructure

Experience with AI / HPC advanced job schedulers, such as Slurm, K8s, PBS, RTDA or LSF

Proficient in administering CentOS / RHEL and / or Ubuntu Linux distributions

Solid understanding of cluster configuration management tools such as Ansible, Puppet, Salt

In-depth understanding of container technologies like Docker, Singularity, Podman, Shifter, Charliecloud

Proficiency in Python programming and bash scripting

Applied experience with AI / HPC workflows that use MPI

Experience analyzing and tuning performance for a variety of AI / HPC workloads.

Passion for continual learning and staying ahead of emerging technologies and effective approaches in the HPC and AI / ML infrastructure fields.

Ways to stand out from the crowd:

Background with NVIDIA GPUs, CUDA Programming, NCCL and MLPerf benchmarking

Experience with Machine Learning and Deep Learning concepts, algorithms and models

Familiarity with InfiniBand, including IPoIB and RDMA

Understanding of fast, distributed storage systems like Lustre and GPFS for AI / HPC workloads

Familiarity with deep learning frameworks like PyTorch and TensorFlow

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 136,000 USD - 212,750 USD for Level 3, and 168,000 USD - 264,500 USD for Level 4.

You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until October 22, 2025.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.
