Talent.com
Software Manager, AI Infrastructure System
No longer accepting applications
NVIDIA • Santa Clara, CA, United States
30+ days ago
Job type
  • Full-time
Job description

NVIDIA has continuously reinvented itself over two decades. Our invention of the GPU in 1999 fueled the growth of the PC gaming market, redefined modern computer graphics, and revolutionized parallel computing. More recently, GPU deep learning ignited modern AI and enabled the next era of computing. NVIDIA is a "learning machine" that constantly evolves by adapting to new opportunities that are hard to address, that matter to the world, and that only we can address. This is our life's work, to amplify human imagination and intelligence, and expand what is possible. We're seeking strategic, bold, hard-working, and creative individuals who are passionate about helping us tackle challenges no one else can solve. Make the choice to join us today.

We are looking for an AI Infrastructure System Software Manager to join our mission to continue improving our HPC infrastructure. Our team builds and operates sophisticated infrastructure to enable business-critical services and AI applications. You will work with a team of passionate and skilled engineers who are continuously working to provide better tools to build and manage this infrastructure. The ideal candidate is strong in software development, can design and create reliable distributed systems, and has the ability to implement a well-thought-out long-term maintenance strategy.

What you'll be doing:

Mentor, grow, and develop a world-class team of AI infrastructure engineers.

Work across several teams and orgs to build products that use LLMs and agent systems to serve the needs of NVIDIA engineering teams. In this role, you will collaborate with research and infra teams and serve a large user base (hardware / software teams across NVIDIA).

Align priorities across collaborators and define metrics for measuring the success of the product / team.

Develop and execute strategies for scalable, reliable, and secure AI infrastructure supporting both research and production workloads.

Ensure robust monitoring, logging, visualization, and alerting capabilities to guarantee promised uptime and operational excellence.

Architect, design, develop, and maintain infrastructure and large-scale applications for LLM-based solutions. Optimize these systems for performance, scalability, reliability, and secure data management.

Stay updated with the latest trends in AI, ML, and infrastructure, proactively seeking opportunities to integrate advancements into NVIDIA's LLM and AI infrastructure solutions.

What we need to see:

10+ years of overall industry experience developing software for large distributed systems.

BS or higher degree in CS or a related field, or equivalent experience.

5+ years of experience managing AI and software development teams.

Familiarity with modern software development stacks and tools, including containerization, cloud or on-premises deployments, API integration for seamless model operation, and real-time processing frameworks.

Experience in developing and maintaining LLM or GenAI infrastructure.

Excellent communication, collaboration and problem-solving skills, with a dedication to encouraging an inclusive and diverse workplace.

Hands-on experience developing large-scale distributed systems.

Ways to stand out from the crowd:

Strong technical background in cloud / distributed infrastructure

Experience debugging functional and performance issues in HPC GPU clusters

Background in running and instrumenting distributed LLM training on a multi-GPU HPC cluster

Experience with HPC schedulers such as Slurm

Widely considered to be one of the technology world's most desirable employers, NVIDIA offers highly competitive salaries and a comprehensive benefits package. As you plan your future, see what we can offer to you and your family.

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 224,000 USD - 356,500 USD for Level 3, and 272,000 USD - 425,500 USD for Level 4.

You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until July 29, 2025.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

