Talent.com
AI Infra Engineer

Perplexity AI Inc.
San Francisco, CA, United States
4 days ago
Job type
  • Full-time
Job description

We are looking for an AI Infrastructure Engineer to join our growing team. We work with Kubernetes, Slurm, Python, C++, and PyTorch, primarily on AWS. In this role, you will partner closely with our Inference and Research teams to build, deploy, and optimize our large-scale AI training and inference clusters.

Responsibilities

Design, deploy, and maintain scalable Kubernetes clusters for AI model inference and training workloads

Manage and optimize Slurm-based HPC environments for distributed training of large language models

Develop robust APIs and orchestration systems for both training pipelines and inference services

Implement resource scheduling and job management systems across heterogeneous compute environments

Benchmark system performance, diagnose bottlenecks, and implement improvements across both training and inference infrastructure

Build monitoring, alerting, and observability solutions tailored to ML workloads running on Kubernetes and Slurm

Respond swiftly to system outages and collaborate across teams to maintain high uptime for critical training runs and inference services

Optimize cluster utilization and implement autoscaling strategies for dynamic workload demands

Qualifications

Strong expertise in Kubernetes administration, including custom resource definitions, operators, and cluster management

Hands-on experience with Slurm workload management, including job scheduling, resource allocation, and cluster optimization

Experience deploying and managing distributed training systems at scale

Deep understanding of container orchestration and distributed systems architecture

High-level familiarity with LLM architecture and training processes (Multi-Head Attention, Multi-Query / Grouped-Query Attention, distributed training strategies)

Experience managing GPU clusters and optimizing compute resource utilization

Required Skills

Expert-level Kubernetes administration and YAML configuration management

Proficiency with Slurm job scheduling, resource management, and cluster configuration

Python and C++ programming with focus on systems and infrastructure automation

Hands-on experience with ML frameworks such as PyTorch in distributed training contexts

Strong understanding of networking, storage, and compute resource management for ML workloads

Experience developing APIs and managing distributed systems for both batch and real-time workloads

Solid debugging and monitoring skills with expertise in observability tools for containerized environments

Preferred Skills

Experience with Kubernetes operators and custom controllers for ML workloads

Advanced Slurm administration including multi-cluster federation and advanced scheduling policies

Familiarity with GPU cluster management and CUDA optimization

Experience with other ML frameworks like TensorFlow or distributed training libraries

Background in HPC environments, parallel computing, and high-performance networking

Knowledge of infrastructure as code (Terraform, Ansible) and GitOps practices

Experience with container registries, image optimization, and multi-stage builds for ML workloads

Required Experience

Demonstrated experience managing large-scale Kubernetes deployments in production environments

Proven track record with Slurm cluster administration and HPC workload management

Previous roles in SRE, DevOps, or Platform Engineering with focus on ML infrastructure

Experience supporting both long-running training jobs and high-availability inference services

Ideally, 3-5 years of relevant experience in ML systems deployment with specific focus on cluster orchestration and resource management

The cash compensation range for this role is $190,000 - $250,000.

Final offer amounts are determined by multiple factors, including experience and expertise, and may vary from the amounts listed above.

Equity: In addition to the base salary, equity may be part of the total compensation package.

Benefits: Comprehensive health, dental, and vision insurance for you and your dependents, plus a 401(k) plan.
