AI Infra Engineer

Pantera Capital, San Francisco, CA, United States

1 day ago

Job type

Full-time

Job description

Location

San Francisco

Employment Type

Full-time

Location Type

Hybrid

Department

We are looking for an AI Infrastructure Engineer to join our growing team. We work with Kubernetes, Slurm, Python, C++, and PyTorch, primarily on AWS. As an AI Infrastructure Engineer, you will partner closely with our Inference and Research teams to build, deploy, and optimize our large-scale AI training and inference clusters.

Responsibilities

Design, deploy, and maintain scalable Kubernetes clusters for AI model inference and training workloads
Manage and optimize Slurm-based HPC environments for distributed training of large language models
Develop robust APIs and orchestration systems for both training pipelines and inference services
Implement resource scheduling and job management systems across heterogeneous compute environments
Benchmark system performance, diagnose bottlenecks, and implement improvements across both training and inference infrastructure
Build monitoring, alerting, and observability solutions tailored to ML workloads running on Kubernetes and Slurm
Respond swiftly to system outages and collaborate across teams to maintain high uptime for critical training runs and inference services
Optimize cluster utilization and implement autoscaling strategies for dynamic workload demands

Qualifications

Strong expertise in Kubernetes administration, including custom resource definitions, operators, and cluster management

Hands‑on experience with Slurm workload management, including job scheduling, resource allocation, and cluster optimization

Experience with deploying and managing distributed training systems at scale

Deep understanding of container orchestration and distributed systems architecture

High-level familiarity with LLM architecture and training processes (Multi‑Head Attention, Multi‑Query / Grouped‑Query Attention, distributed training strategies)

Experience managing GPU clusters and optimizing compute resource utilization

Required Skills

Expert‑level Kubernetes administration and YAML configuration management

Proficiency with Slurm job scheduling, resource management, and cluster configuration

Python and C++ programming with a focus on systems and infrastructure automation

Hands‑on experience with ML frameworks such as PyTorch in distributed training contexts

Strong understanding of networking, storage, and compute resource management for ML workloads

Experience developing APIs and managing distributed systems for both batch and real‑time workloads

Solid debugging and monitoring skills with expertise in observability tools for containerized environments

Preferred Skills

Experience with Kubernetes operators and custom controllers for ML workloads

Advanced Slurm administration including multi‑cluster federation and advanced scheduling policies

Familiarity with GPU cluster management and CUDA optimization

Experience with other ML frameworks like TensorFlow or distributed training libraries

Background in HPC environments, parallel computing, and high‑performance networking

Knowledge of infrastructure as code (Terraform, Ansible) and GitOps practices

Experience with container registries, image optimization, and multi‑stage builds for ML workloads

Required Experience

Demonstrated experience managing large‑scale Kubernetes deployments in production environments

Proven track record with Slurm cluster administration and HPC workload management

Previous roles in SRE, DevOps, or Platform Engineering with a focus on ML infrastructure

Experience supporting both long‑running training jobs and high‑availability inference services

Ideally, 3‑5 years of relevant experience in ML systems deployment, with a specific focus on cluster orchestration and resource management

The cash compensation range for this role is $190,000 - $250,000.

Final offer amounts are determined by multiple factors, including experience and expertise, and may vary from the amounts listed above.

Equity: In addition to the base salary, equity may be part of the total compensation package.

Benefits: Comprehensive health, dental, and vision insurance for you and your dependents, plus a 401(k) plan.
