AI Infra Engineer

Pantera Capital, San Francisco, CA, United States
1 day ago
Job type
  • Full-time
Job description

Location: San Francisco

Employment Type: Full time

Location Type: Hybrid

Department: AI

We are looking for an AI Infra Engineer to join our growing team. We work with Kubernetes, Slurm, Python, C++, and PyTorch, primarily on AWS. As an AI Infrastructure Engineer, you will partner closely with our Inference and Research teams to build, deploy, and optimize our large-scale AI training and inference clusters.

Responsibilities

  • Design, deploy, and maintain scalable Kubernetes clusters for AI model inference and training workloads
  • Manage and optimize Slurm-based HPC environments for distributed training of large language models
  • Develop robust APIs and orchestration systems for both training pipelines and inference services
  • Implement resource scheduling and job management systems across heterogeneous compute environments
  • Benchmark system performance, diagnose bottlenecks, and implement improvements across both training and inference infrastructure
  • Build monitoring, alerting, and observability solutions tailored to ML workloads running on Kubernetes and Slurm
  • Respond swiftly to system outages and collaborate across teams to maintain high uptime for critical training runs and inference services
  • Optimize cluster utilization and implement autoscaling strategies for dynamic workload demands

Qualifications

  • Strong expertise in Kubernetes administration, including custom resource definitions, operators, and cluster management
  • Hands‑on experience with Slurm workload management, including job scheduling, resource allocation, and cluster optimization
  • Experience with deploying and managing distributed training systems at scale
  • Deep understanding of container orchestration and distributed systems architecture
  • High-level familiarity with LLM architecture and training processes (Multi-Head Attention, Multi-Query / Grouped-Query Attention, distributed training strategies)
  • Experience managing GPU clusters and optimizing compute resource utilization

Required Skills

  • Expert‑level Kubernetes administration and YAML configuration management
  • Proficiency with Slurm job scheduling, resource management, and cluster configuration
  • Python and C++ programming with focus on systems and infrastructure automation
  • Hands‑on experience with ML frameworks such as PyTorch in distributed training contexts
  • Strong understanding of networking, storage, and compute resource management for ML workloads
  • Experience developing APIs and managing distributed systems for both batch and real‑time workloads
  • Solid debugging and monitoring skills with expertise in observability tools for containerized environments

Preferred Skills

  • Experience with Kubernetes operators and custom controllers for ML workloads
  • Advanced Slurm administration including multi‑cluster federation and advanced scheduling policies
  • Familiarity with GPU cluster management and CUDA optimization
  • Experience with other ML frameworks like TensorFlow or distributed training libraries
  • Background in HPC environments, parallel computing, and high‑performance networking
  • Knowledge of infrastructure as code (Terraform, Ansible) and GitOps practices
  • Experience with container registries, image optimization, and multi‑stage builds for ML workloads

Required Experience

  • Demonstrated experience managing large‑scale Kubernetes deployments in production environments
  • Proven track record with Slurm cluster administration and HPC workload management
  • Previous roles in SRE, DevOps, or Platform Engineering with focus on ML infrastructure
  • Experience supporting both long‑running training jobs and high‑availability inference services
  • Ideally, 3‑5 years of relevant experience in ML systems deployment with specific focus on cluster orchestration and resource management

The cash compensation range for this role is $190,000 - $250,000.

Final offer amounts are determined by multiple factors, including experience and expertise, and may vary from the amounts listed above.

Equity: In addition to the base salary, equity may be part of the total compensation package.

Benefits: Comprehensive health, dental, and vision insurance for you and your dependents, plus a 401(k) plan.
