Talent.com
No longer accepting applications
AI Infra Engineer

Perplexity AI, Palo Alto, CA, US
1 day ago
Job type
  • Full-time
Job description

Perplexity is an AI-powered answer engine founded in December 2022 and growing rapidly as one of the world’s leading AI platforms. Perplexity has raised over $1B in venture investment from some of the world’s most visionary and successful leaders, including Elad Gil, Daniel Gross, Jeff Bezos, Accel, IVP, NEA, NVIDIA, Samsung, and many more. Our objective is to build accurate, trustworthy AI that powers decision-making for people and assistive AI wherever decisions are being made. Throughout human history, change and innovation have always been driven by curious people. Today, curious people use Perplexity to answer more than 780 million queries every month, a number that’s growing rapidly for one simple reason: everyone can be curious.

We are looking for an AI Infra Engineer to join our growing team. We work with Kubernetes, Slurm, Python, C++, and PyTorch, primarily on AWS. As an AI Infrastructure Engineer, you will work in a hybrid SRE / Dev Engineering capacity, partnering closely with our Infrastructure and Research teams to build, deploy, and optimize our large-scale AI training and inference clusters.

Responsibilities

  • Design, deploy, and maintain scalable Kubernetes clusters for AI model inference and training workloads
  • Manage and optimize Slurm-based HPC environments for distributed training of large language models
  • Develop robust APIs and orchestration systems for both training pipelines and inference services
  • Implement resource scheduling and job management systems across heterogeneous compute environments
  • Benchmark system performance, diagnose bottlenecks, and implement improvements across both training and inference infrastructure
  • Build monitoring, alerting, and observability solutions tailored to ML workloads running on Kubernetes and Slurm
  • Respond swiftly to system outages and collaborate across teams to maintain high uptime for critical training runs and inference services
  • Optimize cluster utilization and implement autoscaling strategies for dynamic workload demands

Qualifications

  • Strong expertise in Kubernetes administration, including custom resource definitions, operators, and cluster management
  • Hands-on experience with Slurm workload management, including job scheduling, resource allocation, and cluster optimization
  • Experience with deploying and managing distributed training systems at scale
  • Deep understanding of container orchestration and distributed systems architecture
  • High-level familiarity with LLM architecture and training processes (Multi-Head Attention, Multi-Query / Grouped-Query Attention, distributed training strategies)
  • Experience managing GPU clusters and optimizing compute resource utilization

Required Skills

  • Expert-level Kubernetes administration and YAML configuration management
  • Proficiency with Slurm job scheduling, resource management, and cluster configuration
  • Python and C++ programming with focus on systems and infrastructure automation
  • Hands-on experience with ML frameworks such as PyTorch in distributed training contexts
  • Strong understanding of networking, storage, and compute resource management for ML workloads
  • Experience developing APIs and managing distributed systems for both batch and real-time workloads
  • Solid debugging and monitoring skills with expertise in observability tools for containerized environments

Preferred Skills

  • Experience with Kubernetes operators and custom controllers for ML workloads
  • Advanced Slurm administration including multi-cluster federation and advanced scheduling policies
  • Familiarity with GPU cluster management and CUDA optimization
  • Experience with other ML frameworks like TensorFlow or distributed training libraries
  • Background in HPC environments, parallel computing, and high-performance networking
  • Knowledge of infrastructure as code (Terraform, Ansible) and GitOps practices
  • Experience with container registries, image optimization, and multi-stage builds for ML workloads

Required Experience

  • Demonstrated experience managing large-scale Kubernetes deployments in production environments
  • Proven track record with Slurm cluster administration and HPC workload management
  • Previous roles in SRE, DevOps, or Platform Engineering with focus on ML infrastructure
  • Experience supporting both long-running training jobs and high-availability inference services
  • Ideally, 3-5 years of relevant experience in ML systems deployment with specific focus on cluster orchestration and resource management

The cash compensation range for this role is $190,000 - $250,000.

Final offer amounts are determined by multiple factors, including experience and expertise, and may vary from the amounts listed above.

Equity: In addition to the base salary, equity may be part of the total compensation package.

Benefits: Comprehensive health, dental, and vision insurance for you and your dependents, plus a 401(k) plan.

Perplexity has an office-centric work model with 4 days per week in the office in the San Francisco Bay Area or New York City.
