No longer accepting applications

AI Infra Engineer

Perplexity AI, Palo Alto, CA, US

1 day ago

Job type

Full-time

Job description

Perplexity is an AI-powered answer engine founded in December 2022 and growing rapidly as one of the world’s leading AI platforms. Perplexity has raised over $1B in venture investment from some of the world’s most visionary and successful leaders, including Elad Gil, Daniel Gross, Jeff Bezos, Accel, IVP, NEA, NVIDIA, Samsung, and many more. Our objective is to build accurate, trustworthy AI that powers decision-making for people and assistive AI wherever decisions are being made. Throughout human history, change and innovation have always been driven by curious people. Today, curious people use Perplexity to answer more than 780 million queries every month, a number that’s growing rapidly for one simple reason: everyone can be curious.

We are looking for an AI Infra Engineer to join our growing team. We work with Kubernetes, Slurm, Python, C++, and PyTorch, primarily on AWS. As an AI Infrastructure Engineer, you will work in a hybrid SRE / Dev Engineering capacity, partnering closely with our Infrastructure and Research teams to build, deploy, and optimize our large-scale AI training and inference clusters.

Responsibilities

Design, deploy, and maintain scalable Kubernetes clusters for AI model inference and training workloads
Manage and optimize Slurm-based HPC environments for distributed training of large language models
Develop robust APIs and orchestration systems for both training pipelines and inference services
Implement resource scheduling and job management systems across heterogeneous compute environments
Benchmark system performance, diagnose bottlenecks, and implement improvements across both training and inference infrastructure
Build monitoring, alerting, and observability solutions tailored to ML workloads running on Kubernetes and Slurm
Respond swiftly to system outages and collaborate across teams to maintain high uptime for critical training runs and inference services
Optimize cluster utilization and implement autoscaling strategies for dynamic workload demands

Qualifications

Strong expertise in Kubernetes administration, including custom resource definitions, operators, and cluster management

Hands-on experience with Slurm workload management, including job scheduling, resource allocation, and cluster optimization

Experience with deploying and managing distributed training systems at scale

Deep understanding of container orchestration and distributed systems architecture

High-level familiarity with LLM architecture and training processes (Multi-Head Attention, Multi-Query / Grouped-Query Attention, distributed training strategies)

Experience managing GPU clusters and optimizing compute resource utilization

Required Skills

Expert-level Kubernetes administration and YAML configuration management

Proficiency with Slurm job scheduling, resource management, and cluster configuration

Python and C++ programming with focus on systems and infrastructure automation

Hands-on experience with ML frameworks such as PyTorch in distributed training contexts

Strong understanding of networking, storage, and compute resource management for ML workloads

Experience developing APIs and managing distributed systems for both batch and real-time workloads

Solid debugging and monitoring skills with expertise in observability tools for containerized environments

Preferred Skills

Experience with Kubernetes operators and custom controllers for ML workloads

Advanced Slurm administration including multi-cluster federation and advanced scheduling policies

Familiarity with GPU cluster management and CUDA optimization

Experience with other ML frameworks like TensorFlow or distributed training libraries

Background in HPC environments, parallel computing, and high-performance networking

Knowledge of infrastructure as code (Terraform, Ansible) and GitOps practices

Experience with container registries, image optimization, and multi-stage builds for ML workloads

Required Experience

Demonstrated experience managing large-scale Kubernetes deployments in production environments

Proven track record with Slurm cluster administration and HPC workload management

Previous roles in SRE, DevOps, or Platform Engineering with focus on ML infrastructure

Experience supporting both long-running training jobs and high-availability inference services

Ideally, 3-5 years of relevant experience in ML systems deployment with specific focus on cluster orchestration and resource management

The cash compensation range for this role is $190,000 - $250,000.

Final offer amounts are determined by multiple factors, including experience and expertise, and may vary from the amounts listed above.

Equity: In addition to the base salary, equity may be part of the total compensation package.

Benefits: Comprehensive health, dental, and vision insurance for you and your dependents. Includes a 401(k) plan.

Perplexity has an office-centric work model with 4 days per week in the office from the San Francisco Bay Area or New York City.

