Product Infrastructure Engineer - Site Reliability

Zyphra
Palo Alto, California, United States
30+ days ago
Job type
  • Full-time
Job description

Zyphra is an artificial intelligence company based in Palo Alto, California.

The Role:

As an Infrastructure Engineer - Site Reliability, you’ll be responsible for designing and maintaining the systems that keep Zyphra’s infrastructure robust, observable, secure, and scalable. Your work will be essential to ensuring the reliability and reproducibility of ML workloads, the safety and control of deployments, and the long-term maintainability of our compute environments.

You’ll work across:

Building and improving observability systems (monitoring, logging, alerting)

Designing resilient build and deployment systems across research and production environments

Implementing secure release processes with strong auditability and rollback support

Collaborating closely with ML engineers, DevOps, and infra teams to improve system reliability and performance

Leading incident response, root-cause analysis, and postmortems with a focus on learning and prevention

This role is ideal for someone who loves building systems that make other teams faster, safer, and more productive.

Requirements:

Experience in high-performance compute environments, such as ML clusters or GPU farms

Background in infrastructure as code (e.g., Ansible, Terraform)

Familiarity with software release engineering for ML/AI systems is a plus

Experience designing reliable environments for experimental workloads and reproducible runs

Knowledge of compliance and audit standards in deployment and system security

Experience with load testing, fault injection, and chaos engineering to harden systems under stress

Passion for building tooling that makes infrastructure invisible and reliable for end users

Bonus Qualifications:

Experience with infrastructure as code (e.g., Ansible, Terraform)

Prior work supporting ML/AI infrastructure, including GPU management and workload optimization

Exposure to backend development for ML model serving (e.g., vLLM, Ray, SGLang)

Experience working with cloud platforms such as AWS, Azure, or GCP

Familiarity with containers (Docker, Apptainer) and their integration with scheduling systems (Slurm, Kubernetes)

Why Work at Zyphra:

Our research methodology is to take grounded, methodical steps toward ambitious goals. Deep research and engineering excellence are equally valued

We strongly value new and crazy ideas and are very willing to bet big on them

We move as quickly as we can; we aim to keep the bar to impact as low as possible

We all enjoy what we do and love discussing AI

Benefits and Perks:

Comprehensive medical, dental, vision, and FSA plans

Competitive compensation and 401(k)

Relocation and immigration support on a case-by-case basis

On-site meals prepared by a dedicated culinary team; Thursday Happy Hours

In-person team in Palo Alto, CA, with a collaborative, high-energy environment

If you are excited to bring reliability best practices to the frontier of AI infrastructure, this job is for you. Apply today!
