Product Infrastructure Engineer - Site Reliability
Zyphra • Palo Alto, California, United States
30+ days ago
Job type
  • Full-time
Job description

Zyphra is an artificial intelligence company based in Palo Alto, California.

The Role:

As an Infrastructure Engineer - Site Reliability, you’ll be responsible for designing and maintaining the systems that keep Zyphra’s infrastructure robust, observable, secure, and scalable. Your work will be essential to ensuring the reliability and reproducibility of ML workloads, the safety and control of deployments, and the long-term maintainability of our compute environments.

You’ll work across:

Building and improving observability systems (monitoring, logging, alerting)

Designing resilient build and deployment systems across research and production environments

Implementing secure release processes with strong auditability and rollback support

Collaborating closely with ML engineers, DevOps, and infra teams to improve system reliability and performance

Leading incident response, root-cause analysis, and postmortems with a focus on learning and prevention

This role is ideal for someone who loves building systems that make other teams faster, safer, and more productive.

Requirements:

Experience in high-performance compute environments, such as ML clusters or GPU farms

Background in infrastructure as code (e.g., Ansible, Terraform)

Familiarity with software release engineering for ML / AI systems is a plus

Experience designing reliable environments for experimental workloads and reproducible runs

Knowledge of compliance and audit standards in deployment and system security

Experience with load testing, fault injection, and chaos engineering to harden systems under stress

Passion for building tooling that makes infrastructure invisible and reliable for end users

Bonus Qualifications:

Experience with infrastructure as code (e.g., Ansible, Terraform)

Prior work supporting ML / AI infrastructure, including GPU management and workload optimization

Exposure to backend development for ML model serving (e.g., vLLM, Ray, SGLang)

Experience working with cloud platforms such as AWS, Azure, or GCP

Familiarity with containers (Docker, Apptainer) and their integration with scheduling systems (Slurm, Kubernetes)

Why Work at Zyphra:

Our research methodology is to take grounded, methodical steps toward ambitious goals. Deep research and engineering excellence are equally valued

We strongly value new and crazy ideas and are very willing to bet big on them

We move as quickly as we can; we aim to keep the bar to impact as low as possible

We all enjoy what we do and love discussing AI

Benefits and Perks:

Comprehensive medical, dental, vision, and FSA plans

Competitive compensation and 401(k)

Relocation and immigration support on a case-by-case basis

On-site meals prepared by a dedicated culinary team; Thursday Happy Hours

In-person team in Palo Alto, CA, with a collaborative, high-energy environment

If you are excited to bring reliability best practices to the frontier of AI infrastructure, this job is for you. Apply Today!
