Site Reliability Engineer

Zapier, San Francisco, CA, United States

5 hours ago

Job type

Full-time

Job description

About Zapier

We're humans who simply think computers should do more work.

At Zapier, we’re not just making software—we’re building a platform to help millions of businesses globally scale with automation and AI. Our mission is to make automation work for everyone by delivering products that delight our customers. You’ll collaborate with brilliant people, use the latest tools, and leverage the flexibility of remote work. Your work will directly fuel our customers’ success, and as they grow, so will you.

Site Reliability Engineer

Job Posted: October 1st, 2025

Location: Remote, NAMER (West Coast)

Zapier’s Internal Platform provides engineers with a reliable, frictionless foundation for building, shipping, and operating software. Our Reliability Platform team owns observability, incident response, and service ownership, and we’re hiring a Site Reliability Engineer to help strengthen Zapier’s reliability posture.

Want to learn more about working at Zapier?

We know we have a lot of competition for your skills. If you’re wondering what things would be like at Zapier, read on about:

Our Commitment to Applicants
Culture and Values at Zapier
Zapier Guide to Remote Work
Zapier Code of Conduct
Diversity and Inclusivity at Zapier

Why This Role Matters

This isn’t just an infrastructure or tooling role. We’re looking for an engineer who’s excited to get hands-on with Zapier’s reliability challenges. You’ll help improve how we observe our systems, detect and respond to incidents, and build the systems that make Zapier more resilient at scale.

About You

You’re an experienced engineer with 4+ years in systems, infrastructure, or backend software roles (SaaS, cloud-native environments preferred).

You thrive on writing production-grade code in Go, Python, or an equivalent language.

You’ve worked with infrastructure-as-code (Terraform or equivalent), cloud (AWS), and container orchestration (Kubernetes).

You have hands-on experience with observability (metrics, logging, dashboards, alerts) and can reason about instrumentation and alert design.

You enjoy solving complex systems challenges and finding ways to improve performance and reliability.

You’re comfortable jumping into incidents, diagnosing across telemetry, coordinating with teams, and contributing to postmortems.

You think proactively about reducing toil and automating repetitive work.

You’re comfortable influencing peers by suggesting better practices, reviewing designs, and driving small cross-team improvements.

You communicate clearly—whether in async docs, real-time discussions, or knowledge sharing with the team.

You align with Zapier’s values and thrive in a collaborative, remote-first environment.

You approach new tools and ideas with curiosity and openness—especially around AI in reliability workflows. You’ve experimented with AI tools (or are eager to learn) and see them as part of your everyday toolkit.

Things You’ll Do

Build and improve platform tooling that helps Zapier engineers observe and operate their services.

Partner with product teams to raise the bar on observability and incident response.

Operate and evolve core observability systems, including logging, metrics, alerting, and dashboards.

Participate in the team’s on-call rotation for owned services and contribute to Zapier’s broader incident response program by improving the processes, tooling, and practices we use to detect, respond to, and learn from incidents.

Write code to automate operations, improve developer experience, and reduce manual toil.

Contribute to infrastructure reliability by working with AWS, Kubernetes, Terraform, and other core technologies.

Help shape observability and reliability best practices: review instrumentation designs, suggest improvements, and advocate for effective alerting.

Share knowledge through documentation, pairing, and mentoring.

Explore and pilot AI-augmented tools (e.g. debugging agents, alert correlation, query recommendations) to improve reliability workflows.

Our Stack & Tools

Cloud & Infra: AWS, Kubernetes, Redis, Kafka, Terraform

Observability: Grafana, Datadog, OpenSearch, Prometheus, Sentry

Languages: Go, Python, TypeScript

CI/CD & Source Control: GitLab, ArgoCD

What Success Looks Like

You deliver reliable, maintainable improvements to Zapier’s reliability systems and tooling.

You improve how teams detect and resolve incidents by enhancing observability, standardizing tooling and processes, and contributing to effective response workflows.

You help product teams gain confidence in their services by guiding them toward better instrumentation and visibility.

You influence observability and reliability practices across teams—promoting a thoughtful, customer-focused approach to monitoring, alerting, and design decisions.

You connect reliability work to customer impact, helping your team focus on the improvements that matter most.

You grow through feedback and reflection, while contributing to a healthy, inclusive team culture—supporting peers, mentoring, and creating space for diverse perspectives.

You explore AI tools with curiosity and introduce practical uses such as reducing noise, speeding up debugging, or guiding better operational decisions.

How to Apply

At Zapier, we believe that diverse perspectives and experiences make us better. We're looking for the best fit for each of our roles, regardless of the type of companies in your background. We encourage you to apply even if your skills and experiences don’t exactly match the job description.

Zapier is an equal-opportunity employer and we're excited to work with talented and empathetic people of all identities. Zapier does not discriminate based on someone's identity in any aspect of hiring or employment as required by law and in line with our commitment to Diversity, Inclusion, Belonging and Equity.

Zapier prioritizes the security of our customers' information and is dedicated to adhering to all applicable data privacy laws.

Zapier is committed to inclusion. If reasonable accommodations are needed to participate in the job application or interview process, please contact jobs@zapier.com.
