Senior Site Reliability Engineer

Citizen Health • San Francisco, California, United States

1 day ago

Job type

Full-time

Job description

Who We Are

Citizen Health was founded on the belief, shaped by firsthand lived experiences navigating healthcare, that having the right advocate is the single most important factor in achieving better care and outcomes. By uniquely combining data, AI, and community, Citizen is building a personalized AI advocate powered by patients' complete medical histories and data from thousands of other patients to generate personalized insights for clinical decisions and day-to-day challenges. Starting in rare and complex conditions, patients share their data in exchange for value, giving biopharma and researchers seamless access to patients and regulatory-grade data and shaving years off drug development for much-needed treatments.

Citizen is founded by experienced entrepreneurs with multiple successes under their belts and is funded by top-tier investors including 8VC, Transformation Capital, and Headline Ventures, among others. We are a mission-driven team excited to be building the future of consumer healthcare.

The Role

Citizen Health is seeking a Senior Site Reliability Engineer (SRE) to ensure the resilience, performance, and availability of our AI-powered, patient-centric healthcare platform.

In this hands-on, high-impact role, you will apply software engineering principles to operational challenges, designing and maintaining reliable systems that scale, fail gracefully, and recover quickly. You'll work cross-functionally to define SLOs / SLIs, implement robust observability, establish a metrics-driven approach to service performance, and drive improvements in incident response, fault tolerance, and service reliability.

If you're passionate about building systems that stay up, scale well, and recover fast — and you thrive on solving reliability challenges in modern cloud-native environments — we’d love to talk to you.

Responsibilities

Reliability Engineering & Observability

Define and measure service reliability through SLIs, SLOs, and error budgets.

Implement and operate observability tooling (e.g., New Relic, Prometheus) across cloud and Kubernetes environments.

Analyze logs, traces, and metrics to surface actionable insights and improve system health.

Perform capacity planning, load testing, and performance profiling and tuning to support scale, reliability, and system performance.

Resilience & Automation

Design and maintain resilient, self-healing infrastructure in AWS and Kubernetes (EKS).

Conduct chaos engineering experiments, failure mode analysis, and disaster recovery drills to proactively identify and fix weaknesses.

Build automation to reduce toil, improve reliability metrics (latency, uptime, error rates, MTTD, and MTTR), and prevent recurrence of incidents.

Engineer infrastructure for fault tolerance, auto-scaling, and graceful degradation.

Incident Response & Operations

Drive incident response efforts, manage on-call rotations, and coordinate resolution of production outages.

Conduct root cause analyses and blameless postmortems to drive learning and resilience.

Continuously improve key reliability metrics such as latency, uptime, error rates, and availability.

Collaborate with security, platform, and DevOps teams to ensure high availability and production readiness of services.

Cross-Team Collaboration & Culture

Collaborate with engineering teams during design and architecture phases to assess and mitigate reliability risks.

Support progressive delivery strategies including feature flags and canary deployments.

Champion SRE principles and practices, helping build a culture of resilience and shared ownership.

Stay current with emerging practices in cloud reliability, observability, and SRE tooling.

Who You Are

You are a hands-on individual who thrives in fast-paced, high-stakes environments where reliability is mission-critical. You bring deep experience operating distributed systems at scale, with a strong foundation in cloud-native infrastructure, Kubernetes, and observability.

You think like a software engineer but focus like an operator, using code to solve operational challenges. You’re driven by making systems more resilient, reducing downtime, and building fault-tolerant architectures that scale with user demand. You value data-driven decision making, and see SLIs / SLOs, incident postmortems, and continuous improvement as essential tools, not checkboxes.

You have a strong sense of ownership, thrive under pressure, and believe the best systems are the ones that heal themselves. You collaborate closely across teams, care deeply about the end-user experience, and are always looking for better ways to keep complex systems running smoothly.

Must-Have Skills

5+ years in Site Reliability, DevOps, or Infrastructure Engineering roles.

Strong software engineering skills in languages such as Python, Go, or Bash.

Deep expertise operating production systems in AWS, with additional experience in GCP or Azure.

Proven experience operating and scaling Kubernetes (EKS) in production.

Experience implementing GitOps with FluxCD (or similar).

Hands-on experience implementing observability, auto-scaling, and self-healing systems.

Strong foundation in networking, load balancing, CDN, and container orchestration.

Solid knowledge of performance optimization techniques (e.g., profiling, caching, tuning).

Strong incident response background, including postmortems and SLO / SLI development, driving improvements through data and analysis.

Passion for reliability, automation, operational excellence, and building systems patients and clinicians can trust.

Excellent communication and collaboration skills.

A methodical approach to problem-solving and system design.

Preferred Skills

Deep understanding of distributed systems, fault tolerance, and failure modes.

Experience implementing chaos engineering practices (Gremlin, Chaos Mesh, etc.).

Familiarity with multi-region, multi-cloud reliability strategies.

Experience with service meshes (e.g., Istio) and resilience patterns.

Solid background in security, compliance, and operational hardening (HIPAA, SOC 2).

Experience in capacity planning, scaling, and disaster recovery design.

Why Join Us?

At Citizen Health, you will be part of an ambitious mission to build innovative AI-driven solutions that redefine the consumer healthcare experience and improve outcomes for millions. We value curiosity, creativity, and collaboration, and we believe in empowering every team member to have an impact. Here, you’ll find:

The chance to work on pioneering AI technologies and solve complex problems that push the boundaries of what’s possible in healthcare

The freedom to lead your projects and shape the company’s direction in a mission-driven environment where your work makes a real impact

A fast-paced, high-growth setting that brings expanding opportunities and clear paths for career progression

A supportive culture of learning, experimentation, and innovation, where new ideas are encouraged and explored

A commitment to empathy and diversity, recognizing that our greatest strengths come from our varied perspectives and experiences

Regular team activities and knowledge sharing, all within a culture that prioritizes well-being and connection

Additional Perks

Competitive salary + equity package

Comprehensive health, dental, and vision insurance

Unlimited paid time off, including generous parental leave

Flexible hybrid work environment

Don't meet every qualification? No worries! We believe that passion, curiosity, and the right mindset are just as important as a checklist of skills. If you’re excited about what we’re building, we encourage you to apply.

Our Commitment to Diversity & Inclusion: Citizen Health is proud to be an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees. We welcome applicants of all backgrounds, identities, and experiences. Everyone deserves an equal chance to contribute and grow here, regardless of race, gender identity, sexual orientation, religion, national origin, age, disability, or veteran status.
