Senior Site Reliability Engineer / HPC - Pre-IPO Tech Leader

Andiamo • San Francisco, CA, United States

1 day ago

Job type

Full-time

Job description

About The Role

We are seeking a highly skilled Senior Site Reliability Engineer (SRE) / High-Performance Computing (HPC) Engineer to design, build, and operate the large-scale infrastructure that powers a $2.5B pre-IPO technology company. Our systems run on massive distributed clusters, handling some of the most demanding workloads in cloud, AI, and data-driven computing.

In this role, you will be responsible for ensuring the reliability, scalability, and performance of mission-critical platforms. You will optimize HPC workloads, streamline CI/CD for large-scale clusters, and enable research and product teams to deliver innovations with speed and confidence. This is a hands-on position with the opportunity to influence architecture, lead reliability initiatives, and solve some of the hardest problems in distributed systems and performance engineering.

What You’ll Do

Design Reliable Infrastructure: Architect and maintain large-scale, distributed HPC and cloud-native systems with a focus on uptime, scalability, and resilience.
Optimize HPC Workloads: Tune scheduling, job orchestration, and performance for compute- and memory-intensive workloads (AI/ML, simulations, large-scale analytics).
Build Observability: Implement monitoring, logging, and alerting systems that provide full visibility into cluster and service health.
Automate Everything: Develop tooling and automation for provisioning, scaling, and recovery of critical systems.
Ensure Security & Compliance: Implement best practices for access control, encryption, and governance across HPC and cloud environments.
Collaborate Cross-Functionally: Work with engineering, research, and product teams to deliver reliable infrastructure for next-gen applications.
Incident Response: Lead troubleshooting, root cause analysis, and postmortems for high-severity incidents.

What We’re Looking For

Professional Experience: 7+ years in SRE, infrastructure engineering, or HPC roles with a proven track record of supporting large-scale distributed systems.

Technical Skills: Expertise in Linux systems, Python or Go, and infrastructure-as-code (Terraform, Ansible, or similar).

HPC Expertise: Strong knowledge of job schedulers (Slurm, Kubernetes, or Mesos), workload managers, and parallel/distributed computing.

Cloud & Hybrid: Hands-on experience with AWS, GCP, or Azure in combination with on-premises HPC clusters.

Observability: Proficiency with monitoring and logging frameworks (Prometheus, Grafana, ELK, OpenTelemetry).

Resilience Engineering: Experience with chaos engineering, failure testing, and disaster recovery planning.

Collaboration: Strong communication skills and the ability to work with research scientists, engineers, and operations teams.

Education: Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.

Why Join

This is an opportunity to join a pre-IPO technology leader valued at $2.5B, at a time of rapid growth and innovation. As a Senior SRE / HPC Engineer, you will shape the infrastructure that powers next-generation AI, analytics, and large-scale computing. You’ll solve some of the most complex reliability and performance challenges, collaborate with world-class teams, and play a key role in preparing the company for IPO and beyond.

