Talent.com
Senior Site Reliability Engineer / HPC - Pre-IPO Tech Leader

Andiamo
San Francisco, CA, United States
1 day ago
Job type
  • Full-time
Job description

About The Role

We are seeking a highly skilled Senior Site Reliability Engineer (SRE) / High-Performance Computing (HPC) Engineer to design, build, and operate the large-scale infrastructure that powers a $2.5B pre-IPO technology company. Our systems run on massive distributed clusters, handling some of the most demanding workloads in cloud, AI, and data-driven computing.

In this role, you will be responsible for ensuring the reliability, scalability, and performance of mission-critical platforms. You will optimize HPC workloads, streamline CI/CD for large-scale clusters, and enable research and product teams to deliver innovations with speed and confidence. This is a hands-on position with the opportunity to influence architecture, lead reliability initiatives, and solve some of the hardest problems in distributed systems and performance engineering.

What You’ll Do

  • Design Reliable Infrastructure: Architect and maintain large-scale, distributed HPC and cloud-native systems with a focus on uptime, scalability, and resilience.
  • Optimize HPC Workloads: Tune scheduling, job orchestration, and performance for compute- and memory-intensive workloads (AI/ML, simulations, large-scale analytics).
  • Build Observability: Implement monitoring, logging, and alerting systems that provide full visibility into cluster and service health.
  • Automate Everything: Develop tooling and automation for provisioning, scaling, and recovery of critical systems.
  • Ensure Security & Compliance: Implement best practices for access control, encryption, and governance across HPC and cloud environments.
  • Collaborate Cross-Functionally: Work with engineering, research, and product teams to deliver reliable infrastructure for next-gen applications.
  • Incident Response: Lead troubleshooting, root cause analysis, and postmortems for high-severity incidents.

What We’re Looking For

  • Professional Experience: 7+ years in SRE, infrastructure engineering, or HPC roles with a proven track record of supporting large-scale distributed systems.
  • Technical Skills: Expertise in Linux systems, Python or Go, and infrastructure-as-code (Terraform, Ansible, or similar).
  • HPC Expertise: Strong knowledge of job schedulers (Slurm, Kubernetes, or Mesos), workload managers, and parallel/distributed computing.
  • Cloud & Hybrid: Hands-on experience with AWS, GCP, or Azure in combination with on-premises HPC clusters.
  • Observability: Proficiency with monitoring and logging frameworks (Prometheus, Grafana, ELK, OpenTelemetry).
  • Resilience Engineering: Experience with chaos engineering, failure testing, and disaster recovery planning.
  • Collaboration: Strong communication skills and the ability to work with research scientists, engineers, and operations teams.
  • Education: Bachelor’s or Master’s degree in Computer Science, Engineering, or related field.

Why Join

This is an opportunity to join a pre-IPO technology leader valued at $2.5B, at a time of rapid growth and innovation. As a Senior SRE / HPC Engineer, you will shape the infrastructure that powers next-generation AI, analytics, and large-scale computing. You’ll solve some of the most complex reliability and performance challenges, collaborate with world-class teams, and play a key role in preparing the company for IPO and beyond.
