Senior Site Reliability Engineer / HPC - Pre-IPO Tech Leader
About The Role
We are seeking a highly skilled Senior Site Reliability Engineer (SRE) / High-Performance Computing (HPC) Engineer to design, build, and operate the large-scale infrastructure that powers a $2.5B pre-IPO technology company. Our systems run on massive distributed clusters, handling some of the most demanding workloads in cloud, AI, and data-driven computing.
In this role, you will be responsible for ensuring the reliability, scalability, and performance of mission-critical platforms. You will optimize HPC workloads, streamline CI/CD for large-scale clusters, and enable research and product teams to deliver innovations with speed and confidence. This is a hands-on position with the opportunity to influence architecture, lead reliability initiatives, and solve some of the hardest problems in distributed systems and performance engineering.
What You’ll Do
- Design Reliable Infrastructure : Architect and maintain large-scale, distributed HPC and cloud-native systems with a focus on uptime, scalability, and resilience.
- Optimize HPC Workloads : Tune scheduling, job orchestration, and performance for compute- and memory-intensive workloads (AI / ML, simulations, large-scale analytics).
- Build Observability : Implement monitoring, logging, and alerting systems that provide full visibility into cluster and service health.
- Automate Everything : Develop tooling and automation for provisioning, scaling, and recovery of critical systems.
- Ensure Security & Compliance : Implement best practices for access control, encryption, and governance across HPC and cloud environments.
- Collaborate Cross-Functionally : Work with engineering, research, and product teams to deliver reliable infrastructure for next-gen applications.
- Incident Response : Lead troubleshooting, root cause analysis, and postmortems for high-severity incidents.
What We’re Looking For
- Professional Experience : 7+ years in SRE, infrastructure engineering, or HPC roles with a proven track record of supporting large-scale distributed systems.
- Technical Skills : Expertise in Linux systems, Python or Go, and infrastructure-as-code (Terraform, Ansible, or similar).
- HPC Expertise : Strong knowledge of job schedulers (Slurm, Kubernetes, or Mesos), workload managers, and parallel / distributed computing.
- Cloud & Hybrid : Hands-on experience with AWS, GCP, or Azure in combination with on-premises HPC clusters.
- Observability : Proficiency with monitoring and logging frameworks (Prometheus, Grafana, ELK, OpenTelemetry).
- Resilience Engineering : Experience with chaos engineering, failure testing, and disaster recovery planning.
- Collaboration : Strong communication skills and the ability to work with research scientists, engineers, and operations teams.
- Education : Bachelor’s or Master’s degree in Computer Science, Engineering, or related field.
Why Join
This is an opportunity to join a pre-IPO technology leader valued at $2.5B, at a time of rapid growth and innovation. As a Senior SRE / HPC Engineer, you will shape the infrastructure that powers next-generation AI, analytics, and large-scale computing. You’ll solve some of the most complex reliability and performance challenges, collaborate with world-class teams, and play a key role in preparing the company for IPO and beyond.