Senior Site Reliability Engineer, Compute

Epoch Biodesign • San Francisco, CA, United States

2 days ago

Job type

Full-time

Job description

Location

San Francisco, CA - US

Employment Type

Full-time

Location Type

On-site

Department

Cloud Engineering

Crusoe's mission is to accelerate the abundance of energy and intelligence. We’re crafting the engine that powers a world where people can create ambitiously with AI — without sacrificing scale, speed, or sustainability.

Be a part of the AI revolution with sustainable technology at Crusoe. Here, you'll drive meaningful innovation, make a tangible impact, and join a team that’s setting the pace for responsible, transformative cloud infrastructure.

About This Role:

At Crusoe, we are building the most sustainable, AI-first cloud infrastructure, and our Compute-focused Site Reliability Engineers are the backbone of that mission. This role is centered on supporting virtualization, hypervisor, and kernel-level performance for Crusoe’s compute infrastructure. You’ll play a vital role in deploying and optimizing bare-metal and virtualized compute platforms, ensuring performance, security, and scale for modern AI and HPC workloads.

What You'll Be Working On:

In this role, you will develop automation and observability tools to monitor Crusoe’s compute infrastructure, from the kernel to the orchestration layers. You will support and scale the company’s virtualization stack, including technologies such as KVM, QEMU, and other hypervisors. Collaborating with Linux kernel and hardware teams, you’ll help identify and resolve performance bottlenecks and driver issues, and optimize hardware offloads. A key focus will be optimizing performance for AI and HPC workloads across CPU, GPU, and DPU/NIC resources. You will participate in root cause analysis for kernel crashes, hardware-software integration problems, and performance regressions, and you will integrate hypervisor-level enhancements to improve guest VM reliability and workload isolation. The role involves tuning kernel subsystems such as the process scheduler, NUMA configuration, memory management, and interrupt handling. You will also work closely with platform teams to implement and validate support for emerging compute hardware, including SmartNICs, BlueField devices, and TPUs.

What You’ll Bring to the Team:

8+ years of professional experience in Compute SRE, Linux system engineering, or compute infrastructure roles.

Strong proficiency in Linux kernel internals, with exposure to scheduler, memory allocation, and driver subsystems.

Experience with virtualization architectures and technologies such as KVM, Xen, QEMU, or VMware.

Familiarity with SmartNICs/DPUs (e.g., NVIDIA CX6/7, BlueField-3) and kernel bypass techniques.

Expert-level skills in at least one programming language: Go, C, or Rust.

Experience with system-level debugging, including kdump, kexec, and kernel panic analysis.

Proficiency in Infrastructure as Code tooling and CI/CD practices for bare-metal or cloud infrastructure.

Strong understanding of compute scheduling, resource management, and high-throughput networking.

Benefits:

Industry-competitive pay

Restricted Stock Units in a fast-growing, well-funded technology company

Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents

Employer contributions to HSA accounts

Paid Parental Leave

Paid life insurance, short-term and long-term disability

Teladoc

401(k) with a 100% match up to 4% of salary

Generous paid time off and holiday schedule

Cell phone reimbursement

Tuition reimbursement

Subscription to the Calm app

MetLife Legal

Company-paid commuter benefit: $300/month

Compensation Range:

Compensation will be paid in the range of $172,000 - $209,000 a year + Bonus. Restricted Stock Units are included in all offers. Compensation will be determined by the applicant’s education, experience, knowledge, skills, and abilities, as well as internal equity and alignment with market data.

Crusoe is an Equal Opportunity Employer. Employment decisions are made without regard to race, color, religion, disability, genetic information, pregnancy, citizenship, marital status, sex/gender, sexual preference/orientation, gender identity, age, veteran status, national origin, or any other status protected by law or regulation.
