Senior Site Reliability Engineer, Compute

Crusoe, San Francisco, CA, United States

2 days ago

Job type

Full-time

Job description

Crusoe's mission is to accelerate the abundance of energy and intelligence. We’re crafting the engine that powers a world where people can create ambitiously with AI — without sacrificing scale, speed, or sustainability.

Be a part of the AI revolution with sustainable technology at Crusoe. Here, you'll drive meaningful innovation, make a tangible impact, and join a team that’s setting the pace for responsible, transformative cloud infrastructure.

About This Role:

At Crusoe, we are building the most sustainable, AI-first cloud infrastructure, and our Compute-focused Site Reliability Engineers are the backbone of that mission. This role is centered on supporting virtualization, hypervisor, and kernel-level performance for Crusoe’s compute infrastructure. You’ll play a vital role in deploying and optimizing bare-metal and virtualized compute platforms, ensuring performance, security, and scale for modern AI and HPC workloads.

What You’ll Be Working On:

In this role, you will develop automation and observability tools to monitor Crusoe’s compute infrastructure, from the kernel to the orchestration layers. You will support and scale the company’s virtualization stack, including technologies such as KVM, QEMU, and other hypervisors. Collaborating with Linux kernel and hardware teams, you’ll help identify and resolve performance bottlenecks and driver issues, and optimize hardware offloads. A key focus will be optimizing performance for AI and HPC workloads across CPU, GPU, and DPU/NIC resources. You will participate in root cause analysis for kernel crashes, hardware-software integration problems, and performance regressions, and you will integrate hypervisor-level enhancements to improve guest VM reliability and workload isolation. The role involves tuning kernel subsystems such as the process scheduler, NUMA configuration, memory management, and interrupt handling. You will also work closely with platform teams to implement and validate support for emerging compute hardware, including SmartNICs, BlueField devices, and TPUs.

What You’ll Bring to the Team:

8+ years of professional experience in Compute SRE, Linux system engineering, or compute infrastructure roles.
Strong proficiency in Linux kernel internals, with exposure to scheduler, memory allocation, and driver subsystems.
Experience with virtualization architectures and technologies such as KVM, Xen, QEMU, or VMware.
Familiarity with SmartNICs/DPUs (e.g., NVIDIA ConnectX-6/7, BlueField-3) and kernel bypass techniques.
Expert-level skills in at least one programming language: Go, C, or Rust.
Experience with system‑level debugging, including kdump, kexec, and kernel panic analysis.
Proficiency in Infrastructure as Code tooling and CI/CD practices for bare-metal or cloud infrastructure.
Strong understanding of compute scheduling, resource management, and high‑throughput networking.

Benefits:

Industry competitive pay

Restricted Stock Units in a fast-growing, well-funded technology company

Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents

Employer contributions to HSA accounts

Paid Parental Leave

Paid life insurance, short‑term and long‑term disability

Teladoc

401(k) with a 100% match up to 4% of salary

Generous paid time off and holiday schedule

Cell phone reimbursement

Tuition reimbursement

Subscription to the Calm app

MetLife Legal

Company-paid commuter benefit: $300/month

Compensation Range:

Compensation will be paid in the range of $172,000 - $209,000 a year + Bonus. Restricted Stock Units are included in all offers. Compensation to be determined by the applicant’s education, experience, knowledge, skills, and abilities, as well as internal equity and alignment with market data.

Crusoe is an Equal Opportunity Employer. Employment decisions are made without regard to race, color, religion, disability, genetic information, pregnancy, citizenship, marital status, sex/gender, sexual preference/orientation, gender identity, age, veteran status, national origin, or any other status protected by law or regulation.

