Staff Site Reliability Engineer, Compute

Crusoe • San Francisco, California, United States

30+ days ago

Job type

Full-time

Job description

Crusoe is building the World’s Favorite AI-first Cloud infrastructure company. We’re pioneering vertically integrated, purpose-built AI infrastructure solutions trusted by Fortune 500 companies to power their most advanced AI applications. Crusoe is redefining AI cloud infrastructure, with a mission to align the future of computing with the future of the climate. Our AI platform is recognized as the "gold standard" for reliability and performance. Our data centers are optimized for AI workloads and are powered by clean, renewable energy.

Be part of the AI revolution with sustainable technology at Crusoe. Here, you'll drive meaningful innovation, make a tangible impact, and join a team that’s setting the pace for responsible, transformative cloud infrastructure.

About This Role:

At Crusoe, we are building the most sustainable, AI-first cloud infrastructure, and our Compute-focused Site Reliability Engineers are the backbone of that mission. This role is centered on supporting virtualization, hypervisor, and kernel-level performance for Crusoe’s compute infrastructure. You’ll play a vital role in deploying and optimizing bare-metal and virtualized compute platforms, ensuring performance, security, and scale for modern AI and HPC workloads.

What You'll Be Working On:

In this role, you will develop automation and observability tools to monitor Crusoe’s compute infrastructure, from the kernel to the orchestration layers. You will support and scale the company’s virtualization stack, including technologies such as KVM, QEMU, and other hypervisors. Collaborating with Linux kernel and hardware teams, you’ll help identify and resolve performance bottlenecks and driver issues, and optimize hardware offloads. A key focus will be optimizing performance for AI and HPC workloads across CPU, GPU, and DPU/NIC resources. You will participate in root cause analysis for kernel crashes, hardware-software integration problems, and performance regressions, while also integrating hypervisor-level enhancements to improve guest VM reliability and workload isolation. The role involves tuning kernel subsystems such as the process scheduler, NUMA configuration, memory management, and interrupt handling. Additionally, you will work closely with platform teams to implement and validate support for emerging compute hardware, including SmartNICs, BlueField devices, and TPUs.

What You’ll Bring to the Team:

8+ years of professional experience in SRE, Linux system engineering, or compute infrastructure roles.

Strong proficiency in Linux kernel internals, with exposure to scheduler, memory allocation, and driver subsystems.

Experience with virtualization architectures and technologies such as KVM, Xen, QEMU, or VMware.

Familiarity with SmartNICs/DPUs (e.g., NVIDIA ConnectX-6/7, BlueField-3) and kernel bypass techniques.

Expert-level skills in at least one programming language: C, Go, or Rust.

Experience with system-level debugging, including kdump, kexec, and kernel panic analysis.

Proficiency in Infrastructure as Code tooling and CI/CD practices for bare-metal or cloud infrastructure.

Strong understanding of compute scheduling, resource management, and high-throughput networking.

Benefits:

Hybrid work schedule

Industry-competitive pay

Restricted Stock Units in a fast-growing, well-funded technology company

Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents

Employer contributions to HSA accounts

Paid Parental Leave

Paid life insurance, short-term and long-term disability

Teladoc

401(k) with a 100% match up to 4% of salary

Generous paid time off and holiday schedule

Cell phone reimbursement

Tuition reimbursement

Subscription to the Calm app

MetLife Legal

Company-paid commuter benefit: $50 per pay period

Compensation Range:

Compensation is up to $250,000 per year plus bonus. Restricted Stock Units are included in all offers. Compensation is determined by the applicant’s education, experience, knowledge, skills, and abilities, as well as internal equity and alignment with market data.

Crusoe is an Equal Opportunity Employer. Employment decisions are made without regard to race, color, religion, disability, genetic information, pregnancy, citizenship, marital status, sex/gender, sexual preference/orientation, gender identity, age, veteran status, national origin, or any other status protected by law or regulation.
