Staff Site Reliability Engineer, Compute

Crusoe • San Francisco, California, United States

30+ days ago

Job type

Full-time

Job description

Crusoe is building the World’s Favorite AI-first Cloud infrastructure company. We’re pioneering vertically integrated, purpose-built AI infrastructure solutions trusted by Fortune 500 companies to power their most advanced AI applications. Crusoe is redefining AI cloud infrastructure, with a mission to align the future of computing with the future of the climate. Our AI platform is recognized as the "gold standard" for reliability and performance. Our data centers are optimized for AI workloads and are powered by clean, renewable energy.

Be part of the AI revolution with sustainable technology at Crusoe. Here, you'll drive meaningful innovation, make a tangible impact, and join a team that’s setting the pace for responsible, transformative cloud infrastructure.

About This Role:

At Crusoe, we are building the most sustainable, AI-first cloud infrastructure, and our Compute-focused Site Reliability Engineers are the backbone of that mission. This role is centered on supporting virtualization, hypervisor, and kernel-level performance for Crusoe’s compute infrastructure. You’ll play a vital role in deploying and optimizing bare-metal and virtualized compute platforms, ensuring performance, security, and scale for modern AI and HPC workloads.

What You'll Be Working On:

In this role, you will develop automation and observability tools to monitor Crusoe’s compute infrastructure, from the kernel to the orchestration layers. You will support and scale the company’s virtualization stack, including technologies such as KVM, QEMU, and other hypervisors. Collaborating with Linux kernel and hardware teams, you’ll help identify and resolve performance bottlenecks and driver issues, and optimize hardware offloads. A key focus will be optimizing performance for AI and HPC workloads across CPU, GPU, and DPU/NIC resources. You will participate in root cause analysis for kernel crashes, hardware-software integration problems, and performance regressions, and integrate hypervisor-level enhancements to improve guest VM reliability and workload isolation. The role involves tuning kernel subsystems such as the process scheduler, NUMA configuration, memory management, and interrupt handling. Additionally, you will work closely with platform teams to implement and validate support for emerging compute hardware, including SmartNICs, BlueField devices, and TPUs.

What You’ll Bring to the Team:

8+ years of professional experience in Compute SRE, Linux system engineering, or compute infrastructure roles.

Strong proficiency in Linux kernel internals, with exposure to scheduler, memory allocation, and driver subsystems.

Experience with virtualization architectures and technologies such as KVM, Xen, QEMU, or VMware.

Familiarity with SmartNICs/DPUs (e.g., NVIDIA ConnectX-6/7, BlueField-3) and kernel-bypass techniques.

Expert-level skills in at least one programming language: Go, C, or Rust.

Experience with system-level debugging, including kdump, kexec, and kernel panic analysis.

Proficiency in Infrastructure as Code tooling and CI/CD practices for bare-metal or cloud infrastructure.

Strong understanding of compute scheduling, resource management, and high-throughput networking.

Benefits:

Hybrid work schedule

Industry competitive pay

Restricted Stock Units in a fast-growing, well-funded technology company

Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents

Employer contributions to HSA accounts

Paid Parental Leave

Paid life insurance, short-term and long-term disability

Teladoc

401(k) with a 100% match up to 4% of salary

Generous paid time off and holiday schedule

Cell phone reimbursement

Tuition reimbursement

Subscription to the Calm app

MetLife Legal

Company-paid commuter benefit: $200 per pay period

Compensation Range:

Compensation will be paid up to $250,000 per year + Bonus. Restricted Stock Units are included in all offers. Compensation to be determined by the applicant’s education, experience, knowledge, skills, and abilities, as well as internal equity and alignment with market data.

Crusoe is an Equal Opportunity Employer. Employment decisions are made without regard to race, color, religion, disability, genetic information, pregnancy, citizenship, marital status, sex/gender, sexual preference/orientation, gender identity, age, veteran status, national origin, or any other status protected by law or regulation.
