Talent.com
Senior Site Reliability Engineer, Compute

Crusoe, San Francisco, CA, United States
2 days ago
Job type
  • Full-time
Job description

Crusoe's mission is to accelerate the abundance of energy and intelligence. We’re crafting the engine that powers a world where people can create ambitiously with AI — without sacrificing scale, speed, or sustainability.

Be a part of the AI revolution with sustainable technology at Crusoe. Here, you'll drive meaningful innovation, make a tangible impact, and join a team that’s setting the pace for responsible, transformative cloud infrastructure.

About This Role :

At Crusoe, we are building the most sustainable, AI-first cloud infrastructure, and our Compute-focused Site Reliability Engineers are the backbone of that mission. This role is centered on supporting virtualization, hypervisor, and kernel-level performance for Crusoe’s compute infrastructure. You’ll play a vital role in deploying and optimizing bare-metal and virtualized compute platforms, ensuring performance, security, and scale for modern AI and HPC workloads.

What You’ll Be Working On :

In this role, you will develop automation and observability tools to monitor Crusoe’s compute infrastructure, from the kernel to the orchestration layers. You will support and scale the company’s virtualization stack, including technologies such as KVM, QEMU, and other hypervisors. Collaborating with Linux kernel and hardware teams, you’ll help identify and resolve performance bottlenecks and driver issues and optimize hardware offloads. A key focus will be optimizing performance for AI and HPC workloads across CPU, GPU, and DPU / NIC resources. You will participate in root cause analysis for kernel crashes, hardware‑software integration problems, and performance regressions, and you will integrate hypervisor‑level enhancements to improve guest VM reliability and workload isolation. The role involves tuning kernel subsystems such as the process scheduler, NUMA configuration, memory management, and interrupt handling. You will also work closely with platform teams to implement and validate support for emerging compute hardware, including SmartNICs, BlueField devices, and TPUs.

What You’ll Bring to the Team :

  • 8+ years of professional experience in Compute SRE, Linux system engineering, or compute infrastructure roles.
  • Strong proficiency in Linux kernel internals, with exposure to scheduler, memory allocation, and driver subsystems.
  • Experience with virtualization architectures and technologies such as KVM, Xen, QEMU, or VMware.
  • Familiarity with SmartNICs / DPUs (e.g., NVIDIA CX6 / 7, BlueField-3) and kernel bypass techniques.
  • Expert‑level skills in at least one programming language : Go, C, or Rust.
  • Experience with system‑level debugging, including kdump, kexec, and kernel panic analysis.
  • Proficiency in Infrastructure as Code tooling and CI / CD practices for bare‑metal or cloud infrastructure.
  • Strong understanding of compute scheduling, resource management, and high‑throughput networking.

Benefits :

  • Industry competitive pay
  • Restricted Stock Units in a fast-growing, well‑funded technology company
  • Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
  • Employer contributions to HSA accounts
  • Paid Parental Leave
  • Paid life insurance, short‑term and long‑term disability
  • Teladoc
  • 401(k) with a 100% match up to 4% of salary
  • Generous paid time off and holiday schedule
  • Cell phone reimbursement
  • Tuition reimbursement
  • Subscription to the Calm app
  • MetLife Legal
  • Company-paid commuter benefit : $300 / month

Compensation Range :

Compensation will be paid in the range of $172,000 - $209,000 a year + Bonus. Restricted Stock Units are included in all offers. Compensation will be determined by the applicant’s education, experience, knowledge, skills, and abilities, as well as internal equity and alignment with market data.

Crusoe is an Equal Opportunity Employer. Employment decisions are made without regard to race, color, religion, disability, genetic information, pregnancy, citizenship, marital status, sex / gender, sexual preference / orientation, gender identity, age, veteran status, national origin, or any other status protected by law or regulation.
