Talent.com
Staff Site Reliability Engineer - Managed AI

Crusoe
San Francisco, CA, United States
28 days ago
Job type
  • Full-time
Job description

About the Role

At Crusoe, our Site Reliability Engineering team ensures the reliability and scalability of Crusoe’s AI-optimized cloud platform. We’re looking for an SRE with a strong background in distributed systems and hands-on experience with large language models to help us build and operate managed AI services at scale. This role is central to delivering highly available, performant, and cost-efficient AI infrastructure that powers compute-intensive, latency-sensitive workloads for our customers.

What You’ll Work On

  • Design and operate reliable managed AI services with a focus on serving and scaling LLM workloads
  • Build automation and reliability tooling to support distributed AI pipelines and inference services
  • Define, measure, and improve SLIs/SLOs across AI workloads to ensure performance and reliability targets are met
  • Collaborate with AI, platform, and infrastructure teams to optimize large-scale training and inference clusters
  • Automate observability by building telemetry and performance tuning strategies for latency-sensitive AI services
  • Investigate and resolve reliability issues in distributed AI systems using telemetry, logs, and profiling
  • Contribute to the architecture of next-generation distributed systems purpose-built for AI-first environments

What You’ll Bring

  • Strong software engineering background — experience building production-grade systems beyond scripting or Bash
  • Demonstrated experience in distributed systems design and implementation
  • Hands-on work with large language models (LLMs) or AI/ML infrastructure
  • SRE mindset and experience (whether or not under the SRE title), including:
      • Defining and measuring SLIs/SLOs
      • Building monitoring and observability systems
      • Driving performance and reliability improvements
      • Designing fault-tolerant systems and automated testing strategies
  • Proficiency in at least one modern programming language (Python, Go, Java, C++)
  • Familiarity with Kubernetes or container orchestration platforms
  • Strong collaboration and communication skills
  • Ability to thrive in a fast-paced, mission-driven environment

Bonus Points

  • Experience scaling inference or training workloads for LLMs

Benefits

  • Industry competitive pay
  • Restricted Stock Units in a fast-growing, well-funded technology company
  • Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
  • Employer contributions to HSA accounts
  • Paid Parental Leave
  • Paid life insurance, short-term and long-term disability
  • Teladoc
  • 401(k) with a 100% match up to 4% of salary
  • Generous paid time off and holiday schedule
  • Cell phone reimbursement
  • Tuition reimbursement
  • Subscription to the Calm app
  • MetLife Legal
  • Company-paid commuter benefit: $300 per month

Compensation

Compensation will be paid in the range of $204,000 - $247,000 + Bonus. Restricted Stock Units are included in all offers. Compensation to be determined by the applicant’s education, experience, knowledge, skills, and abilities, as well as internal equity and alignment with market data.

Crusoe is an Equal Opportunity Employer. Employment decisions are made without regard to race, color, religion, disability, genetic information, pregnancy, citizenship, marital status, sex/gender, sexual preference/orientation, gender identity, age, veteran status, national origin, or any other status protected by law or regulation.
