Site Reliability Engineer - Storage

xAI, Memphis, TN, US
2 days ago
Job type
  • Full-time
Job description

About xAI

xAI's mission is to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge. Our team is small, highly motivated, and focused on engineering excellence. This organization is for individuals who appreciate challenging themselves and thrive on curiosity. We operate with a flat organizational structure. All employees are expected to be hands-on and to contribute directly to the company's mission. Leadership is given to those who show initiative and consistently deliver excellence. Work ethic and strong prioritization skills are important. All engineers are expected to have strong communication skills. They should be able to concisely and accurately share knowledge with their teammates.

About the Role

As a Site Reliability Engineer (SRE) for Storage at xAI, you will play a pivotal role in ensuring the reliability, scalability, and performance of our petabyte-to-exabyte scale storage infrastructure, including filesystems and our internal storage product supporting the Colossus superclusters in Memphis, the world's largest AI training clusters with hundreds of thousands of liquid-cooled GPUs. We're deploying multiple exabytes of storage this year across several sites to fuel Grok's training and advanced AI workloads. You will collaborate with storage engineers, software engineers, and hardware storage teams to deploy, troubleshoot, and optimize storage for 24/7 AI I/O demands such as checkpointing, dataset streaming, and long-term archival storage, and to ensure maximum uptime. This is a hands-on technical position in a dynamic environment, offering the opportunity to tackle complex challenges at the intersection of storage systems, hardware integration, and reliability engineering.

Responsibilities

  • Deploy, maintain, and scale exabyte-scale storage clusters with a focus on observability, zero-downtime upgrades, and integration with high-density GPU environments.
  • Troubleshoot production storage issues across hardware-software stacks (NVMe/PCIe/RDMA paths, firmware bugs, BMC logs, disk failures), performing root cause analysis and automating preventive measures.
  • Collaborate with storage teams to validate server specs, debug field problems and influence custom designs with vendors for cutting-edge AI storage.
  • Evaluate and onboard new storage vendors and technologies; benchmark for cost, density, and GPU-direct performance against AI training I/O patterns.
  • Support storage SDEs by translating engineering requirements into reliable, observable systems; develop scripting and playbooks to reduce toil and enable self-service.
  • Lead hardware refreshes for legacy X storage fleets, including migration, decommissioning, and designing repeatable processes for customized solutions.
  • Participate in on-call rotations (follow-the-sun, generous stipend) for storage domains; respond to incidents, drive post-mortems, and forecast capacity for EiB+ growth.
  • Create and maintain documentation, standard operating procedures, and monitoring for storage health in massive-scale AI pipelines.

Required Qualifications

  • Bachelor's degree in Computer Science, Engineering, or a related technical field (or equivalent experience).
  • 3+ years in site reliability engineering, systems engineering, or storage operations at multi-PB+ scale.
  • Hands-on experience with storage systems from vendors such as VAST, DDN, and Dell; parallel filesystems (such as Lustre, GPFS, Weka); and Linux storage stacks (kernel tuning, eBPF, blktrace, NVMe/RDMA/RoCE).
  • Proficiency in scripting for automation (Python/Bash); light programming experience (Go is a nice-to-have), with an emphasis on operational clarity over heavy coding.
  • Strong troubleshooting skills across storage hardware (e.g., hard drives, SSDs, NVMe drives, and drive enclosures), the associated software and firmware, and vendor qualification/refresh cycles.
  • Experience with incident response, including on-call rotations, rapid resolution, root cause analysis, and implementation of preventative measures.
  • Basic hardware knowledge for storage bring-up and debugging in data center environments.
  • Excellent communication and documentation skills, with the ability to share knowledge concisely and accurately.

xAI is an equal opportunity employer.
