Talent.com
Site Reliability Engineer

xAI, Memphis, Tennessee, United States
30+ days ago
Job type
  • Full-time
Job description

About xAI

xAI’s mission is to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge. Our team is small, highly motivated, and focused on engineering excellence. This organization is for individuals who appreciate challenging themselves and thrive on curiosity. We operate with a flat organizational structure. All employees are expected to be hands-on and to contribute directly to the company’s mission. Leadership is given to those who show initiative and consistently deliver excellence. Work ethic and strong prioritization skills are important. All engineers are expected to have strong communication skills. They should be able to concisely and accurately share knowledge with their teammates.

About the Role

As a Data Center Site Reliability Engineer (SRE) at xAI, you will play a pivotal role in ensuring the reliability, scalability, and performance of our state-of-the-art data center infrastructure, including the Colossus supercluster in Memphis—the world's largest AI training cluster with over 100,000 liquid-cooled Nvidia GPUs and plans for expansion to 1 million. This infrastructure powers advanced AI workloads, massive-scale model training, and products like Grok, enabling breakthroughs in understanding the universe. You will collaborate with cross-functional teams to automate operations, enhance observability, and maintain high availability for large-scale distributed systems. This is a hands-on technical position in a dynamic environment, offering the opportunity to tackle complex challenges at the intersection of AI, data center operations, and software reliability.

Responsibilities

  • Maintain and improve the reliability and uptime of xAI’s on-premises and cloud-based data center environments, including high-density GPU clusters for AI training.
  • Design, implement, and manage monitoring, logging, and alerting systems (e.g., Prometheus, Grafana, PagerDuty).
  • Develop and maintain infrastructure-as-code (Pulumi, Terraform) and continuous deployment pipelines (Buildkite, ArgoCD).
  • Participate in on-call rotations, respond to incidents, perform root cause analysis, and drive post-mortem processes.
  • Analyze system performance, forecast capacity needs, and optimize resource utilization for massive AI/ML workloads.
  • Collaborate with hardware, networking, and software engineering teams to design and implement resilient, scalable solutions, such as RDMA fabrics and liquid-cooling systems.
  • Create and maintain documentation and standard operating procedures.
  • Contribute to the efficiency of AI training pipelines by identifying and mitigating bottlenecks in compute, storage, and networking at unprecedented scales.

Basic Qualifications

  • Bachelor’s degree in Computer Science, Engineering, or a related technical field (or equivalent experience).
  • 5+ years in site reliability engineering, data center operations, or large-scale infrastructure management.
  • Expert-level knowledge of Kubernetes (on-prem and cloud), infrastructure-as-code tools (Pulumi, Terraform), and CI/CD systems (Buildkite, ArgoCD).
  • Proficiency in at least one systems programming language (Rust, C++, Go) and strong scripting and automation skills.
  • Deep understanding of monitoring and observability technologies.
  • Strong troubleshooting skills across hardware, networking, and distributed software systems.
  • Proven experience with incident response, including on-call rotations, rapid incident resolution, root cause analysis, and implementation of preventative measures.
  • Excellent communication and documentation skills, with the ability to share knowledge concisely and accurately.

Preferred Skills and Experience

  • Experience supporting AI/ML workloads or high-density compute environments, including large-scale GPU clusters and HPC systems.
  • Familiarity with data center electrical, cooling, and network systems, such as liquid-cooling and high-bandwidth interconnects.
  • Certifications in SRE, Kubernetes, or data center operations.
  • Experience with both on-premises and cloud infrastructure at scale.

xAI is an equal opportunity employer.
