Site Reliability Engineer

xAI, Palo Alto, CA, United States

5 hours ago

Job type

Full-time

Job description

xAI’s mission is to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge. Our team is small, highly motivated, and focused on engineering excellence. This organization is for individuals who appreciate challenging themselves and thrive on curiosity. We operate with a flat organizational structure. All employees are expected to be hands-on and to contribute directly to the company’s mission. Leadership is given to those who show initiative and consistently deliver excellence. Work ethic and strong prioritization skills are important. All engineers and researchers are expected to have strong communication skills. They should be able to concisely and accurately share knowledge with their teammates.

About the Role

As a Data Center Site Reliability Engineer (SRE) at xAI, you will play a pivotal role in ensuring the reliability, scalability, and performance of our state-of-the-art data center infrastructure, including the Colossus supercluster in Memphis—the world's largest AI training cluster with over 100,000 liquid-cooled Nvidia GPUs and plans for expansion to 1 million. This infrastructure powers advanced AI workloads, massive-scale model training, and products like Grok, enabling breakthroughs in understanding the universe. You will collaborate with cross-functional teams to automate operations, enhance observability, and maintain high availability for large-scale distributed systems. This is a hands-on technical position in a dynamic environment, offering the opportunity to tackle complex challenges at the intersection of AI, data center operations, and software reliability.

Key Responsibilities

Maintain and improve the reliability and uptime of xAI’s on-premises and cloud-based data center environments, including high-density GPU clusters for AI training.
Design, implement, and manage monitoring, logging, and alerting systems (e.g., Prometheus, Grafana, PagerDuty).
Develop and maintain infrastructure-as-code (Pulumi, Terraform) and continuous deployment pipelines (Buildkite, ArgoCD).
Participate in on-call rotations, respond to incidents, perform root cause analysis, and drive post-mortem processes.
Analyze system performance, forecast capacity needs, and optimize resource utilization for massive AI/ML workloads.
Collaborate with hardware, networking, and software engineering teams to design and implement resilient, scalable solutions, such as RDMA fabrics and liquid-cooling systems.
Create and maintain documentation and standard operating procedures.
Contribute to the efficiency of AI training pipelines by identifying and mitigating bottlenecks in compute, storage, and networking at unprecedented scales.

Required Qualifications

Bachelor’s degree in Computer Science, Engineering, or a related technical field (or equivalent experience).

5+ years in site reliability engineering, data center operations, or large-scale infrastructure management.

Expert-level knowledge of Kubernetes (on-prem and cloud), infrastructure-as-code tools (Pulumi, Terraform), and CI/CD systems (Buildkite, ArgoCD).

Proficiency in at least one systems programming language (Rust, C++, Go) and strong scripting and automation skills.

Deep understanding of monitoring and observability technologies.

Strong troubleshooting skills across hardware, networking, and distributed software systems.

Proven experience with incident response, including on-call rotations, rapid incident resolution, root cause analysis, and implementation of preventative measures.

Excellent communication and documentation skills, with the ability to share knowledge concisely and accurately.

Preferred Qualifications

Experience supporting AI/ML workloads or high-density compute environments, including large-scale GPU clusters and HPC systems.

Familiarity with data center electrical, cooling, and network systems, such as liquid-cooling and high-bandwidth interconnects.

Certifications in SRE, Kubernetes, or data center operations.

Experience with both on-premises and cloud infrastructure at scale.
