Talent.com
Site Reliability Engineer - Storage
xAI • Palo Alto, California, United States
30+ days ago
Job type
  • Full-time
Job description

About xAI

xAI’s mission is to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge. Our team is small, highly motivated, and focused on engineering excellence. This organization is for individuals who appreciate challenging themselves and thrive on curiosity. We operate with a flat organizational structure. All employees are expected to be hands-on and to contribute directly to the company’s mission. Leadership is given to those who show initiative and consistently deliver excellence. Work ethic and strong prioritization skills are important. All engineers and researchers are expected to have strong communication skills. They should be able to concisely and accurately share knowledge with their teammates.

About the role

As a Site Reliability Engineer on the storage team, you will play a pivotal role in designing, building, and operating exascale storage systems that manage our AI research data scalably and reliably across multiple regions. The core responsibility of this role is to make sure our heterogeneous storage systems, both on-prem and in the cloud, are reliable and performant.

We’re seeking engineers with expertise in exascale data management systems or distributed filesystems to join our mission-driven team.

What you’ll do

  • Develop and optimize software to manage exascale data, enabling efficient and reliable access for xAI researchers working on advanced AI models.
  • Enhance the reliability, performance, and cost-effectiveness of xAI’s storage infrastructure to support large-scale AI research workloads.
  • Collaborate closely with researchers to understand their data use cases and tailor storage solutions to meet their needs.
  • Implement robust security measures to safeguard critical datasets, ensuring data integrity and confidentiality.

Ideal Experience

You’d be an exceptional candidate if you possess some (or all) of the following:

  • Writing scalable, high-performance code in Rust or Go for storage-related applications or tooling.
  • Managing storage infrastructure with IaC tools like Pulumi, Terraform, or Ansible.
  • Past experience working with storage vendors, facilitating partnership alignment, and integrating their tooling within xAI’s infrastructure.
  • Familiarity with Kubernetes storage primitives (e.g., Persistent Volumes, CSI drivers) and integrating storage with containerized workloads.
  • Bonus: Experience with AI/ML data pipelines, including handling large datasets for training and inference.

Tech Stack

  • Kubernetes
  • Pulumi
  • Rust and Go

Interview Process

After submitting your application, the team reviews your CV and statement of exceptional work. If your application passes this stage, you will be invited to a 45-minute interview (“phone interview”) during which a member of our team will ask some basic questions. If you clear the initial phone interview, you will enter the main process, which consists of four technical interviews:

  • Coding assessment in Python, Golang, or Rust.
  • Systems hands-on: Demonstrate practical skills in a live problem-solving session.
  • Coding assessment or system design discussion based on the candidate's background.
  • Project deep-dive: Present your past exceptional work to a small audience.

Every application is reviewed by a member of our technical team. All interviews will be conducted via Google Meet.

We do not condone usage of AI in interviews and have tools to detect AI usage.

Benefits

Base salary is just one part of our total rewards package at xAI, which also includes equity, comprehensive medical, vision, and dental coverage, access to a 401(k) retirement plan, short & long-term disability insurance, life insurance, and various other discounts and perks.

Annual Salary Range

$180,000 - $440,000 USD

xAI is an equal opportunity employer.
