Talent.com
Site Reliability Engineer - Kubernetes Platform

xAI, Palo Alto, CA, US
7 days ago
Job type
  • Full-time
Job description

About xAI

xAI's mission is to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge. Our team is small, highly motivated, and focused on engineering excellence. This organization is for individuals who appreciate challenging themselves and thrive on curiosity. We operate with a flat organizational structure. All employees are expected to be hands-on and to contribute directly to the company's mission. Leadership is given to those who show initiative and consistently deliver excellence. Work ethic and strong prioritization skills are important. All engineers are expected to have strong communication skills. They should be able to concisely and accurately share knowledge with their teammates.

About the Role

We are seeking a highly skilled Senior Site Reliability Storage Engineer to join our mission-driven team, focusing on designing, building, and optimizing Kubernetes clusters across multiple regions. In this role, you will leverage your expertise in Kubernetes orchestration and distributed systems to enhance the reliability, performance, and cost-effectiveness of xAI's infrastructure. You will collaborate closely with engineering teams to deliver robust, scalable solutions that support large-scale AI workloads. The ideal candidate is passionate about automation, observability, and ensuring the integrity of critical systems in a fast-paced, innovative environment.

Responsibilities

  • Develop and optimize software to provision and manage Kubernetes clusters on-premises, enabling xAI to scale efficiently.
  • Enhance the reliability, performance, and cost-effectiveness of Kubernetes infrastructure to support large-scale AI and application workloads.
  • Collaborate with xAI engineers to understand workload requirements and design tailored Kubernetes solutions to meet their needs.
  • Implement robust observability, monitoring, and security practices to ensure the integrity, availability, and confidentiality of critical systems.
  • Manage storage infrastructure using Infrastructure-as-Code (IaC) tools such as Pulumi, Terraform, or Ansible.
  • Drive system reliability through incident management, postmortems, and the definition of clear SLAs and SLOs.
  • Contribute to the Kubernetes stack, including expertise in CNI, CRI, CSI, and related components.
  • This is an in-person role based in Palo Alto, CA, with up to 25% travel required.

Required Qualifications

  • 5+ years of experience as a Site Reliability Engineer or similar role, with a focus on building and maintaining reliable, scalable systems.
  • Proven expertise in managing Kubernetes infrastructure using tools like Cluster API (CAPI) and kubeadm.
  • Proficiency in managing storage infrastructure with IaC tools such as Pulumi, Terraform, or Ansible.
  • Deep understanding of the Kubernetes stack, including CNI, CRI, CSI, and related components.
  • Demonstrated ability to improve system reliability through incident management, postmortems, and defining SLAs/SLOs.

Preferred Qualifications

  • Experience with high-traffic web or mobile application workloads, including optimizing Kubernetes for large-scale deployments.
  • Familiarity with chaos engineering, capacity planning, or similar practices for ensuring system resilience.
  • Proficiency with tools such as Kyverno, ArgoCD, or Go programming for infrastructure automation.
  • Strong sense of ownership, curiosity, and enthusiasm for tackling complex technical challenges.
  • Passion for problem-solving and a proactive drive to deliver impactful results.
  • A sense of adventure and humor to navigate challenges with a positive mindset.

Annual Salary Range

$180,000 - $440,000 USD

Benefits

Base salary is just one part of our total rewards package at xAI, which also includes equity, comprehensive medical, vision, and dental coverage, access to a 401(k) retirement plan, short & long-term disability insurance, life insurance, and various other discounts and perks.

xAI is an equal opportunity employer.
