Talent.com
Site Reliability Engineer - Supercomputing

xAI, San Francisco, CA, United States
19 hours ago
Job type
  • Full-time
Job description

We are seeking a talented Site Reliability Engineer (SRE) to join our SuperComputing team. In this role, you'll ensure the reliability, scalability, and performance of our high-performance computing (HPC) infrastructure, powering cutting-edge AI research. You'll collaborate with cross-functional teams to build and maintain systems that support massive-scale data processing and model training. You'll ensure Grok stays reliable for millions while inventing new approaches at the intersection of SRE and cutting-edge AI to help define the future of AI reliability engineering.

What You'll Do

  • Design, implement, and maintain robust, scalable infrastructure for supercomputing environments.
  • Monitor and optimize system performance, ensuring high availability and minimal downtime.
  • Develop automation tools and scripts to streamline operations and improve system reliability.
  • Troubleshoot complex issues across distributed systems, networks, and storage solutions.
  • Collaborate with AI researchers and engineers to support compute-intensive workloads.
  • Implement security best practices to protect sensitive data and infrastructure.
  • Contribute to capacity planning and disaster recovery strategies.
  • Participate in an on-call rotation to ensure 24/7 system reliability.

Ideal Experiences

  • Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent experience).
  • 3+ years of experience in site reliability engineering, DevOps, or systems engineering.
  • Proficiency in Linux system administration and scripting (e.g., Python, Bash).
  • Experience with containerization (e.g., Docker, Kubernetes) and cloud platforms (e.g., AWS, GCP, Azure).
  • Strong understanding of networking, distributed systems, and storage technologies.
  • Familiarity with HPC environments, GPU clusters, or large-scale data processing.
  • Excellent problem-solving skills and ability to work in a fast-paced, dynamic environment.
  • Strong communication skills and a collaborative mindset.
  • Bonus: Experience with Infrastructure as Code (e.g., Terraform, Ansible) or monitoring tools (e.g., Prometheus, Grafana).

Location

This role is based in the Bay Area (San Francisco and Palo Alto). Candidates are expected to be located near the Bay Area or open to relocation.

Tech Stack

  • Languages: Rust, Python, C++, Golang

Interview Process

  • Application Review: Submit your CV and a statement of exceptional work. Our team will review your application to assess fit.
  • Phone Interview (45 minutes): A brief conversation with a team member to discuss your background, key accomplishments, and motivation.
  • Main Interview Process
    • 1 Coding Assessment: Solve problems in Rust, Python, C++, or Golang.
    • 1 Skill-Specific Technical Interview: Demonstrate practical skills in a live problem-solving session.
    • 1 SRE / System Case Study: Analyze and solve a complex, real-world system design or operational problem, demonstrating your technical expertise, problem-solving skills, and ability to optimize system reliability and performance.
    • Project Deep-Dive: Present your past exceptional work to a small audience.

Annual Salary Range

$180,000 - $440,000 USD

Benefits

Base salary is just one part of our total rewards package at xAI, which also includes equity, comprehensive medical, vision, and dental coverage, access to a 401(k) retirement plan, short & long-term disability insurance, life insurance, and various other discounts and perks.
