Talent.com
Site Reliability Engineer — GPU Infrastructure

Genmo, San Francisco, CA, United States
30+ days ago
Job type
  • Full-time
Job description

Join Genmo, a research lab dedicated to building open, state‑of‑the‑art models for video generation. We are looking for a Site Reliability Engineer to build and operate GPU infrastructure that powers our generative models.

This is a contract‑to‑hire position.

What You’ll Do

  • Own design and day‑to‑day operation of GPU clusters that train and serve frontier generative models.
  • Lead production Kubernetes operations: GPU scheduling, cluster upgrades, and multi‑cluster federation.
  • Define and implement Infrastructure‑as‑Code (Terraform, Helm, Ansible) and GitOps workflows with Argo CD or Flux.
  • Build CI/CD pipelines, automated testing, and rollout strategies for infrastructure changes.
  • Develop an observability stack (Prometheus, Grafana, OpenTelemetry, eBPF) plus GPU telemetry with NVIDIA DCGM.
  • Optimize high‑performance networking (InfiniBand/RDMA) and debug performance bottlenecks.
  • Run and continuously improve the 24×7 on‑call rotation; lead post‑incident reviews.
  • Partner with researchers and engineers, communicate crisply, and ship with a high‑ownership mindset.

Minimum Qualifications

  • BS/MS/PhD in CS, EE, or a related field.
  • 3+ years of SRE/DevOps experience in production; 2+ years managing large Kubernetes fleets.
  • Expert‑level Kubernetes experience.
  • Hands‑on with containerized GPU stacks (nvidia‑container‑toolkit, GPU Operator).
  • Experience with GPU schedulers such as Slurm or Kueue.
  • Proficient in Python, Bash, and IaC tools (Terraform, Helm, Ansible).
  • Track record of shipping and operating large‑scale infrastructure with high reliability and clear communication.

Nice to Have

  • Multi‑cluster/multi‑cloud (AWS, GCP, Azure, bare‑metal) production experience.
  • Familiarity with CI/CD tooling (GitHub Actions, BuildKit).
  • Prior work with distributed training, model‑serving patterns, or other ML/GPU workloads.
  • Machine‑learning depth is a plus—not a prerequisite. We’ll help you level up if needed.

Genmo is an Equal Opportunity Employer. Candidates are evaluated without regard to age, race, color, religion, sex, disability, national origin, sexual orientation, veteran status, or any other characteristic protected by federal or state law. Genmo, Inc. is an E‑Verify company, and you may review the Notice of E‑Verify Participation and the Right to Work posters in English and Spanish.
