Talent.com
Site Reliability Engineer — GPU Infrastructure

Genmo • San Francisco, CA, United States
30+ days ago
Job type
  • Full-time
Job description

Join Genmo, a research lab dedicated to building open, state‑of‑the‑art models for video generation. We are looking for a Site Reliability Engineer to build and operate GPU infrastructure that powers our generative models.

This is a contract‑to‑hire position.

What You’ll Do

  • Own design and day‑to‑day operation of GPU clusters that train and serve frontier generative models.
  • Lead production Kubernetes operations: GPU scheduling, cluster upgrades, multi‑cluster federation.
  • Define and implement Infrastructure‑as‑Code (Terraform, Helm, Ansible) and GitOps workflows with Argo CD or Flux.
  • Build CI/CD pipelines, automated testing, and rollout strategies for infra changes.
  • Develop an observability stack (Prometheus, Grafana, OpenTelemetry, eBPF) plus GPU telemetry with NVIDIA DCGM.
  • Optimize high‑performance networking (InfiniBand/RDMA) and debug performance bottlenecks.
  • Run and continuously improve the 24×7 on‑call rotation; lead post‑incident reviews.
  • Partner with researchers and engineers, communicate crisply, and ship with a high‑ownership mindset.

Minimum Qualifications

  • BS/MS/PhD in CS, EE, or related field.
  • 3+ years of SRE/DevOps experience in production; 2+ years managing large Kubernetes fleets.
  • Expert‑level Kubernetes experience.
  • Hands‑on with containerized GPU stacks (nvidia‑container‑toolkit, GPU Operator).
  • Experience with GPU schedulers such as Slurm or Kueue.
  • Proficient in Python, Bash and IaC tools (Terraform, Helm, Ansible).
  • Track record of shipping and operating large‑scale infrastructure with high reliability and clear communication.

Nice to Have

  • Multi‑cluster/multi‑cloud (AWS, GCP, Azure, bare‑metal) production experience.
  • Familiarity with CI/CD tooling (GitHub Actions, BuildKit).
  • Prior work with distributed training, model‑serving patterns, or other ML/GPU workloads.
  • Machine‑learning depth is a plus—not a prerequisite. We’ll help you level up if needed.

Genmo is an Equal Opportunity Employer. Candidates are evaluated without regard to age, race, color, religion, sex, disability, national origin, sexual orientation, veteran status, or any other characteristic protected by federal or state law. Genmo, Inc. is an E‑Verify company and you may review the Notice of E‑Verify Participation and the Right to Work posters in English and Spanish.
