Senior Site Reliability Engineer GPU Infrastructure
Genmo • San Francisco, California, United States
30+ days ago
Job type
  • Full-time
Job description

We are Genmo, a research lab dedicated to building open, state-of-the-art models for video generation towards unlocking the right brain of AGI. Join us in shaping the future of AI and pushing the boundaries of what's possible in video generation.

What You’ll Do

Own the design and day‑to‑day operation of GPU clusters that train and serve frontier generative models.

Lead production Kubernetes operations: GPU scheduling, cluster upgrades, multi‑cluster federation.

Define and implement Infrastructure‑as‑Code (Terraform, Helm, Ansible) and GitOps workflows with Argo CD or Flux.

Build CI/CD pipelines, automated testing, and rollout strategies for infrastructure changes.

Develop an observability stack (Prometheus, Grafana, OpenTelemetry, eBPF) plus GPU telemetry with NVIDIA DCGM.

Optimize high‑performance networking (InfiniBand/RDMA) and debug performance bottlenecks.

Run and continuously improve the 24×7 on‑call rotation; lead post‑incident reviews.

Partner with researchers and engineers, communicate crisply, and ship with a high‑ownership mindset.

Minimum Qualifications

BS/MS/PhD in CS, EE, or a related field.

3+ years of SRE/DevOps experience in production; 2+ years managing large Kubernetes fleets.

Expert‑level Kubernetes experience.

Proficient in Python, Bash, and IaC tools (Terraform, Helm, Ansible).

Track record of shipping and operating large‑scale infrastructure with high reliability and clear communication.

Nice to Have

Multi‑cluster/multi‑cloud (AWS, GCP, Azure, bare‑metal) production experience.

Hands‑on experience with containerized GPU stacks (nvidia‑container‑toolkit, GPU Operator).

Experience with GPU schedulers such as Slurm or Kueue.

Familiarity with CI/CD tooling (GitHub Actions, BuildKit).

Prior work with distributed training, model‑serving patterns, or other ML/GPU workloads.

Machine‑learning depth is a plus, not a prerequisite. We’ll help you level up if needed.

Genmo is an Equal Opportunity Employer. Candidates are evaluated without regard to age, race, color, religion, sex, disability, national origin, sexual orientation, veteran status, or any other characteristic protected by federal or state law. Genmo, Inc. is an E-Verify company, and you may review the Notice of E-Verify Participation and the Right to Work posters in English and Spanish.
