Senior Site Reliability Engineer GPU Infrastructure

Genmo
San Francisco, California, United States
30+ days ago
Job type
  • Full-time
Job description

We are Genmo, a research lab dedicated to building open, state-of-the-art models for video generation towards unlocking the right brain of AGI. Join us in shaping the future of AI and pushing the boundaries of what's possible in video generation.

What You’ll Do

Own the design and day‑to‑day operation of GPU clusters that train and serve frontier generative models.

Lead production Kubernetes operations: GPU scheduling, cluster upgrades, multi‑cluster federation.

Define and implement Infrastructure‑as‑Code (Terraform, Helm, Ansible) and GitOps workflows with Argo CD or Flux.

Build CI/CD pipelines, automated testing, and rollout strategies for infrastructure changes.

Develop an observability stack (Prometheus, Grafana, OpenTelemetry, eBPF) plus GPU telemetry with NVIDIA DCGM.

Optimize high‑performance networking (InfiniBand/RDMA) and debug performance bottlenecks.

Run and continuously improve the 24×7 on‑call rotation; lead post‑incident reviews.

Partner with researchers and engineers, communicate crisply, and ship with a high‑ownership mindset.

Minimum Qualifications

BS/MS/PhD in CS, EE, or a related field.

3+ years of SRE/DevOps experience in production; 2+ years managing large Kubernetes fleets.

Expert‑level Kubernetes experience.

Proficient in Python, Bash, and IaC tools (Terraform, Helm, Ansible).

Track record of shipping and operating large‑scale infrastructure with high reliability and clear communication.

Nice to Have

Multi‑cluster/multi‑cloud (AWS, GCP, Azure, bare‑metal) production experience.

Hands‑on experience with containerized GPU stacks (nvidia‑container‑toolkit, GPU Operator).

GPU schedulers such as Slurm or Kueue.

Familiarity with CI/CD tooling (GitHub Actions, BuildKit).

Prior work with distributed training, model‑serving patterns, or other ML/GPU workloads.

Machine‑learning depth is a plus—not a prerequisite. We’ll help you level up if needed.

Genmo is an Equal Opportunity Employer. Candidates are evaluated without regard to age, race, color, religion, sex, disability, national origin, sexual orientation, veteran status, or any other characteristic protected by federal or state law. Genmo, Inc. is an E-Verify company and you may review the Notice of E-Verify Participation and the Right to Work posters in English and Spanish.
