Talent.com
Senior SRE & Infra Engineer (GPU Cluster Platform Reliability & Infrastructure Engineer)

Macpower Digital Assets Edge
San Francisco, CA, United States
9 days ago
Job type
  • Full-time
Job description

This hybrid role spans platform reliability and infrastructure engineering. You'll be instrumental in ensuring high availability, fault tolerance, and performance across GPU cluster environments for both internal research and external customers. Responsibilities include automating GPU cluster onboarding; enhancing monitoring, logging, and security systems; and developing new backend features.

Required Skills and Certifications:

  • Proven experience with monitoring tools (e.g., Prometheus, Grafana) and incident management practices.
  • Strong skills in infrastructure automation with Ansible, Terraform, or similar.
  • Deep understanding of logging frameworks, alerting systems, and proactive monitoring solutions.
  • Proficiency in Python for developing automation scripts, REST APIs, and backend support tools.
  • Hands-on experience with Kubernetes and cloud platforms (GCP preferred).
  • Knowledge of high-performance networking and real-time systems.
