Talent.com
Senior Cluster Site Reliability Engineer

Jobgether • CA, US
1 day ago
Job type
  • Full-time
  • Remote
Job description

This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Senior Cluster Site Reliability Engineer in California (USA).

This role is designed for a highly skilled engineer to ensure the reliability, scalability, and performance of critical research compute clusters. You will maintain and optimize both on-premises and cloud infrastructure while implementing automation and SRE best practices. Working closely with engineering and research teams, you will solve real-time operational issues, drive systemic improvements, and build observability frameworks to monitor cluster health. Your work will directly impact cutting-edge machine learning research, enabling teams to operate efficiently at scale. This position offers the opportunity to apply your technical expertise to complex distributed systems and HPC environments while collaborating with a high-performing, innovative team.

Accountabilities:

  • Act as a first responder to cluster outages or performance issues, triaging and resolving urgent problems efficiently.
  • Maintain high uptime and define, track, and report on SLAs to quantify reliability.
  • Diagnose recurring systemic issues and engineer long-term solutions in collaboration with engineering teams.
  • Develop and maintain observability and monitoring frameworks, including custom metrics for cluster health.
  • Support policy design for fair cluster usage and implement enforcement mechanisms for research teams.
  • Forecast cluster growth, optimize scaling strategies, and improve operational efficiency across cost, performance, and usability dimensions.
  • Collaborate with software and research teams to support distributed computing and machine learning workflows.

Requirements

  • 5+ years of experience in SRE, DevOps, or similar senior engineering roles.
  • Expertise in HPC/batch compute frameworks (Slurm, Kueue, AWS/GCP Batch) and/or ML training systems (Kubeflow, MLflow, Horovod).
  • Proficiency in scripting (Python, Ruby, or similar) and infrastructure-as-code/configuration management (Terraform, Ansible).
  • Hands-on experience with cloud platforms (AWS or GCP) and distributed storage systems (Lustre, Ceph, S3).
  • Strong familiarity with observability stacks (Prometheus, Grafana, Loki, ELK, OpenTelemetry).
  • Bachelor’s degree in Computer Science or equivalent experience.
  • Systematic, automation-driven mindset with a focus on reliability engineering.
Preferred qualifications:

  • Experience with HPC frameworks, Kubernetes-based job orchestrators, and distributed computing frameworks (Ray, Dask, Spark).
  • Knowledge of ML frameworks (PyTorch, TensorFlow, JAX, Horovod, DeepSpeed).
  • Experience with hybrid or on-prem/cloud environments and HPC networking (InfiniBand, RDMA).
  • Strong security/IAM understanding, including Zero Trust and cloud IAM.
  • Proficiency with containerization (Docker, Podman, Singularity) for HPC/batch compute environments.
Benefits:

  • Base salary: $205,000 – $235,000 (depending on experience and location).
  • Comprehensive benefits package: medical, dental, and vision coverage; life and AD&D insurance.
  • Paid time off: 20 vacation days and 9 sick days annually.
  • Retirement plan: 401(k) with company match.
  • Opportunities to work on cutting-edge HPC and ML infrastructure at scale.
Jobgether is a Talent Matching Platform that partners with companies worldwide to efficiently connect top talent with the right opportunities through AI-driven job matching.

When you apply, your profile goes through our AI-powered screening process designed to identify top talent efficiently and fairly.

🔍 Our AI evaluates your CV and LinkedIn profile thoroughly, analyzing your skills, experience, and achievements.

📊 It compares your profile to the job’s core requirements and past success factors to determine your match score.

🎯 Based on this analysis, we automatically shortlist the 3 candidates with the highest match to the role.

🧠 When necessary, our human team may perform an additional manual review to ensure no strong profile is missed.

The process is transparent, skills-based, and free of bias, focusing solely on your fit for the role. Once the shortlist is completed, we share it directly with the company that owns the job opening. The final decision and next steps (such as interviews or additional assessments) are then made by their internal hiring team.

Thank you for your interest!
