Slurm Administration & Systems Architecture

Midjourney, Alameda, CA, US
30+ days ago
Job type
  • Full-time
Job description

Overview

We are seeking a highly skilled HPC / AI / ML Cluster Engineer to support the design, deployment, and ongoing operations of large-scale HPC environments powered by Slurm. This role centers on cluster engineering, administration, and performance optimization, with emphasis on GPU-accelerated computing, advanced networking, and workload scheduling. In this role, you will work closely with our researchers, vendors, and partners to manage Slurm clusters that are used for AI / ML workloads.

Responsibilities

Cluster Engineering & Deployment

  • Participate in the design and bring-up of bare-metal HPC / AI / ML environments.
  • Architect compute node definitions (NUMA, GRES GPU topologies, CPU pinning) and Slurm partitioning strategies for diverse workloads.
  • Integrate heterogeneous hardware platforms into cohesive scheduling environments.
  • Develop provisioning and imaging workflows (Ansible, MAAS, cloud-init, CI / CD pipelines) for reproducible cluster build-out.
  • Coordinate communications between vendors, researchers, and other partners during cluster bring-up and operation.

Slurm Management

  • Configure and operate the Slurm Workload Manager.
  • Build custom Slurm plugins and scripts (epilog / prolog, pam_slurm_adopt) to extend functionality and integrate with authentication and monitoring.
  • Manage federated Slurm setups across multi-site or hybrid cloud environments.

System Administration & Monitoring

  • Administer Linux HPC environments, including network configuration, storage integration, and kernel tuning for HPC workloads.
  • Deploy and maintain observability stacks for system health, GPU metrics, and job monitoring.
  • Automate failure detection, node health checks, and job cleanup to ensure high uptime and reliability.
  • Manage security and access control (LDAP / SSSD, VPN, PAM, SSH session auditing).

User & Stakeholder Support

  • Assist cluster users with developing workflows that make efficient use of compute resources.
  • Containerize HPC applications with Docker / Podman / Enroot-Pyxis and integrate GPU-aware runtimes into Slurm jobs.
  • Automate cost accounting and cluster usage reporting.

Qualifications

  • 7+ years of experience in HPC cluster administration and engineering, with deep knowledge of Slurm.
  • Familiarity with common AI / ML software package dependencies and workflows.
  • Expert in Slurm configuration, partition design, QoS / preemption policies, and GRES GPU scheduling.
  • Strong background in Linux system administration, networking, and performance tuning for HPC environments.
  • Hands-on experience with parallel file systems, advanced networking (InfiniBand, RoCE, 100 / 200 GbE), and monitoring stacks.
  • Proficient with automation tools (Ansible, Terraform, CI / CD pipelines) and version control.
  • Demonstrated ability to operate GPU-accelerated clusters at scale.