Talent.com
Slurm Administration & Systems Architecture

Midjourney
Hayward, CA, United States
30+ days ago
Job type
  • Full-time
Job description

Overview

We are seeking a highly skilled HPC / AI / ML Cluster Engineer to support the design, deployment, and ongoing operations of large-scale HPC environments powered by Slurm. This role centers on cluster engineering, administration, and performance optimization, with emphasis on GPU-accelerated computing, advanced networking, and workload scheduling. In this role, you will work closely with our researchers, vendors, and partners to manage Slurm clusters that are used for AI / ML workloads.

Responsibilities

Cluster Engineering & Deployment

  • Participate in the design and bring-up of bare-metal HPC / AI / ML environments.
  • Architect compute node definitions (NUMA, GRES GPU topologies, CPU pinning) and Slurm partitioning strategies for diverse workloads.
  • Integrate heterogeneous hardware platforms into cohesive scheduling environments.
  • Develop provisioning and imaging workflows (Ansible, MAAS, cloud-init, CI / CD pipelines) for reproducible cluster build-out.
  • Coordinate communications between vendors, researchers, and other partners during cluster bring-up and operation.

Slurm Management

  • Configure and operate the Slurm Workload Manager.
  • Build custom Slurm plugins and scripts (epilog / prolog, pam_slurm_adopt) to extend functionality and integrate with authentication and monitoring.
  • Manage federated Slurm setups across multi-site or hybrid cloud environments.

System Administration & Monitoring

  • Administer Linux HPC environments, including network configuration, storage integration, and kernel tuning for HPC workloads.
  • Deploy and maintain observability stacks for system health, GPU metrics, and job monitoring.
  • Automate failure detection, node health checks, and job cleanup to ensure high uptime and reliability.
  • Manage security and access control (LDAP / SSSD, VPN, PAM, SSH session auditing).

User & Stakeholder Support

  • Assist cluster users with developing workflows that make efficient use of compute resources.
  • Containerize HPC applications with Docker / Podman / Enroot-Pyxis and integrate GPU-aware runtimes into Slurm jobs.
  • Automate cost accounting and cluster usage reporting.

Qualifications

  • 7+ years of experience in HPC cluster administration and engineering, with deep knowledge of Slurm.
  • Familiarity with common AI / ML software package dependencies and workflows.
  • Expert in Slurm configuration, partition design, QoS / preemption policies, and GRES GPU scheduling.
  • Strong background in Linux system administration, networking, and performance tuning for HPC environments.
  • Hands-on experience with parallel file systems, advanced networking (InfiniBand, RoCE, 100 / 200 GbE), and monitoring stacks.
  • Proficient with automation tools (Ansible, Terraform, CI / CD pipelines) and version control.
  • Demonstrated ability to operate GPU-accelerated clusters at scale.