Slurm Administration & Systems Architecture
Midjourney • Fremont, CA, United States
26 days ago
Job type
  • Full-time
Job description

Overview

We are seeking a highly skilled HPC / AI / ML Cluster Engineer to support the design, deployment, and ongoing operations of large-scale HPC environments powered by Slurm. This role centers on cluster engineering, administration, and performance optimization, with emphasis on GPU-accelerated computing, advanced networking, and workload scheduling. In this role, you will work closely with our researchers, vendors, and partners to manage Slurm clusters that are used for AI / ML workloads.

Responsibilities

Cluster Engineering & Deployment

  • Participate in the design and bring-up of bare-metal HPC / AI / ML environments.
  • Architect compute node definitions (NUMA, GRES GPU topologies, CPU pinning) and Slurm partitioning strategies for diverse workloads.
  • Integrate heterogeneous hardware platforms into cohesive scheduling environments.
  • Develop provisioning and imaging workflows (Ansible, MAAS, cloud-init, CI / CD pipelines) for reproducible cluster build-out.
  • Coordinate communications between vendors, researchers, and other partners during cluster bring-up and operation.

Slurm Management

  • Configure and operate the Slurm Workload Manager.
  • Build custom Slurm plugins and scripts (epilog / prolog, pam_slurm_adopt) to extend functionality and integrate with authentication and monitoring.
  • Manage federated Slurm setups across multi-site or hybrid cloud environments.

System Administration & Monitoring

  • Administer Linux HPC environments, including network configuration, storage integration, and kernel tuning for HPC workloads.
  • Deploy and maintain observability stacks for system health, GPU metrics, and job monitoring.
  • Automate failure detection, node health checks, and job cleanup to ensure high uptime and reliability.
  • Manage security and access control (LDAP / SSSD, VPN, PAM, SSH session auditing).

User & Stakeholder Support

  • Assist cluster users with developing workflows that make efficient use of compute resources.
  • Containerize HPC applications with Docker / Podman / Enroot-Pyxis and integrate GPU-aware runtimes into Slurm jobs.
  • Automate cost accounting and cluster usage reporting.

Qualifications

  • 7+ years of experience in HPC cluster administration and engineering, with deep knowledge of Slurm.
  • Familiarity with common AI / ML software package dependencies and workflows.
  • Expert in Slurm configuration, partition design, QoS / preemption policies, and GRES GPU scheduling.
  • Strong background in Linux system administration, networking, and performance tuning for HPC environments.
  • Hands-on experience with parallel file systems, advanced networking (InfiniBand, RoCE, 100 / 200 GbE), and monitoring stacks.
  • Proficient with automation tools (Ansible, Terraform, CI / CD pipelines) and version control.
  • Demonstrated ability to operate GPU-accelerated clusters at scale.