Talent.com
Senior Linux HPC Storage Engineer

ITR • Oak Ridge, TN, US
3 days ago
Job type
  • Full-time
Job description

  • Must be able to work a hybrid work schedule in Oak Ridge, TN
  • Must be eligible for a federal security clearance (US Citizen)

Major Duties / Responsibilities

  • Architect, deploy, and manage large-scale HPC storage systems, including parallel file systems such as Lustre, GPFS/Spectrum Scale, BeeGFS, and WEKA.
  • Design, implement, and operate large-scale Ceph storage clusters for HPC and research workloads, delivering reliable, high-performance object, block, and file storage services.
  • Ensure the availability, performance, scalability, and security of production storage environments.
  • Administer and optimize enterprise storage platforms such as Qumulo and NetApp in support of HPC and research workloads.
  • Design, deploy, and maintain archival storage solutions including Spectra Logic BlackPearl and large-scale tape libraries to ensure long-term data preservation and accessibility.
  • Integrate high-performance, enterprise, and archival storage layers into cohesive tiered storage architectures that balance cost, scalability, and performance for diverse scientific workflows.
  • Leverage automation and monitoring solutions to minimize day-to-day maintenance while identifying opportunities to optimize system performance and management.
  • Collaborate with researchers and technical POCs to support large data workflows and optimize I/O performance for scientific workloads.
  • Automate storage provisioning, monitoring, and maintenance using scripting and configuration management tools.
  • Diagnose and resolve complex storage and I/O-related issues in high-throughput, low-latency HPC environments.
  • Evaluate emerging storage technologies (NVMe, object storage, hierarchical storage management, burst buffers) and contribute to strategic planning for future HPC systems.
  • Work with 24/7 operations staff to streamline monitoring and troubleshooting, significantly reducing the need for off-hours support.
  • Deliver ORNL’s mission by aligning behaviors, priorities, and interactions with our core values of Impact, Integrity, Teamwork, Safety, and Service. Promote equal opportunity by fostering a respectful workplace.

Basic Qualifications

  • A BS degree in computer science, computer engineering, information technology, information systems, science, engineering, or related discipline and 8–12 years of relevant professional experience; or an equivalent combination of education and experience.
  • Master’s degree holders: 7–10 years of relevant experience.
  • PhD holders: 4–6 years of relevant experience.
  • Five (5) or more years managing UNIX/Linux systems.
  • Demonstrated experience managing HPC storage and large-scale enterprise storage systems.
  • Three (3) or more years working with configuration management and automation tools such as Git, Jenkins, Ansible, or Puppet.
  • Proficiency with at least one scripting language (Bash, Python, Perl, etc.).
  • Strong Linux administration and advanced troubleshooting experience.
  • Experience supporting large data systems and/or HPC scientific workloads.
  • Strong desire to innovate and evaluate new technologies for HPC and storage environments.
  • Collaborative approach and ability to become a trusted advisor to research teams.

Preferred Qualifications

  • Active DOE Q, DoD Top Secret, or TS/SCI clearance is strongly preferred.
  • Solid understanding of multiple operating systems and HPC cluster technologies.
  • Experience with Rocky/CentOS/RHEL, Ubuntu, and VMware.
  • Understanding of HPC job schedulers (Slurm) and user support workflows.
  • Experience with container technologies in HPC environments.
  • Experience with multiple system deployment mechanisms (Warewulf, PXE boot, Cobbler, Bright).
  • Experience with GPU clusters (NVIDIA, AMD) for AI/ML and scientific workloads.
  • Deep expertise with high-performance parallel file systems (Lustre, GPFS/Spectrum Scale, BeeGFS, WEKA).
  • Knowledge of storage networking (InfiniBand, NVMe-oF, SAN/NAS architectures).
  • Familiarity with RAID, ZFS, and object storage technologies.
  • Strong background in performance monitoring, benchmarking, and I/O optimization.
  • Experience with monitoring systems such as Grafana, Checkmk, Nagios, Zabbix, and Ganglia.
  • Previous experience working in a government, scientific, or other highly technical environment.
  • Strong documentation skills and ability to prepare web-based documentation.