Role : Senior HPC Administrator (High Performance Computational and Data Ecosystem)
Location : New York, NY (Onsite)
Job Type : Full Time Role
Job description :
The Senior HPC Administrator, High Performance Computational and Data Ecosystem, is responsible for a computational and data science ecosystem for researchers at Mount Sinai. This ecosystem includes high-performance computing (HPC) systems, clinical research databases, and a software development infrastructure for local and national projects. To meet Sinai's scientific and clinical goals, the Senior HPC Administrator has a strong technical understanding of computational, data and software development systems, along with a strong focus on customer service for researchers. The Senior HPC Administrator is an expert troubleshooter and productive team member who independently leads projects to effective and efficient completion with little to no supervision. This position reports to the Director for Computational & Data Ecosystem in Scientific Computing. Specific responsibilities are listed below.
Responsibilities
- Designs, deploys and maintains Scientific Computing's computational and data science ecosystem, including ~30,000 cores with high-bandwidth, low-latency interconnects, GPUs, large shared-memory nodes, databases, scientific workflows and 30+ petabytes of storage across production, clinical data warehouse and software development environments.
- Leads the troubleshooting, isolation and resolution of all technical issues (including application, system, hardware, software, and network). Actively monitors the systems.
- Maintains, tunes and manages computational, data, cloud technologies and workflow systems for ISMMS researchers, scientists and their external collaborators. Defines and deploys a comprehensive computational and data vision. Identifies and communicates system advantages / disadvantages and tradeoffs.
- Designs, develops and implements system administration tasks, including hardware and software configuration, configuration management, system monitoring (including the development and maintenance of regression tests), usage reporting, system performance (file systems, scheduler, interconnect, high availability, etc.), security, networking and metrics.
- Collaborates effectively with research and hospital system IT, compliance, HIPAA, security and other departments to ensure compliance with all regulations and Sinai policies.
- Participates in the integration of HPC resources with laboratory equipment such as sequencers, and with clinical and research data resources and systems. Incorporates and links data and compute resources.
- Researches, deploys, optimizes and actively monitors resource management and scheduling software and policies. Designs, tunes, manages and upgrades parallel file systems, storage and data-oriented resources.
- Researches, deploys and manages security infrastructure, including development of policies and procedures.
- Maintains all necessary aspects of HPC in accordance with best practices. Develops and implements backup policies.
- Prepares and manages budgets for hardware, software and maintenance. Participates in chargeback / fee recovery analysis and provides suggestions to make operations sustainable.
- Assists in developing and writing system design for research proposals. Creates and provides clear documentation.
- Works effectively and productively with other team members within the group and across Mount Sinai.
- Performs related duties as assigned or requested.
- Provides after-hours support for critical system and production issues.
- Answers and resolves user tickets.
Qualifications
- Bachelor's degree in computer science, engineering or another scientific field; Master's or PhD preferred
- 8+ years (more preferred) of progressive HPC system administration and operations experience (preferably in a Red Hat / CentOS Linux, batch HPC cluster environment)
- Must be an expert troubleshooter, a team player and customer focused
- Experience with a job scheduler such as LSF or Slurm, and with parallel file systems and storage
- Experience with networking and security
- Experience with configuration management systems such as xCAT, Puppet and / or Ansible
- Experience with databases and web services
- Experience with InfiniBand and Gigabit Ethernet
- Experience in an academic or research community environment
- Scripting and programming experience
- Experience with cloud computing
- Ability to multitask effectively in a dynamic environment
- Excellent communication skills, analytical ability, strong judgment and management skills, and the ability to work effectively as a liaison between research and technology teams
- Strong written, oral, and interpersonal communication skills
Preferred Experience
- Advanced degree
- Experience with GPFS, LSF, TSM, and InfiniBand and Ethernet networking
- Experience with databases and web services is highly preferred