HPC Performance and Validation Engineer

NorthMark Strategies • Dallas, TX, United States
1 day ago
Contract type
  • Full-time
Job description

The Company

NorthMark Compute & Cloud (NMC²) is backed by dedicated leadership and investment, with a clear mission as it operates at the bleeding edge of technology. Its goal is to scale and enhance the high-performance computing (HPC) and cloud infrastructure that supports its clients' research, production, and delivery, enabling breakthroughs that shape the industries of tomorrow. Its engineers build critical infrastructure to eliminate friction in scientific research, simulations, analysis, and decision-making, accelerating discovery and driving faster innovation.

The Position

As an HPC Performance and Validation Engineer at NMC², you will take ownership of the validation and optimization of our HPC CPU and GPU calc farms. This critical role involves developing a validation and performance baselining framework that ensures system readiness for AI/ML and HPC workloads across multiple architectures. You will provide continuous performance benchmarking, real-time observability, and long-term strategic readiness. You will drive the implementation of advanced tooling and frameworks, maintaining an infrastructure that is crucial to our cutting-edge research efforts. You will be accountable for providing data-driven performance metrics to support architectural design choices as we continue to scale our datacenter footprint globally. We are looking for someone with deep technical expertise in compute, storage or networking optimization and performance engineering who can develop solutions that scale with our growing infrastructure. This role demands a forward-thinking engineer who can anticipate industry trends and adopt emerging architectures and strategies to keep NMC² at the forefront of innovation.

Responsibilities:

  • Architecting and implementing a validation framework to certify the readiness and utilization of GPU nodes across a large, distributed HPC environment
  • Defining methodologies to continually assess performance and optimise infrastructure across AI/ML workloads
  • Developing and executing comprehensive performance testing using industry-standard and customer-specific benchmarks, ensuring optimal performance across HPC compute, storage and networking
  • Contributing to research reports that describe the benchmarking findings and evaluate overall hardware performance and efficiency
  • Leading efforts to identify, debug and resolve bottlenecks in system performance
  • Building robust, scalable tools for automated validation and testing, utilising Python, Go, Kubernetes and CI/CD pipelines to streamline continuous validation and benchmarking processes
  • Implementing monitoring solutions using Prometheus, Grafana and other modern monitoring technologies to track performance metrics and the real-time health of the cluster
  • Defining and implementing best practices for continuous performance validation, ensuring that the infrastructure remains reliable and efficient as new technologies emerge
  • Staying informed on industry trends and advancements to ensure long-term strategic alignment
  • Working cross-functionally with engineering, infrastructure and research teams to align validation efforts with the broader business objectives, ensuring that the platform meets evolving research demands

Requirements:

  • Accelerator performance experience, including profiling and tuning with large-scale GPU clusters
  • In-depth understanding of NVIDIA ClusterKit, Nsight and Validation Suite, MLPerf and DCGM tools for GPUs and DPUs
  • Networking and storage performance experience, including profiling and optimisation with NVIDIA ClusterKit, iPerf or equivalent across InfiniBand / RoCE network implementations
  • System benchmarking experience across Linux and familiarity with the Phoronix Test Suite or equivalent
  • Experience with HPC workloads across distributed global locations, bringing data-driven performance metrics to complement key architectural decisions
  • Strong proficiency in developing automation tools and microbenchmarking frameworks for validation using Python, Go, and Kubernetes in an Ubuntu Linux environment
  • Expertise with key monitoring platforms including OTEL, Prometheus, ELK and Grafana, and in defining and implementing the overall observability strategy for HPC validation and performance monitoring
  • A deep understanding of emerging technologies, architectures and strategies, with the ability to assess their potential impact on infrastructure and adopt them as part of a long-term plan
  • Proven ability to lead complex technical projects, influence decisions and engage with stakeholders across technical and research teams