JOB DESCRIPTION
Required Skills:
- Deep understanding of hardware designs and subsystems (BMC, PCIe, CPU, GPU, etc.)
- Proven experience with qualification of hardware designs for production release (SKU Qual)
- Experience with testing component subsystems for use in existing SKUs (Component Qual)
- Deep Linux systems experience including troubleshooting network interfaces
- Developing and applying configuration management, security best practices, and monitoring and alerting
- Experience with firmware testing and deployment (Firmware Qual)
- Strong automation mindset
- Expert knowledge of one or more orchestration tools such as Salt, Chef, Ansible, or Puppet, and strong Python skills
- Strong communication skills - your job will involve writing detailed documentation for others to pick up or leading knowledge-sharing sessions with operations teams
Bonus Skills Include:
- Hands-on experience in High Performance Computing (HPC) clustered environments from Nvidia or AMD
- Experience in performing automated wide-scale testing on NCCL or other frameworks
- Hands-on experience in qualification automation with specific focus on developing testing within an automation framework for hands-free qualification
What You'll Be Working On:
- Onsite support of our hardware qualification efforts in NYC3 and SFO2
- Hardware qualification of new server SKUs for Compute and GPU Hypervisor, Storage, and Infrastructure server hardware
- Hardware validation against design targets (functional and performance related)
- Hardware reconfiguration to support different testing efforts (changes to server components)
- Troubleshooting hardware integration with the platform operational tooling (onboarding)
- Firmware validation and qualification
- Performance testing, analysis, and monitoring
- Firmware, BIOS, Kernel upgrades and testing