Talent.com
Platform Reliability & Lab Support Engineer

Sustainable Talent • Santa Clara, CA, United States
2 days ago
Job type
  • Full-time
Job description

Sustainable Talent is partnering with NVIDIA, a global leader that has been transforming computer graphics, PC gaming, and accelerated computing for over 25 years.

We are looking for a Platform Reliability & Lab Support Engineer to support our client's dynamic team responsible for maintaining and optimizing our Colossus cloud infrastructure, including data centers and labs. This is a W-2 full-time contract based in Santa Clara, CA. We offer competitive pay of $70 / hr to $80 / hr, based on factors such as experience, education, and location, and provide full benefits, PTO, and an amazing company culture!

The ideal candidate will have a strong technical background, excellent problem-solving skills, and a passion for ensuring the reliability and efficiency of our infrastructure. In this role, you will face the challenge of providing a test bed for our developers to test software on various NVIDIA hardware before release. You will also collaborate with Infrastructure Engineers, install and maintain Windows / Linux platforms, and use creativity to find solutions. We expect things to break in these DCs / labs, because the software is mostly low-level device drivers, and bugs in them do break boards and GPUs. Our labs run more than 100,000 tests per day and are part of a DevOps pipeline that needs constant supervision, tracking, monitoring, and break-fix.

What you'll be doing:

  • Assist in the installation, configuration, and deployment of new hardware and software components.
  • Conduct regular inspections and audits of infrastructure components to identify and address potential issues proactively.
  • Collaborate with cross-functional teams to implement and test new technologies and solutions.
  • Document maintenance activities, troubleshooting procedures, and system configurations.
  • Participate in on-call rotation and respond to emergency situations as needed.
  • Manage labs and data centers using DCIM tools, spreadsheets, and task-tracking tools.
  • Define standards in labs to keep them safe, clean, and organized.
  • Perform routine maintenance tasks on servers, networking equipment, and other infrastructure components in data centers and labs.
  • Troubleshoot hardware and software issues to ensure uninterrupted operation of critical systems.
  • Deploy test boards that run automated tests from software developers, and triage and root-cause board issues that stem not from hardware or software faults but potentially from test setup problems.
  • Remove and redeploy boards that need software and / or hardware upgrades from board engineers on a regular cadence.
  • Work closely and proactively with other engineering teams, such as system architects, chip and board designers, software / firmware engineers, HW / SW QA teams, and applications engineering teams, to drive the design, development, debug, and release of next-generation products.
  • Take an active part in procurement decisions for the lab by evaluating the available options, obtaining test copies, running proofs of concept, and providing recommendations.
  • Collect data for critical metrics for the lab and track progress.

What we need to see:

  • Associate's or Bachelor's degree in a tech-related major, or 4+ years of equivalent experience in a lab or data center environment.
  • Ability to perform well at work without requiring constant manager supervision.
  • Ability to deploy and cable servers and test equipment.
  • Proven experience working with data center infrastructure, including servers, storage systems, and networking equipment.
  • Strong knowledge of hardware components.
  • Basic user-level understanding of Unix / Windows, and of networking with enterprise switches and routers.
  • Skills to work with teammates of various abilities and experiences.
  • Ability to identify tasks where you need help from sysadmins, communicate those needs, and coordinate with them to integrate the solutions.
  • Perseverance to debug hard problems and out-of-the-box thinking to find solutions.
  • To be successful in this position, you should have a love of working with close-knit, multi-disciplinary teams, and enjoy hands-on work with state-of-the-art platforms.

Ways to stand out from the crowd:

  • Visio and CAD experience for Lab R&D projects and Rack Management.
  • Lab / Datacenter Procurement Experience.
  • Experience with handling PDUs and Power in Labs.
  • System-administrator-level experience on Unix / Windows and knowledge of scripting to automate workflows (bash / python).
  • Basic knowledge of Git / Perforce to check-out, edit and check-in scripts.
  • Ability to write SQL queries to get data from MySQL DBs.

Sustainable Talent is an M / F+, disabled, and veteran equal employment opportunity and affirmative action employer.
