Production Systems Engineer, Fleet AI Systems

Meta
Menlo Park, CA, United States
Full-time

Meta is seeking an experienced Production Systems Engineer to join our Release to Production (RTP) team. Our servers and data centers are the foundation upon which our rapidly scaling infrastructure operates efficiently to deliver our innovative services.

The RTP team is responsible for the hardware lifecycle of all Meta servers, including hands-on pre-production system and hardware debugging and stress testing, and enabling production-ready system monitoring, automated provisioning, and automated remediation of issues.

RTP Engineers work closely with hardware designers, system manufacturers, component vendors, capacity engineering, production engineering, Facebook services, and data center operations teams to test systems before release to our production data centers, and to track the health and lifecycle of servers in production.

Production Systems Engineer, Fleet AI Systems Responsibilities

  • Develop robust, industry-leading practices for supporting hardware infrastructure at scale
  • Interface with external vendors and internal hardware, mechanical, power, thermal, manufacturing and software engineers to understand system architecture, and develop and execute test suites for various architectures
  • Proactively create experiments and tooling to detect and diagnose hardware / firmware / software health issues
  • Implement remediations across software and hardware stack according to plan, while keeping a thorough procedural record and data log
  • Develop and publish updates on resolutions and communicate findings internally
  • Troubleshoot, diagnose, and root-cause system failures, and isolate the components / failure scenarios while working with internal and external stakeholders
  • Drive discussions with external and internal teams on test specifications and methodologies to continuously improve test quality

Minimum Qualifications

  • Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience.
  • 10+ years of experience in hardware systems technologies or supporting production hardware at scale
  • Troubleshooting and analytical experience
  • Knowledge of server architecture and components
  • Experience with Linux and scripting
  • Experience in changing system configurations and measuring change impact
  • Experience working in a matrix organization
  • Experience working through full life cycle for computer system products
  • Experience supporting AI / HPC systems, GPU or Silicon hardware, and / or related components at scale
  • Experience engineering for different server system / data center products

Preferred Qualifications

  • 10+ years of experience in production support at scale (e.g., 10K storage servers and over 100K HDDs)
  • 10+ years of experience in full system technologies
  • Experience in hyperscale post-production environments and solutions

27 days ago