Talent.com
Principal Engineer - Performance AI / ML Network Deployment Engineering

Advanced Micro Devices
Santa Clara, CA, United States
4 days ago
Job type
  • Full-time
Job description

WHAT YOU DO AT AMD CHANGES EVERYTHING

At AMD, our mission is to build great products that accelerate next-generation computing experiences—from AI and data centers, to PCs, gaming and embedded systems. Grounded in a culture of innovation and collaboration, we believe real progress comes from bold ideas, human ingenuity and a shared passion to create something extraordinary. When you join AMD, you’ll discover the real differentiator is our culture. We push the limits of innovation to solve the world’s most important challenges—striving for execution excellence, while being direct, humble, collaborative, and inclusive of diverse perspectives. Join us as we shape the future of AI and beyond. Together, we advance your career.

THE ROLE

The Principal Engineer, DC GPU AI / ML Advanced Forward Deployment and Systems Engineering, is a leadership position designed to optimize the design, rollout, and post-rollout management of AI / ML fabrics. The candidate will be the technical interface between customers and various internal engineering groups and field application engineers. The role draws on extensive experience in large network architecture, storage, AI / ML network deployments, and performance tuning, and requires a disciplined approach to system triage, at-scale debug, and infrastructure optimization to ensure robust performance and efficient transitions from GPU production qualification to at-scale datacenter deployment.

THE PERSON

This position is for a Principal Engineer, DC GPU AI / ML Advanced Forward Deployment and Systems Engineering, with a focus on architecture and design, optimizing compute, network, and storage, and benchmarking machine learning applications. You will be part of a team that works closely with strategic customers and partners to enable large-scale deployment of AMD CPU and GPU platforms. You will interface closely with ROCm software developers, DC GPU HW / FW / ASIC teams, field engineering teams, OEM / ODM partners, CSPs, and marketing / business development teams. The candidate must be self-motivated and able to work well within a team environment.

KEY RESPONSIBILITIES

  • Collaborate with strategic customers on scalable designs involving compute, networking, and storage environments, and work with industry partners and internal teams to accelerate the deployment and adoption of various AI / ML models.
  • Engage in system-level triage and at-scale debug of complex issues across hardware, firmware, and software, ensuring rapid resolution and system reliability.
  • Drive the ramp of Instinct-based, large-scale AI datacenter infrastructure built on NPI base platform hardware with ROCm, scaling up to pod and cluster level and leveraging the best in network architecture for AI / ML workloads.
  • Enhance tools and methodologies for large-scale deployments to meet customer uptime goals and exceed performance expectations.
  • Engage with clients to deeply understand their technical needs, ensuring their satisfaction with tailored solutions that leverage your past experience in strategic customer engagements and architectural wins.
  • Provide domain-specific knowledge to other groups at AMD and share lessons learned to drive continuous improvement.
  • Engage with AMD product groups to drive resolution of application and customer issues.
  • Develop and present training materials to internal audiences, at customer venues, and at industry conferences.

PREFERRED EXPERIENCE

  • Expertise in networking and performance optimization for large-scale AI / ML networks, including network, compute, and storage cluster design, modeling, analytics, performance tuning, convergence, and scalability improvements.
  • Candidates with solid, hands-on expertise in at least one of three domains (compute, network, storage) are preferred.
  • Demonstrated leadership in network architecture, with hands-on experience in RoCEv2 design, VXLAN-EVPN, BGP, and lossless fabrics.
  • Deep experience in working with large customers such as Cloud Service Providers and global enterprise customers
  • Proven leadership in engaging customers with diverse technical disciplines in avenues such as Proof of Concept, Competitive evaluations, Early Field Trials etc.
  • Direct experience working with large customers, with the ability to operate with a sense of urgency, own problems, and resolve them.
  • Extensive experience in Python, Linux, kernel modules, and application libraries, unless accompanied by other skill sets in the space.
  • Proven ability to influence design and technology roadmaps, leveraging a deep understanding of datacenter products and market trends.
  • Extensive hands-on network deployment expertise and a proven track record of delivering large projects on time. Cisco, Juniper, or Arista experience is required.
  • Direct, co-development / deployment experience in working with strategic customers / partners in bringing solutions to market.

  • Excellent communication skills with audiences ranging from engineers to mid-level management to C-level executives.
  • This is a Senior level role; no recent college graduates will be considered.

ACADEMIC CREDENTIALS

  • Bachelor's or master's degree in computer science, engineering, or a related field.
  • Ability to work well in a geographically dispersed team.
  • Certifications in Networking, AI / ML, or Cloud Technologies.

Benefits offered are described in: AMD benefits at a glance

    Equal Opportunity Statement

    AMD does not accept unsolicited resumes from headhunters, recruitment agencies, or fee-based recruitment services. AMD and its subsidiaries are equal opportunity, inclusive employers and will consider all applicants without regard to age, ancestry, color, marital status, medical condition, mental or physical disability, national origin, race, religion, political and / or third-party affiliation, sex, pregnancy, sexual orientation, gender identity, military or veteran status, or any other characteristic protected by law. We encourage applications from all qualified candidates and will accommodate applicants’ needs under the respective laws throughout all stages of the recruitment and selection process.
