Talent.com
Software Engineer, SystemML - Scaling / Performance

META • Menlo Park, CA, United States
1 day ago
Job type
  • Full-time
Job description

Summary :

In this role, you will be a member of the Network.AI Software team, part of the larger DC networking organization. The team develops and owns the software stack around NCCL (NVIDIA Collective Communications Library), which enables multi-GPU and multi-node data communication through HPC-style collectives. NCCL is integrated into PyTorch and sits on the critical path of multi-GPU distributed training. In other words, nearly every distributed GPU-based ML workload in Meta production goes through the software stack the team owns.

At a high level, the team aims to enable Meta-wide ML products and innovations to leverage our large-scale GPU training and inference fleet through an observable, reliable, and high-performance distributed AI / GPU communication stack. Currently, one of the team's focus areas is building customized features, software benchmarks, performance tuners, and software stacks around NCCL and PyTorch to improve full-stack distributed ML reliability and performance (e.g. large-scale GenAI / LLM training), from the trainer down to the inter-GPU and network communication layer. We are seeking engineers to work on GenAI / LLM scaling reliability and performance.

Software Engineer, SystemML - Scaling / Performance Responsibilities :

  • Enabling reliable and highly scalable distributed ML training on Meta's large-scale GPU training infra with a focus on GenAI / LLM scaling

Minimum Qualifications :

  • Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
  • Specialized experience in one or more of the following machine learning / deep learning domains: Distributed ML Training, GPU architecture, ML systems, AI infrastructure, high performance computing, performance optimizations, or Machine Learning frameworks (e.g. PyTorch)

Preferred Qualifications :

  • Knowledge of GPU architectures and CUDA programming
  • Experience working with DL frameworks like PyTorch, Caffe2 or TensorFlow
  • Experience in AI framework and trainer development on accelerating large-scale distributed deep learning models
  • PhD in Computer Science, Computer Engineering, or relevant technical field
  • Experience with both data parallel and model parallel training, such as Distributed Data Parallel, Fully Sharded Data Parallel (FSDP), Tensor Parallel, and Pipeline Parallel
  • Experience in HPC and parallel computing
  • Knowledge of ML, deep learning and LLM
  • Experience with NCCL and distributed GPU reliability / performance improvement on RoCE / InfiniBand

Public Compensation :

$70.67 / hour to $208,000 / year + bonus + equity + benefits

Industry : Internet

Equal Opportunity :

Meta is proud to be an Equal Employment Opportunity and Affirmative Action employer. We do not discriminate based upon race, religion, color, national origin, sex (including pregnancy, childbirth, or related medical conditions), sexual orientation, gender, gender identity, gender expression, transgender status, sexual stereotypes, age, status as a protected veteran, status as an individual with a disability, or other applicable legally protected characteristics. We also consider qualified applicants with criminal histories, consistent with applicable federal, state and local law. Meta participates in the E-Verify program in certain locations, as required by law. Please note that Meta may leverage artificial intelligence and machine learning technologies in connection with applications for employment.

Meta is committed to providing reasonable accommodations for candidates with disabilities in our recruiting process. If you need any assistance or accommodations due to a disability, please let us know at accommodations-ext@fb.com.
