Training: ML Framework Engineer

OpenAI, San Francisco, CA, United States

22 hours ago

Job type

Full-time

Job description

About the Team

Training Runtime designs the core distributed machine-learning training runtime that powers everything from early research experiments to frontier-scale model runs. With a dual mandate to accelerate researchers and enable frontier scale, we're building a unified, modular runtime that meets researchers where they are and moves with them up the scaling curve.

Our work focuses on three pillars: high-performance, asynchronous, zero-copy tensor and optimizer-state-aware data movement; performant, high-uptime, fault-tolerant training frameworks (training loop, state management, resilient checkpointing, deterministic orchestration, and observability); and distributed process management for long-lived, job-specific and user-provided processes.

We integrate proven large-scale capabilities into a composable, developer-facing runtime so teams can iterate quickly and run reliably at any scale, partnering closely with model-stack, research, and platform teams. Success for us is measured by raising both training throughput (how fast models train) and researcher throughput (how fast ideas become experiments and products).

About the Role

As a Training: ML Framework Engineer, you will work on improving the training throughput of our internal training framework while enabling researchers to experiment with new ideas. This requires good engineering (for example, designing, implementing, and optimizing state-of-the-art AI models), writing bug-free machine learning code (surprisingly difficult!), and acquiring deep knowledge of the performance of supercomputers. In all the projects this role pursues, the ultimate goal is to push the field forward.

We're looking for people who love optimizing performance and understanding distributed systems, and who cannot stand having bugs in their code. Since our training framework is used for large runs with massive numbers of GPUs, performance improvements here will have a large impact.

This role is based in San Francisco, CA. We use a hybrid work model of 3 days in the office per week and offer relocation assistance to new employees.

In this role, you will:

Apply the latest techniques in our internal training framework to achieve impressive hardware efficiency for our training runs

Profile and optimize our training framework

Work with researchers to enable them to develop the next generation of models

You might thrive in this role if you:

Have run small-scale ML experiments

Love figuring out how systems work and continuously come up with ideas for how to make them faster while minimizing complexity and maintenance burden

Have strong software engineering skills and are proficient in Python

About OpenAI

OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We push the boundaries of the capabilities of AI systems and seek to safely deploy them to the world through our products. AI is an extremely powerful tool that must be created with safety and human needs at its core, and to achieve our mission, we must encompass and value the many different perspectives, voices, and experiences that form the full spectrum of humanity.

We are an equal opportunity employer, and we do not discriminate on the basis of race, religion, color, national origin, sex, sexual orientation, age, veteran status, disability, genetic information, or other applicable legally protected characteristic.

For additional information, please see OpenAI's Affirmative Action and Equal Employment Opportunity Policy Statement.

Background checks for applicants will be administered in accordance with applicable law, and qualified applicants with arrest or conviction records will be considered for employment consistent with those laws, including the San Francisco Fair Chance Ordinance, the Los Angeles County Fair Chance Ordinance for Employers, and the California Fair Chance Act, for US-based candidates. For unincorporated Los Angeles County workers: we reasonably believe that criminal history may have a direct, adverse and negative relationship with the following job duties, potentially resulting in the withdrawal of a conditional offer of employment: protect computer hardware entrusted to you from theft, loss or damage; return all computer hardware in your possession (including the data contained therein) upon termination of employment or end of assignment; and maintain the confidentiality of proprietary, confidential, and non-public information. In addition, job duties require access to secure and protected information technology systems and related data security obligations.

To notify OpenAI that you believe this job posting is non-compliant, please submit a report through this form. No response will be provided to inquiries unrelated to job posting compliance.

We are committed to providing reasonable accommodations to applicants with disabilities, and requests can be made via this link.

OpenAI Global Applicant Privacy Policy

At OpenAI, we believe artificial intelligence has the potential to help people solve immense global challenges, and we want the upside of AI to be widely shared. Join us in shaping the future of technology.

Compensation Range: $245K - $385K
