Training Performance Engineer

OpenAI, San Francisco, CA, United States

17 hours ago

Job type

Full-time

Job description

About the Team

Training Runtime designs the core distributed machine-learning training runtime that powers everything from early research experiments to frontier-scale model runs. With a dual mandate to accelerate researchers and enable frontier scale, we're building a unified, modular runtime that meets researchers where they are and moves with them up the scaling curve.

Our work focuses on three pillars: high-performance, asynchronous, zero-copy tensor and optimizer-state-aware data movement; performant, high-uptime, fault-tolerant training frameworks (training loop, state management, resilient checkpointing, deterministic orchestration, and observability); and distributed process management for long-lived, job-specific and user-provided processes.

We integrate proven large-scale capabilities into a composable, developer-facing runtime so teams can iterate quickly and run reliably at any scale, partnering closely with model-stack, research, and platform teams. Success for us is measured by raising both training throughput (how fast models train) and researcher throughput (how fast ideas become experiments and products).

About the Role

As a Training Performance Engineer, you'll drive efficiency improvements across our distributed training stack. You'll analyze large-scale training runs, identify utilization gaps, and design optimizations that push the boundaries of throughput and uptime. This role blends deep systems understanding with practical performance engineering: analyzing GPU kernel performance and collective communication throughput, investigating I/O bottlenecks, and sharding our models so we can train them at massive scale.

You'll help ensure that our clusters are running at peak performance, enabling OpenAI to train larger, more capable models with the same compute budget.

This role is based in San Francisco, CA. We use a hybrid work model of three days in the office per week and offer relocation assistance to new employees.

In this role, you will:

Profile end-to-end training runs to identify performance bottlenecks across compute, communication, and storage.

Optimize GPU utilization and throughput for large-scale distributed model training.

Collaborate with runtime and systems engineers to improve kernel efficiency, scheduling, and collective communication performance.

Implement model graph transforms to improve end-to-end throughput.

Build tooling to monitor and visualize MFU, throughput, and uptime across clusters.

Partner with researchers to ensure new model architectures scale efficiently during pre-training.

Contribute to infrastructure decisions that improve reliability and efficiency of large training jobs.

You might thrive in this role if you:

Love optimizing performance and digging into systems to understand how every layer interacts.

Have strong programming skills in Python and C++ (Rust or CUDA a plus).

Have experience running distributed training jobs on multi-GPU systems or HPC clusters.

Enjoy debugging complex distributed systems and measuring efficiency rigorously.

Have exposure to frameworks like PyTorch, JAX, or TensorFlow and an understanding of how large-scale training loops are built.

Are comfortable collaborating across teams and translating raw profiling data into practical engineering improvements.

Nice to have:

Familiarity with NCCL, MPI, or UCX communication libraries.

Experience with large-scale data loading and checkpointing systems.

Prior work on training runtime, distributed scheduling, or ML compiler optimization.

About OpenAI

OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We push the boundaries of the capabilities of AI systems and seek to safely deploy them to the world through our products. AI is an extremely powerful tool that must be created with safety and human needs at its core, and to achieve our mission, we must encompass and value the many different perspectives, voices, and experiences that form the full spectrum of humanity.

We are an equal opportunity employer, and we do not discriminate on the basis of race, religion, color, national origin, sex, sexual orientation, age, veteran status, disability, genetic information, or other applicable legally protected characteristic.

For additional information, please see OpenAI's Affirmative Action and Equal Employment Opportunity Policy Statement.

Qualified applicants with arrest or conviction records will be considered for employment in accordance with applicable law, including the San Francisco Fair Chance Ordinance, the Los Angeles County Fair Chance Ordinance for Employers, and the California Fair Chance Act. For unincorporated Los Angeles County workers: we reasonably believe that criminal history may have a direct, adverse and negative relationship with the following job duties, potentially resulting in the withdrawal of a conditional offer of employment: protect computer hardware entrusted to you from theft, loss or damage; return all computer hardware in your possession (including the data contained therein) upon termination of employment or end of assignment; and maintain the confidentiality of proprietary, confidential, and non-public information. In addition, job duties require access to secure and protected information technology systems and related data security obligations.

To notify OpenAI that you believe this job posting is non-compliant, please submit a report through this form. No response will be provided to inquiries unrelated to job posting compliance.

We are committed to providing reasonable accommodations to applicants with disabilities, and requests can be made via this link.

OpenAI Global Applicant Privacy Policy

At OpenAI, we believe artificial intelligence has the potential to help people solve immense global challenges, and we want the upside of AI to be widely shared. Join us in shaping the future of technology.

Compensation Range: $250K - $460K
