Research Scientist - Distributed Machine Learning
Institute of Foundation Models • Sunnyvale, CA, United States
Posted more than 30 days ago
Contract type
  • Full-time
Job description

About the Institute of Foundation Models

We are a dedicated research lab for building, understanding, using, and risk-managing foundation models. Our mandate is to advance research, nurture the next generation of AI builders, and drive transformative contributions to a knowledge-driven economy.

As part of our team, you'll have the opportunity to work on the core of cutting-edge foundation model training, alongside world-class researchers, data scientists, and engineers, tackling the most fundamental and impactful challenges in AI development. You will participate in the development of groundbreaking AI solutions that have the potential to reshape entire industries. Strategic and innovative problem-solving skills will be instrumental in establishing MBZUAI as a global hub for high-performance computing in deep learning, driving impactful discoveries that inspire the next generation of AI pioneers.

Role Overview

Build and scale distributed pre-training frameworks

• Set up DeepSpeed / FSDP / Megatron-LM across multi-node GPU clusters.

• Create robust launch scripts, resilient checkpoints, and job monitoring (e.g., for NCCL, GLOO, and GPU health).

Turn mathematical ideas into fast production code

• Prototype new optimizers or attention methods (e.g., in PyTorch, NumPy, JAX, or other frameworks).

• Convert them into efficient CUDA/Triton kernels with custom gradients and tests.

Boost training efficiency and stability

• Lead mixed-precision training: push bf16, fp8, and similar formats into daily runs, track their accuracy-versus-speed trade-offs, and analyze numerical stability.

• Apply kernel fusion, communication tuning, and memory optimization to reach state-of-the-art throughput.

Accelerate research velocity

• Build logging, metrics, and other experiment-tracking tools for rapid iteration.

• Design ablation studies and statistical tests that validate or refute new ideas.

• Mentor interns and junior engineers through clear async design docs and code reviews.

You'll work side-by-side with researchers, ship production code, and shape the future of large language models.

Why You'll Love This Job

Frontier-scale impact - Train and ship cutting-edge models powering MBZUAI research and industry collaborations.

Research x Engineering blend - Move breakthrough papers into real systems and publish your own results.

End-to-end mastery - Touch everything from petabyte data loaders to custom low-level kernels, experience that's rare elsewhere.

Open, mission-driven science - Join a transparent culture tackling problems that truly advance AI.

Founding-team growth - Help set direction for IFM U.S. and lead the next generation of AI development.

Key Responsibilities

Framework Ownership - Productionize a PyTorch/JAX pre-training stack and keep it reliable at scale.

Custom Optimizer Implementation - Code new algorithms in distributed frameworks directly from mathematical specs.

Experiment Infrastructure - Build reusable modules, logging, and metrics dashboards that speed up research cycles.

Performance Optimization - Apply kernel fusion, communication optimization, and memory management to thousands of GPU jobs.

Distributed Debugging - Rapidly diagnose gradient synchronization, collective-ops, or fault-tolerance issues.

Collaboration - Document designs clearly, run post-mortems, and partner with global research teams.

Qualifications

Must-Haves

• 5+ years of combined industry or hands-on research experience with large-scale deep-learning training.

• Led at least one large-scale transformer pre-training run.

• Expert-level PyTorch or JAX/Flax, plus DeepSpeed, FSDP, Megatron-LM, or MosaicML Composer.

• Experience with distributed training at scale (100+ GPUs).

• Proven multi-node GPU work (Slurm, K8s, or Ray) and NCCL/GLOO debugging.

• Strong software engineering skills on large ML codebases.

• Ownership of mixed- or low-precision paths (bf16, fp8, 4-bit) with accuracy validation.

• Clear written communication (design docs, RFCs, post-mortems).

Nice-to-Haves

• NeurIPS / ICML / ICLR papers or open-source contributions to major ML frameworks.

• Experience implementing optimization algorithms (e.g., SGD variants, Adam, second-order methods).

• Background in numerical computing.

• Ability to translate math into code and build high-perf CUDA/Triton kernels.

$150,000 - $450,000 a year

Total compensation target: $300K - $600K (base salary + target bonus up to 30%), commensurate with experience.

Visa Sponsorship

This position is eligible for visa sponsorship.

Benefits Include

• Comprehensive medical, dental, and vision benefits

• Bonus

• 401K Plan

• Generous paid time off, sick leave and holidays

• Paid Parental Leave

• Employee Assistance Program

• Life insurance and disability
