LLM Training Frameworks and Optimization Engineer

Together AI, San Francisco, CA, United States

30+ days ago

Job type

Full-time

Job description

Company

At Together.ai, we are building cutting-edge infrastructure to enable efficient and scalable training of large language models (LLMs). We focus on optimizing training frameworks, algorithms, and infrastructure to push the boundaries of AI performance, scalability, and cost-efficiency.

Role

We are seeking an LLM Training Frameworks and Optimization Engineer to drive innovation in the development and optimization of distributed training frameworks. In this role, you will ensure that our LLM training pipelines are robust, efficient, and capable of handling the complexities of large-scale distributed systems.

Responsibilities

Framework Development and Optimization:
Design, implement, and optimize distributed training frameworks tailored for large language models.
Develop custom modules, plugins, and features to enhance framework scalability and performance.

Algorithmic and Systems Optimization:
Optimize communication patterns (e.g., gradient synchronization, all-reduce) in distributed training.
Implement techniques like mixed precision, tensor parallelism, pipeline parallelism, and sharded training.

Performance Tuning:
Conduct in-depth profiling and debugging of training jobs to identify and resolve bottlenecks.
Collaborate with hardware teams to optimize performance for GPUs, TPUs, and other accelerators.

Scalability and Resilience:
Ensure training systems scale efficiently to thousands of nodes and petabytes of data.
Develop resilience mechanisms for fault-tolerant and checkpointed training pipelines.

Collaboration and Support:
Work closely with researchers, data engineers, and platform teams to ensure training frameworks meet model and workload requirements.
Provide guidance and tools to improve the overall efficiency of the LLM development lifecycle.

Requirements

Must-Have:

Experience:

5+ years of experience in deep learning frameworks, distributed systems, or machine learning infrastructure.

Technical Skills:

Expertise in distributed training frameworks (e.g., PyTorch DDP, DeepSpeed, Megatron-LM, TensorFlow XLA).

Strong understanding of parallelism techniques (e.g., data, tensor, pipeline, and ZeRO-based parallelism).

Familiarity with GPU/TPU hardware and deep learning performance optimizations.

Programming:

Proficient in Python and C++ or CUDA for high-performance computing.

Optimization Techniques:

Experience with memory optimization techniques (e.g., activation checkpointing, gradient sharding).

Knowledge of training dynamics for large-scale LLMs, including hyperparameter tuning and optimization.

Soft Skills:

Analytical problem-solving skills and a focus on performance improvement.

Strong collaboration and communication skills across teams.

Nice-to-Have:

Familiarity with graph optimization and compiler-level performance tuning.

Contributions to open-source deep learning or distributed training projects.

Experience with low-level hardware optimizations (e.g., kernel fusion, custom CUDA kernels).

About Together AI

Together AI is a research-driven artificial intelligence company. We believe open and transparent AI systems will drive innovation and create the best outcomes for society, and together we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models. We have contributed to leading open-source research, models, and datasets to advance the frontier of AI, and our team has been behind technological advancements such as FlashAttention, Hyena, FlexGen, and RedPajama. We invite you to join a passionate group of researchers on our journey to build next-generation AI infrastructure.

Compensation

We offer competitive compensation, startup equity, health insurance, and other competitive benefits. The US base salary range for this full-time position is $160,000 - $230,000 + equity + benefits. Our salary ranges are determined by location, level, and role. Individual compensation will be determined by experience, skills, and job-related knowledge.

Equal Opportunity

Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more.

Please see our privacy policy at https://www.together.ai/privacy

Seniority level

Mid-Senior level

Employment type

Full-time

Job function

Engineering and Information Technology

Industries

Software Development
