No longer accepting applications

LLM Training Frameworks and Optimization Engineer

Together AI • San Francisco, CA, US

1 day ago

Job type

Full-time

Job description

About Us

At Together.ai, we are building cutting-edge infrastructure to enable efficient and scalable training of large language models (LLMs). We focus on optimizing training frameworks, algorithms, and infrastructure to push the boundaries of AI performance, scalability, and cost-efficiency.

We are seeking an LLM Training Frameworks and Optimization Engineer to drive innovations in the development and optimization of distributed training frameworks. In this role, you will ensure that our LLM training pipelines are robust, efficient, and capable of handling the complexities of large-scale distributed systems.

Responsibilities

Framework Development and Optimization:

Design, implement, and optimize distributed training frameworks tailored for large language models.

Develop custom modules, plugins, and features to enhance framework scalability and performance.

Algorithmic and Systems Optimization:

Optimize communication patterns (e.g., gradient synchronization, all-reduce) in distributed training.

Implement techniques like mixed precision, tensor parallelism, pipeline parallelism, and sharded training.

Performance Tuning:

Conduct in-depth profiling and debugging of training jobs to identify and resolve bottlenecks.

Collaborate with hardware teams to optimize performance for GPUs, TPUs, and other accelerators.

Scalability and Resilience:

Ensure training systems scale efficiently to thousands of nodes and petabytes of data.

Develop resilience mechanisms for fault-tolerant and checkpointed training pipelines.

Collaboration and Support:

Work closely with researchers, data engineers, and platform teams to ensure training frameworks meet model and workload requirements.

Provide guidance and tools to improve the overall efficiency of the LLM development lifecycle.

Qualifications

Must-Have:

Experience:

5+ years of experience in deep learning frameworks, distributed systems, or machine learning infrastructure.

Technical Skills:

Expertise in distributed training frameworks (e.g., PyTorch DDP, DeepSpeed, Megatron-LM, TensorFlow XLA).

Strong understanding of parallelism techniques (e.g., data, tensor, pipeline, and ZeRO-based parallelism).

Familiarity with GPU/TPU hardware and deep learning performance optimizations.

Programming:

Proficient in Python and C++ or CUDA for high-performance computing.

Optimization Techniques:

Experience with memory optimization techniques (e.g., activation checkpointing, gradient sharding).

Knowledge of training dynamics for large-scale LLMs, including hyperparameter tuning and optimization.

Soft Skills:

Analytical problem-solving skills and a focus on performance improvement.

Strong collaboration and communication skills across teams.

Nice-to-Have:

Familiarity with graph optimization and compiler-level performance tuning.

Contributions to open-source deep learning or distributed training projects.

Experience with low-level hardware optimizations (e.g., kernel fusion, custom CUDA kernels).

About Together AI

Together AI is a research-driven artificial intelligence company. We believe open and transparent AI systems will drive innovation and create the best outcomes for society, and together we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models. We have contributed to leading open-source research, models, and datasets to advance the frontier of AI, and our team has been behind technological advancements such as FlashAttention, Hyena, FlexGen, and RedPajama. We invite you to join a passionate group of researchers on our journey to build the next generation of AI infrastructure.

Compensation

We offer competitive compensation, startup equity, health insurance, and other competitive benefits. The US base salary range for this full-time position is $160,000 - $230,000 + equity + benefits. Our salary ranges are determined by location, level, and role. Individual compensation will be determined by experience, skills, and job-related knowledge.

Equal Opportunity

Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more.

Please see our privacy policy at https://www.together.ai/privacy

