LLM Training Dataset and Checkpoint Optimization Engineer

Together AI • San Francisco, CA, United States

30+ days ago

Job type

Full-time

Job description

About Us

Together AI is a leader in developing AI infrastructure that powers the training of state-of-the-art models. We focus on creating scalable, efficient systems for handling massive datasets and managing large-scale distributed checkpoints, ensuring seamless workflows for training and fine-tuning AI models.

We are seeking an LLM Training Dataset and Checkpoint Optimization Engineer to optimize data pipelines and checkpoint mechanisms for large-scale machine learning workloads. In this role, you will work at the intersection of data engineering and distributed systems, ensuring that training workflows are highly performant, reliable, and cost-efficient.

Responsibilities

Dataset Acceleration:

Design and optimize high-throughput data pipelines for streaming and processing massive training datasets.

Implement caching, sharding, and prefetching techniques to maximize data-loading efficiency.

Ensure efficient integration with distributed storage systems (e.g., S3, GCS, Lustre, Ceph).

Checkpointing Systems:

Build and optimize distributed checkpoint mechanisms for large-scale training workflows.

Implement techniques to minimize checkpoint I/O overhead and ensure fault tolerance.

Develop incremental and differential checkpointing solutions to reduce storage costs.

Performance Optimization:

Profile and debug bottlenecks in data pipelines and checkpoint systems.

Optimize for GPU/TPU utilization by ensuring efficient data feeding and short checkpoint recovery times.

Scalability and Reliability:

Develop systems that scale efficiently across thousands of nodes and petabyte-scale datasets.

Ensure fault-tolerant recovery and resume mechanisms for long-running training jobs.

Collaboration and Support:

Work closely with ML researchers, data engineers, and infrastructure teams to understand workload requirements.

Build tools and frameworks to enable seamless integration of dataset and checkpointing systems with existing ML workflows.

Qualifications

Must-Have:

Experience:

5+ years of experience in data engineering, distributed systems, or ML infrastructure.

Technical Skills:

Expertise in high-performance data processing libraries (e.g., PyTorch DataLoader, TensorFlow tf.data, DALI).

Proficiency in distributed storage systems and data formats (e.g., Parquet, HDF5).

Strong understanding of checkpointing frameworks and file systems (e.g., POSIX, Lustre, GPFS).

Programming:

Proficient in Python, C++, or Go for performance-critical systems.

Optimization Techniques:

Experience with I/O optimization techniques (e.g., asynchronous data loading, prefetching).

Familiarity with compression and serialization for large datasets and checkpoints.

Soft Skills:

Analytical and problem-solving mindset.

Strong communication and collaboration skills across teams.

Nice-to-Have:

Experience with ML frameworks (e.g., PyTorch, TensorFlow, JAX) and distributed training.

Familiarity with hardware accelerators (e.g., GPUs, TPUs) and storage optimizations.

Contributions to, or knowledge of, open-source projects related to data pipelines or checkpointing.

Experience with incremental and real-time checkpointing solutions.

About Together AI

Together AI is a research-driven artificial intelligence company. We believe open and transparent AI systems will drive innovation and create the best outcomes for society, and together we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models. We have contributed to leading open-source research, models, and datasets to advance the frontier of AI, and our team has been behind technological advancements such as FlashAttention, Hyena, FlexGen, and RedPajama. We invite you to join a passionate group of researchers on our journey to build the next-generation AI infrastructure.

Compensation

We offer competitive compensation, startup equity, health insurance, and other competitive benefits. The US base salary range for this full-time position is: $160,000 - $230,000 + equity + benefits. Our salary ranges are determined by location, level, and role. Individual compensation will be determined by experience, skills, and job-related knowledge.

Equal Opportunity

Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more.

Please see our privacy policy at https://www.together.ai/privacy
