LLM Training Dataset and Checkpoint Optimization Engineer

Together AI • San Francisco, CA, United States

30+ days ago

Job type

Full-time

Job description

About Us

Together.ai is a leader in developing AI infrastructure that powers the training of state-of-the-art models. We focus on creating scalable, efficient systems for handling massive datasets and managing large-scale distributed checkpoints, ensuring seamless workflows for training and fine-tuning AI models.

We are seeking a Training Dataset and Checkpoint Acceleration Engineer to optimize data pipelines and checkpoint mechanisms for large-scale machine learning workloads. In this role, you will work at the intersection of data engineering and distributed systems, ensuring that training workflows are highly performant, reliable, and cost-efficient.

Responsibilities

Dataset Acceleration:

Design and optimize high-throughput data pipelines for streaming and processing massive training datasets.

Implement caching, sharding, and prefetching techniques to maximize data-loading efficiency.

Ensure efficient integration with distributed storage systems (e.g., S3, GCS, Lustre, Ceph).

Checkpointing Systems:

Build and optimize distributed checkpoint mechanisms for large-scale training workflows.

Implement techniques to minimize checkpoint I/O overhead and ensure fault tolerance.

Develop incremental and differential checkpointing solutions to reduce storage costs.

Performance Optimization:

Profile and debug bottlenecks in data pipelines and checkpoint systems.

Optimize GPU/TPU utilization by ensuring efficient data feeding and fast checkpoint recovery.

Scalability and Reliability:

Develop systems that scale efficiently across thousands of nodes and petabyte-scale datasets.

Ensure fault-tolerant recovery and resume mechanisms for long-running training jobs.

Collaboration and Support:

Work closely with ML researchers, data engineers, and infrastructure teams to understand workload requirements.

Build tools and frameworks to enable seamless integration of dataset and checkpointing systems with existing ML workflows.

Qualifications

Must-Have:

Experience:

5+ years of experience in data engineering, distributed systems, or ML infrastructure.

Technical Skills:

Expertise in high-performance data processing libraries (e.g., PyTorch DataLoader, TensorFlow tf.data, DALI).

Proficiency in distributed storage systems and data formats (e.g., Parquet, HDF5).

Strong understanding of checkpointing frameworks and file systems (e.g., POSIX, Lustre, GPFS).

Programming:

Proficient in Python, C++, or Go for performance-critical systems.

Optimization Techniques:

Experience with I/O optimization techniques (e.g., asynchronous data loading, prefetching).

Familiarity with compression and serialization for large datasets and checkpoints.

Soft Skills:

Analytical and problem-solving mindset.

Strong communication and collaboration skills across teams.

Nice-to-Have:

Experience with ML frameworks (e.g., PyTorch, TensorFlow, JAX) and distributed training.

Familiarity with hardware accelerators (e.g., GPUs, TPUs) and storage optimizations.

Contributions to, or knowledge of, open-source projects related to data pipelines or checkpointing.

Experience with incremental and real-time checkpointing solutions.

About Together AI

Together AI is a research-driven artificial intelligence company. We believe open and transparent AI systems will drive innovation and create the best outcomes for society, and together we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models. We have contributed to leading open-source research, models, and datasets to advance the frontier of AI, and our team has been behind technological advancements such as FlashAttention, Hyena, FlexGen, and RedPajama. We invite you to join a passionate group of researchers on our journey to build the next-generation AI infrastructure.

Compensation

We offer competitive compensation, startup equity, health insurance, and other competitive benefits. The US base salary range for this full-time position is: $160,000 - $230,000 + equity + benefits. Our salary ranges are determined by location, level, and role. Individual compensation will be determined by experience, skills, and job-related knowledge.

Equal Opportunity

Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more.

Please see our privacy policy at https://www.together.ai/privacy
