Talent.com
LLM Training Dataset and Checkpoint Optimization Engineer
Together AI • San Francisco, CA, United States
30+ days ago
Job type
  • Full-time
Job description

About Us

Together.ai is a leader in developing AI infrastructure that powers the training of state-of-the-art models. We focus on creating scalable, efficient systems for handling massive datasets and managing large-scale distributed checkpoints, ensuring seamless workflows for training and fine-tuning AI models.

We are seeking a Training Dataset and Checkpoint Acceleration Engineer to optimize data pipelines and checkpoint mechanisms for large-scale machine learning workloads. In this role, you will work at the intersection of data engineering and distributed systems, ensuring that training workflows are highly performant, reliable, and cost-efficient.

Responsibilities

  • Dataset Acceleration:
      • Design and optimize high-throughput data pipelines for streaming and processing massive training datasets.
      • Implement caching, sharding, and prefetching techniques to maximize data-loading efficiency.
      • Ensure efficient integration with distributed storage systems (e.g., S3, GCS, Lustre, Ceph).
  • Checkpointing Systems:
      • Build and optimize distributed checkpoint mechanisms for large-scale training workflows.
      • Implement techniques to minimize checkpoint I/O overhead and ensure fault tolerance.
      • Develop incremental and differential checkpointing solutions to reduce storage costs.
  • Performance Optimization:
      • Profile and debug bottlenecks in data pipelines and checkpoint systems.
      • Optimize for GPU/TPU utilization by ensuring efficient data feeding and checkpoint recovery times.
  • Scalability and Reliability:
      • Develop systems that scale efficiently across thousands of nodes and petabyte-scale datasets.
      • Ensure fault-tolerant recovery and resume mechanisms for long-running training jobs.
  • Collaboration and Support:
      • Work closely with ML researchers, data engineers, and infrastructure teams to understand workload requirements.
      • Build tools and frameworks to enable seamless integration of dataset and checkpointing systems with existing ML workflows.

Qualifications

Must-Have:

  • Experience:
      • 5+ years of experience in data engineering, distributed systems, or ML infrastructure.
  • Technical Skills:
      • Expertise in high-performance data processing libraries (e.g., PyTorch DataLoader, TensorFlow Data, DALI).
      • Proficiency in distributed storage systems and data formats (e.g., Parquet, HDF5).
      • Strong understanding of checkpointing frameworks and file systems (e.g., POSIX, Lustre, GPFS).
  • Programming:
      • Proficient in Python, C++, or Go for performance-critical systems.
  • Optimization Techniques:
      • Experience with I/O optimization techniques (e.g., asynchronous data loading, prefetching).
      • Familiarity with compression and serialization for large datasets and checkpoints.
  • Soft Skills:
      • Analytical and problem-solving mindset.
      • Strong communication and collaboration skills across teams.

Nice-to-Have:

  • Experience with ML frameworks (e.g., PyTorch, TensorFlow, JAX) and distributed training.
  • Familiarity with hardware accelerators (e.g., GPUs, TPUs) and storage optimizations.
  • Knowledge of open-source contributions or projects related to data pipelines or checkpointing.
  • Experience with incremental and real-time checkpointing solutions.

About Together AI

Together AI is a research-driven artificial intelligence company. We believe open and transparent AI systems will drive innovation and create the best outcomes for society, and together we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models. We have contributed to leading open-source research, models, and datasets to advance the frontier of AI, and our team has been behind technological advancements such as FlashAttention, Hyena, FlexGen, and RedPajama. We invite you to join a passionate group of researchers on our journey to build the next-generation AI infrastructure.

Compensation

We offer competitive compensation, startup equity, health insurance and other competitive benefits. The US base salary range for this full-time position is: $160,000 - $230,000 + equity + benefits. Our salary ranges are determined by location, level and role. Individual compensation will be determined by experience, skills, and job-related knowledge.

Equal Opportunity

Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more.

Please see our privacy policy at https://www.together.ai/privacy
