Talent.com
LLM Training Dataset and Checkpoint Optimization Engineer
Together AI • San Francisco, CA, United States
30+ days ago
Job type
  • Full-time
Job description

About Us

Together.ai is a leader in developing AI infrastructure that powers the training of state-of-the-art models. We focus on creating scalable, efficient systems for handling massive datasets and managing large-scale distributed checkpoints, ensuring seamless workflows for training and fine-tuning AI models.

We are seeking a Training Dataset and Checkpoint Acceleration Engineer to optimize data pipelines and checkpoint mechanisms for large-scale machine learning workloads. In this role, you will work at the intersection of data engineering and distributed systems, ensuring that training workflows are highly performant, reliable, and cost-efficient.

Responsibilities

  • Dataset Acceleration:
      • Design and optimize high-throughput data pipelines for streaming and processing massive training datasets.
      • Implement caching, sharding, and prefetching techniques to maximize data-loading efficiency.
      • Ensure efficient integration with distributed storage systems (e.g., S3, GCS, Lustre, Ceph).
  • Checkpointing Systems:
      • Build and optimize distributed checkpoint mechanisms for large-scale training workflows.
      • Implement techniques to minimize checkpoint I/O overhead and ensure fault tolerance.
      • Develop incremental and differential checkpointing solutions to reduce storage costs.
  • Performance Optimization:
      • Profile and debug bottlenecks in data pipelines and checkpoint systems.
      • Optimize for GPU/TPU utilization by ensuring efficient data feeding and checkpoint recovery times.
  • Scalability and Reliability:
      • Develop systems that scale efficiently across thousands of nodes and petabyte-scale datasets.
      • Ensure fault-tolerant recovery and resume mechanisms for long-running training jobs.
  • Collaboration and Support:
      • Work closely with ML researchers, data engineers, and infrastructure teams to understand workload requirements.
      • Build tools and frameworks to enable seamless integration of dataset and checkpointing systems with existing ML workflows.
Qualifications

Must-Have:

  • Experience:
      • 5+ years of experience in data engineering, distributed systems, or ML infrastructure.
  • Technical Skills:
      • Expertise in high-performance data processing libraries (e.g., PyTorch DataLoader, TensorFlow Data, DALI).
      • Proficiency in distributed storage systems and data formats (e.g., Parquet, HDF5).
      • Strong understanding of checkpointing frameworks and file systems (e.g., POSIX, Lustre, GPFS).
  • Programming:
      • Proficient in Python, C++, or Go for performance-critical systems.
  • Optimization Techniques:
      • Experience with I/O optimization techniques (e.g., asynchronous data loading, prefetching).
      • Familiarity with compression and serialization for large datasets and checkpoints.
  • Soft Skills:
      • Analytical and problem-solving mindset.
      • Strong communication and collaboration skills across teams.

Nice-to-Have:

  • Experience with ML frameworks (e.g., PyTorch, TensorFlow, JAX) and distributed training.
  • Familiarity with hardware accelerators (e.g., GPUs, TPUs) and storage optimizations.
  • Knowledge of open-source contributions or projects related to data pipelines or checkpointing.
  • Experience with incremental and real-time checkpointing solutions.
About Together AI

Together AI is a research-driven artificial intelligence company. We believe open and transparent AI systems will drive innovation and create the best outcomes for society, and together we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models. We have contributed to leading open-source research, models, and datasets to advance the frontier of AI, and our team has been behind technological advancements such as FlashAttention, Hyena, FlexGen, and RedPajama. We invite you to join a passionate group of researchers on our journey to build the next-generation AI infrastructure.

Compensation

We offer competitive compensation, startup equity, health insurance, and other competitive benefits. The US base salary range for this full-time position is $160,000 - $230,000 + equity + benefits. Our salary ranges are determined by location, level, and role. Individual compensation will be determined by experience, skills, and job-related knowledge.

Equal Opportunity

Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more.

Please see our privacy policy at https://www.together.ai/privacy
