Talent.com
LLM Training Resilience Engineer

Together AI • San Francisco, CA, United States
30+ days ago
Job type
  • Full-time
Job description

About The Role

Together.ai is at the forefront of AI infrastructure development, creating robust platforms and frameworks to support state-of-the-art large-scale machine learning training. We specialize in delivering resilient, high-performance systems that power breakthroughs in AI research and deployment.

We are seeking a Large-scale Training Resilience Engineer to ensure the reliability, fault tolerance, and scalability of our large-scale training infrastructure. If you are passionate about solving complex distributed systems problems and building highly available AI training pipelines, this role is for you.

Responsibilities

  • Resilience and Fault Tolerance Design:
      • Develop systems to identify, isolate, and recover from failures in large-scale distributed training workloads.
      • Implement proactive error-detection mechanisms, including straggler detection and fault prediction algorithms.
  • Distributed System Optimization:
      • Ensure stability and consistency across distributed training clusters (e.g., GPU/TPU clusters).
      • Optimize recovery time and throughput in the face of hardware or software failures.
  • Monitoring and Observability:
      • Design and maintain observability systems for monitoring cluster health, training performance, and failure patterns.
      • Leverage telemetry data to improve incident response and automate mitigation strategies.
  • Automation and Tooling:
      • Build resilience-focused tooling, such as job health monitors, distributed checkpoint systems, and automated recovery workflows.
      • Enhance debugging and diagnosis frameworks for distributed training jobs.
  • Collaboration and Documentation:
      • Collaborate with platform engineers, researchers, and ML practitioners to identify pain points and resilience requirements.
      • Document and communicate best practices for fault-tolerant AI training.

Requirements

Must-Have:

  • Experience:
      • 5+ years of experience in distributed systems, cloud infrastructure, or large-scale machine learning training.
  • Technical Skills:
      • Proficiency in distributed computing frameworks (e.g., PyTorch DDP, TensorFlow, Horovod).
      • Strong knowledge of resilience strategies in distributed systems (e.g., leader election, consensus, retry mechanisms).
      • Hands-on experience with observability tools (e.g., Prometheus, Grafana, ELK stack).
  • Programming:
      • Proficient in Python, Go, or a similar programming language.
  • Infrastructure:
      • Experience working with cloud platforms (e.g., AWS, GCP, Azure) and Kubernetes for workload orchestration.
  • Soft Skills:
      • Strong analytical, problem-solving, and debugging skills.
      • Excellent collaboration and communication skills.

Nice-to-Have:

  • Familiarity with GPU/TPU cluster management and scheduling.
  • Experience with high-availability database systems or message queues.
  • Experience with open-source contributions or community engagement.

About Together AI

Together AI is a research-driven artificial intelligence company. We believe open and transparent AI systems will drive innovation and create the best outcomes for society, and together we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models. We have contributed to leading open-source research, models, and datasets to advance the frontier of AI, and our team has been behind technological advancements such as FlashAttention, Hyena, FlexGen, and RedPajama. We invite you to join a passionate group of researchers on our journey to build the next generation of AI infrastructure.

Compensation

We offer competitive compensation, startup equity, health insurance, and other competitive benefits. The US base salary range for this full-time position is: $160,000 - $230,000 + equity + benefits. Our salary ranges are determined by location, level, and role. Individual compensation will be determined by experience, skills, and job-related knowledge.

Equal Opportunity

Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more.

Please see our privacy policy at https://www.together.ai/privacy
