Talent.com
No longer accepting applications
LLM Training Frameworks and Optimization Engineer

Together AI • San Francisco, CA, US
1 day ago
Job type
  • Full-time
Job description

About Us

At Together.ai, we are building cutting-edge infrastructure to enable efficient and scalable training of large language models (LLMs). We focus on optimizing training frameworks, algorithms, and infrastructure to push the boundaries of AI performance, scalability, and cost-efficiency.

We are seeking an LLM Training Frameworks and Optimization Engineer to drive innovations in the development and optimization of distributed training frameworks. In this role, you will ensure that our LLM training pipelines are robust, efficient, and capable of handling the complexities of large-scale distributed systems.

Responsibilities

Framework Development and Optimization:

  • Design, implement, and optimize distributed training frameworks tailored for large language models.
  • Develop custom modules, plugins, and features to enhance framework scalability and performance.

Algorithmic and Systems Optimization:

  • Optimize communication patterns (e.g., gradient synchronization, all-reduce) in distributed training.
  • Implement techniques like mixed precision, tensor parallelism, pipeline parallelism, and sharded training.

Performance Tuning:

  • Conduct in-depth profiling and debugging of training jobs to identify and resolve bottlenecks.
  • Collaborate with hardware teams to optimize performance for GPUs, TPUs, and other accelerators.

Scalability and Resilience:

  • Ensure training systems scale efficiently to thousands of nodes and petabytes of data.
  • Develop resilience mechanisms for fault-tolerant and checkpointed training pipelines.

Collaboration and Support:

  • Work closely with researchers, data engineers, and platform teams to ensure training frameworks meet model and workload requirements.
  • Provide guidance and tools to improve the overall efficiency of the LLM development lifecycle.
Qualifications

Must-Have:

Experience:

  • 5+ years of experience in deep learning frameworks, distributed systems, or machine learning infrastructure.

Technical Skills:

  • Expertise in distributed training frameworks (e.g., PyTorch DDP, DeepSpeed, Megatron-LM, TensorFlow XLA).
  • Strong understanding of parallelism techniques (e.g., data, tensor, pipeline, and ZeRO-based parallelism).
  • Familiarity with GPU/TPU hardware and deep learning performance optimizations.

Programming:

  • Proficient in Python and C++ or CUDA for high-performance computing.

Optimization Techniques:

  • Experience with memory optimization techniques (e.g., activation checkpointing, gradient sharding).
  • Knowledge of training dynamics for large-scale LLMs, including hyperparameter tuning and optimization.

Soft Skills:

  • Analytical problem-solving skills and a focus on performance improvement.
  • Strong collaboration and communication skills across teams.

Nice-to-Have:

  • Familiarity with graph optimization and compiler-level performance tuning.
  • Contributions to open-source deep learning or distributed training projects.
  • Experience with low-level hardware optimizations (e.g., kernel fusion, custom CUDA kernels).
About Together AI

Together AI is a research-driven artificial intelligence company. We believe open and transparent AI systems will drive innovation and create the best outcomes for society, and together we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models. We have contributed to leading open-source research, models, and datasets to advance the frontier of AI, and our team has been behind technological advancements such as FlashAttention, Hyena, FlexGen, and RedPajama. We invite you to join a passionate group of researchers on our journey to build the next generation of AI infrastructure.

Compensation

We offer competitive compensation, startup equity, health insurance, and other competitive benefits. The US base salary range for this full-time position is $160,000 - $230,000 + equity + benefits. Our salary ranges are determined by location, level, and role. Individual compensation will be determined by experience, skills, and job-related knowledge.

Equal Opportunity

Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more.

Please see our privacy policy at https://www.together.ai/privacy
