Talent.com
LLM Training Frameworks and Optimization Engineer
Together AI • San Francisco, CA, United States
30+ days ago
Job type
  • Full-time
Job description

Role

At Together.ai, we are building cutting-edge infrastructure to enable efficient and scalable training of large language models (LLMs). We focus on optimizing training frameworks, algorithms, and infrastructure to push the boundaries of AI performance, scalability, and cost-efficiency.

We are seeking an LLM Training Frameworks and Optimization Engineer to drive innovations in the development and optimization of distributed training frameworks. In this role, you will ensure that our LLM training pipelines are robust, efficient, and capable of handling the complexities of large-scale distributed systems.

Responsibilities

  • Framework Development and Optimization:
      • Design, implement, and optimize distributed training frameworks tailored for large language models.
      • Develop custom modules, plugins, and features to enhance framework scalability and performance.
  • Algorithmic and Systems Optimization:
      • Optimize communication patterns (e.g., gradient synchronization, all-reduce) in distributed training.
      • Implement techniques like mixed precision, tensor parallelism, pipeline parallelism, and sharded training.
  • Performance Tuning:
      • Conduct in-depth profiling and debugging of training jobs to identify and resolve bottlenecks.
      • Collaborate with hardware teams to optimize performance for GPUs, TPUs, and other accelerators.
  • Scalability and Resilience:
      • Ensure training systems scale efficiently to thousands of nodes and petabytes of data.
      • Develop resilience mechanisms for fault-tolerant and checkpointed training pipelines.
  • Collaboration and Support:
      • Work closely with researchers, data engineers, and platform teams to ensure training frameworks meet model and workload requirements.
      • Provide guidance and tools to improve the overall efficiency of the LLM development lifecycle.

Requirements

Must-Have:

  • Experience:
      • 5+ years of experience in deep learning frameworks, distributed systems, or machine learning infrastructure.
  • Technical Skills:
      • Expertise in distributed training frameworks (e.g., PyTorch DDP, DeepSpeed, Megatron-LM, TensorFlow XLA).
      • Strong understanding of parallelism techniques (e.g., data, tensor, pipeline, and ZeRO-based parallelism).
      • Familiarity with GPU/TPU hardware and deep learning performance optimizations.
  • Programming:
      • Proficient in Python and C++ or CUDA for high-performance computing.
  • Optimization Techniques:
      • Experience with memory optimization techniques (e.g., activation checkpointing, gradient sharding).
      • Knowledge of training dynamics for large-scale LLMs, including hyperparameter tuning and optimization.
  • Soft Skills:
      • Analytical problem-solving skills and a focus on performance improvement.
      • Strong collaboration and communication skills across teams.

Nice-to-Have:

  • Familiarity with graph optimization and compiler-level performance tuning.
  • Contributions to open-source deep learning or distributed training projects.
  • Experience with low-level hardware optimizations (e.g., kernel fusion, custom CUDA kernels).
About Together AI

Together AI is a research-driven artificial intelligence company. We believe open and transparent AI systems will drive innovation and create the best outcomes for society, and together we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models. We have contributed to leading open-source research, models, and datasets to advance the frontier of AI, and our team has been behind technological advancements such as FlashAttention, Hyena, FlexGen, and RedPajama. We invite you to join a passionate group of researchers on our journey to build the next-generation AI infrastructure.

Compensation

We offer competitive compensation, startup equity, health insurance, and other competitive benefits. The US base salary range for this full-time position is $160,000 - $230,000 + equity + benefits. Our salary ranges are determined by location, level, and role. Individual compensation will be determined by experience, skills, and job-related knowledge.

Equal Opportunity

Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more.

Please see our privacy policy at https://www.together.ai/privacy

Seniority level
  • Mid-Senior level
Employment type
  • Full-time
Job function
  • Engineering and Information Technology
Industries
  • Software Development

