Talent.com
Senior ML Training Engineer

AION • Seattle, WA, US
25 days ago
Job type
  • Full-time
Job description

AION is building a next-generation AI cloud platform, transforming high-performance computing (HPC) through its decentralized AI cloud. Purpose-built for bare-metal performance, AION democratizes access to compute power for AI training, fine-tuning, inference, data labeling, and beyond.

By leveraging underutilized resources such as idle GPUs and data centers, AION provides a scalable, cost-effective, and sustainable solution tailored for developers, researchers, and enterprises.

Led by high-pedigree founders with previous exits, AION is well funded by major VCs and has strategic global partnerships. Headquartered in the US with a global presence, the company is building its initial core team in India, London, and Seattle.

Who You Are

You're an ML systems engineer who's passionate about building high-performance inference infrastructure. You don't need to be an expert in everything - this field is evolving too rapidly for that - but you have strong fundamentals and the curiosity to dive deep into optimization challenges. You thrive in early-stage environments where you'll learn cutting-edge techniques while building production systems. You think systematically about performance bottlenecks and are excited to push the boundaries of what's possible in AI infrastructure.

Requirements

Key Responsibilities

  • Architect and implement distributed training solutions for customers running pre-training, fine-tuning, and RL workloads on AION infrastructure.
  • Guide customers through large-scale training implementations including data parallelism, model parallelism, and pipeline parallelism strategies.
  • Design and optimize multi-GPU training setups with proper gradient synchronization, communication strategies, and scaling configurations.
  • Optimize and develop POCs for customer training accelerators including efficient data loading pipelines, gradient checkpointing, and memory optimization techniques.
  • Create comprehensive monitoring and debugging frameworks for distributed training jobs with performance tracking and bottleneck resolution.
  • Conduct technical workshops and training sessions on distributed training, reasoning techniques, and post-training optimization methodologies.
  • Support customers with advanced fine-tuning workflows including reward model training, constitutional AI, and alignment techniques.
  • Troubleshoot and resolve customer training bottlenecks including scaling inefficiencies and optimization challenges.
  • Collaborate with tech and product teams to translate customer needs into platform improvements and feature requirements.

Skills & Experience

  • High-agency individual looking to own customer success and influence training platform architecture.
  • 4+ years of ML engineering experience with focus on training large-scale models and distributed systems.
  • Expert-level PyTorch experience including distributed training, DDP implementation, and multi-GPU optimization.
  • Production experience with distributed training techniques including data parallelism, model parallelism, and pipeline parallelism.
  • Strong understanding of gradient synchronization and communication strategies for multi-node training.
  • Hands-on experience with large dataset handling and efficient data loading at scale.
  • Proficiency in training infrastructure tools such as Megatron-LM, DeepSpeed, FairScale, or similar frameworks.
  • Excellent communication and teaching skills with ability to explain complex technical concepts to diverse audiences.
  • Customer-facing experience in technical consulting, solutions engineering, or developer relations roles.
  • Experience with RLHF and fine-tuning pipelines including reward model training and post-training optimization.
  • Understanding of reasoning techniques including Chain-of-Thought prompting and advanced reasoning workflows.

Nice to have

  • Large-scale pre-training experience (7B+ parameters).
  • Advanced reasoning implementation (Tree-of-Thought, self-consistency).
  • DPO and constitutional AI expertise.
  • Open-source contributions to training frameworks.
  • Conference speaking or technical evangelism experience.

Benefits

  • Join the ground floor of a mission-driven AI startup revolutionizing compute infrastructure.
  • Work with a high-caliber, globally distributed team backed by major VCs.
  • Competitive compensation and benefits.
  • Fast-paced, flexible work environment with room for ownership and impact.
  • Hybrid model: 3 days in-office, 2 days remote, with flexibility to work remotely for part of the year.
  • If you have any questions about the role, please reach out to the hiring manager on LinkedIn or X.
