Talent.com
Machine Learning Engineer - Training & Infrastructure

P-1 AI, San Francisco, CA, United States
21 hours ago
Job type
  • Full-time
Job description

About P-1 AI

We are building an engineering AGI. We founded P-1 AI with the conviction that the greatest impact of artificial intelligence will be on the built world, helping mankind conquer nature and bend it to our will. Our first product is Archie, an AI engineer capable of quantitative and spatial reasoning over physical product domains that performs at the level of an entry-level design engineer. We aim to put an Archie on every engineering team at every industrial company on earth.

About The Role

We're looking for an experienced engineer to take ownership of LLM training operations across our applied research team. Your focus will be on making large-scale GPU training run reliably, efficiently, and fast on a dedicated mid-size GPU cluster and possibly on cloud platforms as well. You'll work closely with researchers and ML engineers developing new models and agentic systems, ensuring their experiments scale smoothly across multi-node GPU clusters. From debugging NCCL deadlocks to optimizing FSDP configs, you'll be the go-to person for training infrastructure and performance.

What You'll Do

  • Own the training pipeline for large-scale LLM fine-tuning and post-training workflows
  • Configure, launch, monitor, and debug multi-node distributed training jobs using FSDP, DeepSpeed, or custom wrappers
  • Contribute to upstream and internal forks of training frameworks like TorchTune, TRL, and Hugging Face Transformers
  • Tune training parameters, memory footprints, and sharding strategies for optimal throughput
  • Work closely with infra and systems teams to maintain the health and utilization of our GPU clusters (e.g., InfiniBand, NCCL, Slurm, Kubernetes)
  • Implement features or fixes to unblock novel use cases in our LLM training stack

About You

  • 3+ years working with large-scale ML systems or training pipelines
  • Deep familiarity with PyTorch, especially distributed training via FSDP, DeepSpeed, or DDP
  • Comfortable navigating training libraries like TorchTune, Accelerate, or Trainer APIs
  • Practical experience with multi-node GPU training, including profiling, debugging, and optimizing jobs
  • Understanding of low-level components like NCCL, InfiniBand, CUDA memory, and model partitioning strategies
  • You enjoy bridging research and engineering: making messy ideas actually run on hardware

Nice to Have

  • Experience maintaining Slurm, Ray, or Kubernetes clusters
  • Past contributions to open-source ML training frameworks
  • Exposure to model scaling laws, checkpointing formats (e.g., HF sharded safetensors vs. distcp), or mixed precision training
  • Familiarity with on-policy reinforcement learning setups with inference (policy rollouts) as part of the training loop, such as GRPO, PPO, or A2C
  • Experience working at a startup

Interview Process

  • Initial screening: Head of Talent (30 mins)
  • Hiring manager interview: Head of AI (45 mins)
  • Technical interview: AI Chief Scientist and/or Head of AI (45 mins)
  • Culture fit / Q&A (maybe in person) with co-founder & CEO (45 mins)

Seniority level
  • Mid-Senior level
Employment type
  • Full-time
Job function
  • Engineering and Information Technology
Industries
  • Software Development
