Talent.com
Staff ML Platform Engineer – Large Scale Training (LLMOps/MLOps)

Socotra, Inc. • San Francisco, CA, United States
30+ days ago
Job type
  • Full-time
Job description

Build the Future of Scalable AI at TrueFoundry

At TrueFoundry, we’re redefining how ML teams train, deploy, and scale their models. Our LLMOps and MLOps platform empowers organizations to experiment faster, train large-scale models reliably, and deploy them seamlessly on Kubernetes, with the same muscle as Big Tech.

We're looking for ML Systems Engineers who are passionate about scaling deep learning workloads, optimizing multi-GPU training, and shipping production-grade solutions. If you live and breathe PyTorch and multi-node training, and love solving gnarly infra challenges, this is your place.

What You’ll Work On

  • Write clean, modular, and scalable Python code, with a strong emphasis on reliability and performance.
  • Build a platform for training and fine-tuning large-scale ML models across multi-GPU, multi-node clusters with PyTorch, Kubeflow, and other orchestration tools.
  • Own the infrastructure and code that enable high-throughput, low-latency inference pipelines for state-of-the-art models.
  • Build a platform for developing, deploying, and evaluating agentic applications for our end customers.
  • Help shape internal standards and best practices across the engineering team for high-scale ML workloads.

What We’re Looking For

  • 5+ years of hands-on experience building and deploying ML systems at scale.
  • 5+ years of writing production-quality, high-performance code.
  • Deep experience with multi-GPU / multi-node training, ideally with PyTorch as your primary framework.
  • Experience working with torch, high-level ML frameworks, and inference engines (vLLM or TensorRT).
  • Experience with Kubernetes is highly preferred; exposure to Kubernetes-native tools is a huge plus.
  • A pragmatic mindset—you know when to optimize and when to ship.
  • Bonus: Familiarity with open-source LLM training / fine-tuning.

Why Join TrueFoundry?

  • Work directly with ex-Facebook engineers and founders from IIT Kharagpur, UC Berkeley, and Y Combinator alumni.
  • First-hand exposure to building and scaling a deep-tech startup, with insights you’ll carry if you want to start your own one day.
  • Be part of a fearlessly experimental culture focused on customer success and long-term impact.
  • Flexible hours, learning credits, and the opportunity to work shoulder-to-shoulder with the co-founders (Abhishek & Nikunj).
