Talent.com
Staff ML Platform Engineer – Large Scale Training (LLMOps / MLOps)

Socotra, Inc., San Francisco, CA, United States
30+ days ago
Job type
  • Full-time
Job description

Build the Future of Scalable AI at TrueFoundry

At TrueFoundry, we’re redefining how ML teams train, deploy, and scale their models. Our LLMOps and MLOps platform empowers organizations to experiment faster, train large-scale models reliably, and deploy them seamlessly on Kubernetes, with the same muscle as Big Tech.

We're looking for ML Systems Engineers who are passionate about scaling deep learning workloads, optimizing multi-GPU training, and shipping production-grade solutions. If you live and breathe PyTorch and multi-node training, and love solving gnarly infra challenges, this is your place.

What You’ll Work On

  • Write clean, modular, and scalable Python code, with a strong emphasis on reliability and performance.
  • Build a platform for training and fine-tuning large-scale ML models across multi-GPU, multi-node clusters with PyTorch, Kubeflow, and other orchestration tools.
  • Own the infrastructure and code that enables high-throughput, low-latency inference pipelines for state-of-the-art models.
  • Build a platform for developing, deploying, and evaluating agentic applications for our end customers.
  • Help shape internal standards and best practices across the engineering team for high-scale ML workloads.

What We’re Looking For

  • 5+ years of hands-on experience building and deploying ML systems at scale.
  • 5+ years of writing production-quality, high-performance code.
  • Deep experience with multi-GPU / multi-node training, ideally with PyTorch as your primary framework.
  • Experience working with PyTorch, high-level ML frameworks, and inference engines (vLLM or TensorRT).
  • Experience with Kubernetes is highly preferred; exposure to Kubernetes-native tools is a huge plus.
  • A pragmatic mindset—you know when to optimize and when to ship.
  • Bonus: Familiarity with open-source LLM training / fine-tuning.

Why Join TrueFoundry?

  • Work directly with ex-Facebook engineers and founders who are alumni of IIT Kharagpur, UC Berkeley, and Y Combinator.
  • First-hand exposure to building and scaling a deep-tech startup: insights you’ll carry if you want to start your own one day.
  • Be part of a fearlessly experimental culture focused on customer success and long-term impact.
  • Flexible hours, learning credits, and the opportunity to work shoulder-to-shoulder with the co-founders (Abhishek & Nikunj).
