Talent.com
Staff ML Platform Engineer – Large Scale Training (LLMOps/MLOps)

Socotra, Inc. • San Francisco, CA, United States
30+ days ago
Job type
  • Full-time
Job description

Build the Future of Scalable AI at TrueFoundry

At TrueFoundry, we’re redefining how ML teams train, deploy, and scale their models. Our LLMOps and MLOps platform empowers organizations to experiment faster, train large-scale models reliably, and deploy them seamlessly on Kubernetes, with the same muscle as Big Tech.

We're looking for ML Systems Engineers who are passionate about scaling deep learning workloads, optimizing multi-GPU training, and shipping production-grade solutions. If you live and breathe PyTorch, multi-node training, and love solving gnarly infra challenges—this is your place.

What You’ll Work On

  • Write clean, modular, and scalable Python code, with a strong emphasis on reliability and performance.
  • Build a platform for training and fine-tuning large-scale ML models across multi-GPU, multi-node clusters with PyTorch, Kubeflow, and other orchestration tools.
  • Own the infrastructure and code that enables high-throughput, low-latency inference pipelines for state-of-the-art models.
  • Build a platform for developing, deploying, and evaluating agentic applications for our end customers.
  • Help shape internal standards and best practices across the engineering team for high-scale ML workloads.

What We’re Looking For

  • 5+ years of hands-on experience building and deploying ML systems at scale.
  • 5+ years of writing production-quality, high-performance code.
  • Deep experience with multi-GPU / multi-node training, ideally with PyTorch as your primary framework.
  • Experience working with torch, high-level ML frameworks, and inference engines (vLLM or TensorRT).
  • Experience with Kubernetes is highly preferred; exposure to Kubernetes-native tools is a huge plus.
  • A pragmatic mindset—you know when to optimize and when to ship.
  • Bonus: Familiarity with open-source LLM training / fine-tuning.

Why Join TrueFoundry?

  • Work directly with ex-Facebook engineers and founders from IIT Kharagpur, UC Berkeley, and Y Combinator alumni.
  • First-hand exposure to building and scaling a deep-tech startup, with insights you’ll carry if you want to start your own one day.
  • Be part of a fearlessly experimental culture focused on customer success and long-term impact.
  • Flexible hours, learning credits, and the opportunity to work shoulder-to-shoulder with the co-founders (Abhishek & Nikunj).
