Staff ML Platform Engineer – Large Scale Training (LLMOps / MLOps)

Socotra, Inc. • San Francisco, CA, United States

30+ days ago

Job type

Full-time

Job description

Build the Future of Scalable AI at TrueFoundry

At TrueFoundry, we’re redefining how ML teams train, deploy, and scale their models. Our LLMOps and MLOps platform empowers organizations to experiment faster, train large-scale models reliably, and deploy them seamlessly on Kubernetes, with the same muscle as Big Tech.

We're looking for ML Systems Engineers who are passionate about scaling deep learning workloads, optimizing multi-GPU training, and shipping production-grade solutions. If you live and breathe PyTorch and multi-node training, and love solving gnarly infra challenges, this is your place.

What You’ll Work On

Write clean, modular, and scalable Python code, with a strong emphasis on reliability and performance.
Build a platform for training and fine-tuning large-scale ML models across multi-GPU, multi-node clusters with PyTorch, Kubeflow, and other orchestration tools.
Own the infrastructure and code that enable high-throughput, low-latency inference pipelines for state-of-the-art models.
Build a platform for developing, deploying, and evaluating agentic applications for our end customers.
Help shape internal standards and best practices across the engineering team for high-scale ML workloads.

What We’re Looking For

5+ years of hands-on experience building and deploying ML systems at scale.

5+ years of writing production-quality, high-performance code.

Deep experience with multi-GPU / multi-node training, ideally with PyTorch as your primary framework.

Experience working with torch, high-level ML frameworks, and inference engines (vLLM or TensorRT).

Experience with Kubernetes is highly preferred; exposure to Kubernetes-native tools is a huge plus.

A pragmatic mindset—you know when to optimize and when to ship.

Bonus: Familiarity with open-source LLM training / fine-tuning.

Why Join TrueFoundry?

Work directly with ex-Facebook engineers and founders from IIT Kharagpur, UC Berkeley, and Y Combinator alumni.

First-hand exposure to building and scaling a deep-tech startup, with insights you’ll carry if you want to start your own one day.

Be part of a fearlessly experimental culture focused on customer success and long-term impact.

Flexible hours, learning credits, and the opportunity to work shoulder-to-shoulder with the co-founders (Abhishek & Nikunj).
