Talent.com
Software Engineer, ML Infrastructure - Training Platform

Scale AI, Inc.
San Francisco, CA, US
1 day ago
Job type
  • Full-time
Job description

Scale is seeking an AI / ML Infrastructure Engineer to join our Machine Learning Infrastructure team to develop our Training Platform. In this role, you will collaborate closely with Machine Learning researchers to understand their needs and leverage your expertise and our compute resources to enhance experimentation throughput.

The ideal candidate has strong fundamentals in machine learning and backend system design, along with prior experience in ML infrastructure. Comfort with infrastructure, large-scale system design, and diagnosing model performance issues and system failures is essential.

You will:

  • Build highly available, observable, performant, and cost-effective APIs for model training.
  • Participate in our on-call process to ensure service availability.
  • Manage projects end-to-end, from requirements and scoping to design and implementation, within a collaborative, cross-functional environment.
  • Exercise good judgment in system and tool building, balancing build vs. buy decisions with cost considerations.

Ideally you'd have:

  • 4+ years of experience with machine learning training pipelines or inference services in production.
  • Experience with distributed training frameworks and techniques such as DeepSpeed, FSDP, etc.
  • Experience developing, deploying, and monitoring complex microservice architectures.
  • Proficiency in Python, Docker, Kubernetes, and Infrastructure as Code (e.g., Terraform).

Nice to haves:

  • Experience with LLM inference latency optimization techniques like kernel fusion, quantization, dynamic batching, etc.
  • Experience working with cloud platforms such as AWS or GCP.

Compensation packages include base salary, equity, and benefits. The salary range varies by location and other factors. Benefits include health, dental, vision, retirement, learning stipends, and generous PTO. Additional benefits may include commuter stipends.

Location-specific salary range in San Francisco, New York, Seattle: $160,000 to $225,600 USD.

Note: Our policy requires a 90-day waiting period before reconsidering candidates for the same role.

About Us:

At Scale, we aim to accelerate the transition to AI across industries. Our products power advanced LLMs, generative models, and computer vision models, trusted by leading AI companies and organizations worldwide. We promote an inclusive workplace and are committed to equal opportunity employment. For accommodations during the application process, contact accommodations@scale.com.

We adhere to the US Department of Labor's Pay Transparency and privacy policies. Personal data collected is used solely for employment-related purposes and managed according to our privacy policy.
