Talent.com
Machine Learning Engineer, Training Infrastructure
Hedra • San Francisco, CA, United States
30+ days ago
Job type
  • Full-time
Job description

Overview

We are looking for an ML Engineer with 3+ years of experience in high-performance computing systems to manage and optimize the computational infrastructure we use to train and deploy our machine learning models. The ideal candidate has broad experience managing ML workloads at scale and will support our 3DVAE and video diffusion models. We encourage you to apply even if you don’t meet every requirement; we value curiosity, creativity, and the drive to solve hard problems.

Responsibilities

  • Design, implement, and maintain scalable computing solutions for training and deploying ML models, ensuring infrastructure can handle large video datasets.
  • Manage and optimize the performance of our computing clusters or cloud instances (for example, on AWS or Google Cloud) to support distributed training.
  • Ensure that our infrastructure can handle the resource-intensive tasks associated with training large generative models.
  • Monitor system performance and implement improvements to maximize efficiency and utilization, using tools like Airflow for orchestration.
  • Collaborate with research teams to understand their computational needs and provide appropriate solutions that make model deployment seamless.

Qualifications

  • Bachelor’s degree in Computer Science, Information Technology, or a related field, with a focus on system administration.
  • Experience with cloud computing platforms such as Amazon Web Services, Google Cloud, or Microsoft Azure, essential for managing large-scale ML workloads.
  • Values sound engineering processes, including version control and CI/CD.
  • Knowledge of containerization and orchestration technologies such as Docker and Kubernetes, required for deployments at scale.
  • Understanding of distributed training techniques and how to scale models across multi-node clusters to meet video generation needs.
  • Strong problem-solving and communication skills, given the need to collaborate with diverse teams.

Benefits

  • Competitive compensation + equity
  • 401k (no match)
  • Healthcare (Silver PPO Medical, Vision, Dental)
  • Lunch and snacks at the office

Location: San Francisco, CA
