Distributed ML Systems Engineer - Inference

Together AI, San Francisco, CA, United States

30+ days ago

Job type

Full-time

Job description

About the Role

Together AI is seeking a Distributed ML Systems Engineer to design and build scalable machine learning systems that power our accelerated AI initiatives. This role involves developing large-scale, fault-tolerant distributed systems that handle high-load and high-performance requirements. If you are passionate about designing ML systems that operate at scale and eager to create impactful solutions, we want to hear from you. This position offers the chance to work closely with our AI researchers and infrastructure teams to ensure our systems are robust and efficient. Join us in shaping the future at Together AI!

Responsibilities

Design and build large-scale, distributed machine learning systems that are fault-tolerant and high-performance.
Develop and optimize distributed processing frameworks and storage systems.
Collaborate with researchers, engineers, and product managers to integrate ML systems into our infrastructure.
Conduct architecture and design reviews to ensure best practices in system design.
Implement robust monitoring and logging systems to ensure the health and performance of our ML systems.

Requirements

3+ years of experience in building large-scale, fault-tolerant, high-performance distributed systems.

Strong programming skills in one or more of Python, Go, Rust, or C / C++.

Excellent understanding of low-level operating system concepts, including multi-threading, memory management, networking, storage, performance, and scale.

Experience with cloud computing platforms (AWS, GCP, Azure, etc.) and large-scale infrastructure.

Strong problem-solving skills and ability to work in a fast-paced environment.

Preferred: Experience with Kubernetes

Preferred: Experience with PyTorch

About Together AI

Together AI is a research-driven artificial intelligence company. We believe open and transparent AI systems will drive innovation and create the best outcomes for society. Together, we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models. We have contributed to leading open-source research, models, and datasets to advance the frontier of AI. Our team has been behind technological advancements such as FlashAttention, Hyena, FlexGen, and RedPajama. We invite you to join a passionate group of researchers and engineers in our journey to build the next-generation AI infrastructure.

Compensation

We offer competitive compensation, startup equity, health insurance, and other competitive benefits. The US base salary range for this full-time position is $160,000 - $230,000 + equity + benefits. Our salary ranges are determined by location, level, and role. Individual compensation will be determined by experience, skills, and job-related knowledge.

Equal Opportunity

Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunities to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more.

Please see our privacy policy at
