Talent.com
Cluster Infrastructure Engineer

Cartesia • San Francisco, CA, United States
30+ days ago
Job type
  • Full-time
Job description

About Cartesia

Our mission is to build the next generation of AI: ubiquitous, interactive intelligence that runs wherever you are. Today, not even the best models can continuously process and reason over a year-long stream of audio, video and text (1B text tokens, 10B audio tokens and 1T video tokens), let alone do this on-device.

We're pioneering the model architectures that will make this possible. Our founding team met as PhDs at the Stanford AI Lab, where we invented State Space Models (SSMs), a new primitive for training efficient, large-scale foundation models. Our team combines deep expertise in model innovation and systems engineering with a design-minded product engineering team to build and ship cutting-edge models and experiences.

We're funded by leading investors at Index Ventures and Lightspeed Venture Partners, along with Factory, Conviction, A Star, General Catalyst, SV Angel, Databricks and others. We're fortunate to have the support of many amazing advisors, and 90+ angels across many industries, including the world's foremost experts in AI.

About the Role

We’re looking for a Cluster Infrastructure Engineer to help build and scale the compute backbone that powers Cartesia’s research on real-time, multimodal intelligence. In this role, you’ll work at the intersection of distributed systems and infrastructure engineering, designing and operating the large-scale GPU clusters that train and serve Cartesia’s foundation models. You’ll own systems that need to be fast, reliable, and highly automated — ensuring our researchers and product teams can move at the speed of innovation. You’ll build the tooling, automation, and monitoring needed to keep clusters resilient under load, quickly diagnose and resolve issues, and continuously push the boundaries of scalability and efficiency.

Your Impact

Design and build large-scale GPU clusters for model training and low-latency inference

Develop automation for provisioning, scaling, and monitoring to ensure clusters are fast, resilient, and self-healing

Collaborate closely with research and product teams to enable distributed training at scale, optimizing for speed, reliability, and utilization

Implement robust observability and alerting systems to monitor GPU health, node stability, and job performance

Diagnose and triage hardware, networking, and distributed training issues across environments, coordinating with provider support as needed

Continuously improve cluster reliability, developer ergonomics, and overall system efficiency across Cartesia’s research and production workloads

What You Bring

Strong engineering fundamentals and experience building and operating large-scale distributed systems

Deep familiarity with GPU cluster management using Kubernetes and Slurm

A blend of developer empathy and raw performance engineering, designing systems and tools that are intuitive to use and fast

Ability to balance principled engineering with the urgency of keeping mission-critical systems alive

Proficiency with Infrastructure-as-Code tools (Terraform, Ansible, etc.) and observability tools (Prometheus, Grafana, etc.)

Strong debugging skills: comfortable diagnosing NCCL issues, CUDA errors, and network or driver-level faults

What Sets You Apart

Experience optimizing large-scale distributed training frameworks such as DeepSpeed, Megatron-LM, or similar

Familiarity with advanced parallelization techniques such as FSDP, context parallelism, or tensor parallelism

Our culture

🏢 We’re an in-person team based out of San Francisco. We love being in the office, hanging out together and learning from each other every day.

🚢 We ship fast. All of our work is novel and cutting edge, and execution speed is paramount. We have a high bar, and we don’t sacrifice quality and design along the way.

🤝 We support each other. We have an open and inclusive culture that’s focused on giving everyone the resources they need to succeed.
