Talent.com
Cluster Infrastructure Engineer

Cartesia, Inc. • San Francisco, CA, United States
1 day ago
Job type
  • Full-time
Job description

About Cartesia

Our mission is to build the next generation of AI: ubiquitous, interactive intelligence that runs wherever you are. Today, not even the best models can continuously process and reason over a year-long stream of audio, video and text (1B text tokens, 10B audio tokens and 1T video tokens), let alone do this on-device.

We're pioneering the model architectures that will make this possible. Our founding team met as PhDs at the Stanford AI Lab, where we invented State Space Models (SSMs), a new primitive for training efficient, large-scale foundation models. Our team combines deep expertise in model innovation and systems engineering with a design-minded product engineering team to build and ship cutting-edge models and experiences.

We're funded by leading investors at Index Ventures and Lightspeed Venture Partners, along with Factory, Conviction, A Star, General Catalyst, SV Angel, Databricks and others. We're fortunate to have the support of many amazing advisors, and 90+ angels across many industries, including the world's foremost experts in AI.

About the Role

We're looking for a Cluster Infrastructure Engineer to help build and scale the compute backbone that powers Cartesia's research on real-time, multimodal intelligence. In this role, you'll work at the intersection of distributed systems and infrastructure engineering, designing and operating the large-scale GPU clusters that train and serve Cartesia's foundation models. You'll own systems that need to be fast, reliable, and highly automated - ensuring our researchers and product teams can move at the speed of innovation. You'll build the tooling, automation, and monitoring needed to keep clusters resilient under load, quickly diagnose and resolve issues, and continuously push the boundaries of scalability and efficiency.

Your Impact

  • Design and build large-scale GPU clusters for model training and low-latency inference
  • Develop automation for provisioning, scaling, and monitoring to ensure clusters are fast, resilient, and self-healing
  • Collaborate closely with research and product teams to enable distributed training at scale, optimizing for speed, reliability, and utilization
  • Implement robust observability and alerting systems to monitor GPU health, node stability, and job performance
  • Diagnose and triage hardware, networking, and distributed training issues across environments, coordinating with provider support as needed
  • Continuously improve cluster reliability, developer ergonomics, and overall system efficiency across Cartesia's research and production workloads

What You Bring

  • Strong engineering fundamentals and experience building and operating large-scale distributed systems
  • Deep familiarity with HPC & GPU cluster management using Kubernetes and Slurm
  • A blend of developer empathy and raw performance engineering, designing systems and tools that are intuitive to use and fast
  • Ability to balance principled engineering with the urgency of keeping mission-critical systems alive
  • Proficiency with Infrastructure-as-Code tools (Terraform, Ansible, etc.) and observability tools (Prometheus, Grafana, etc.)
  • Strong debugging skills: comfortable diagnosing NCCL issues, CUDA errors, and network or driver-level faults

What Sets You Apart

  • Experience optimizing large-scale distributed training frameworks such as DeepSpeed, Megatron-LM, or similar
  • Familiarity with advanced parallelization techniques such as FSDP, context parallelism, or tensor parallelism

Our Culture

We're an in-person team based out of San Francisco. We love being in the office, hanging out together, and learning from each other every day.

We ship fast. All of our work is novel and cutting edge, and execution speed is paramount. We have a high bar, and we don't sacrifice quality or design along the way.

We support each other. We have an open & inclusive culture that's focused on giving everyone the resources they need to succeed.
