Talent.com
Engineering Manager, AI Inference Infrastructure

Baseten
San Francisco, CA, United States
1 day ago
Job type
  • Full-time
Job description

ABOUT BASETEN

Baseten powers inference for the world's most dynamic AI companies, like OpenEvidence, Clay, Mirage, Gamma, Sourcegraph, Writer, Abridge, Bland, and Zed. By uniting applied AI research, flexible infrastructure, and seamless developer tooling, we enable companies operating at the frontier of AI to bring cutting‑edge models into production. With our recent $150M Series D funding, backed by investors including BOND, IVP, Spark Capital, Greylock, and Conviction, we’re scaling our team to meet accelerating customer demand.

The Role

As an Engineering Manager (Player & Coach) focused on AI Inference Infrastructure, you’ll lead a team responsible for the performance, reliability, and success of large‑scale ML workloads in production. Combining hands‑on technical ownership with managerial leadership, you will guide your team through complex incidents, improve observability and operational practices, and shape how we deliver world‑class AI infrastructure support to our customers. While you will actively coach and grow your team, you’ll also stay close to the technology: diving into runtime debugging, optimizing GPU utilization, and helping evolve the Baseten platform based on real‑world patterns and customer feedback.

Responsibilities

  • Lead, mentor, and scale a team of Support Engineers specializing in AI and ML production environments, fostering technical depth, accountability, and a customer‑first mindset.
  • Serve as a player‑coach, directly contributing to complex troubleshooting, inference optimization, and incident resolution for high‑value enterprise customers.
  • Diagnose and resolve runtime issues impacting model performance, such as latency spikes, memory pressure, GPU scheduling, and concurrency management.
  • Debug Kubernetes infrastructure (pods, controllers, networking) and observability stacks using tools like Grafana, Loki, and Prometheus.
  • Own critical incidents end‑to‑end — coordinating across Engineering, Product, and Sales to ensure timely resolution, transparent communication, and SLA compliance.
  • Drive continuous improvement by enhancing diagnostic runbooks, refining alerting strategies, and developing internal automation for faster root‑cause analysis.
  • Collaborate with product and platform teams to surface insights from production issues — shaping roadmap priorities around reliability, inference efficiency, and operational scalability.
  • Lead initiatives that enhance observability, monitoring, and alerting for AI workloads across distributed compute environments.
  • Balance tactical execution with strategic vision, ensuring your team not only resolves today’s issues but also builds systems that prevent tomorrow’s.

Requirements

  • Proven experience leading or mentoring technical teams in Support Engineering, Infrastructure, or Site Reliability within production AI / ML or distributed systems environments.
  • Deep Kubernetes troubleshooting expertise, including advanced resource debugging, runtime performance analysis, and observability‑driven diagnostics.
  • Hands‑on experience managing distributed systems or AI products at scale — optimizing GPU / CPU utilization, batch sizing, concurrency, and memory efficiency.
  • Expertise with observability and monitoring tools (Grafana, Prometheus, Loki) and alerting best practices.
  • Skilled in incident management and customer escalation handling, with a proven ability to drive clarity and confidence in high‑stakes situations.
  • Demonstrated project management and organizational skills, capable of orchestrating multi‑stakeholder efforts from incident triage through resolution and RCA.

Bonus / Nice‑to‑Have

  • Experience implementing or managing incident‑response and ticketing systems (e.g., Zendesk, Pylon).

BENEFITS

  • Competitive compensation, including meaningful equity.
  • 100% coverage of medical, dental, and vision insurance for employees and dependents.
  • Generous PTO policy, including a company‑wide Winter Break (our offices are closed from Christmas Eve to New Year's Day!).
  • Paid parental leave.
  • Company‑facilitated 401(k).
  • Exposure to a variety of ML startups, offering unparalleled learning and networking opportunities.

Apply now to embark on a rewarding journey in shaping the future of AI! If you are a motivated individual with a passion for machine learning and a desire to be part of a collaborative and forward‑thinking team, we would love to hear from you.

At Baseten, we are committed to fostering a diverse and inclusive workplace. We provide equal employment opportunities to all employees and applicants without regard to race, color, religion, gender, sexual orientation, gender identity or expression, national origin, age, genetic information, disability, or veteran status.
