Talent.com
Site Reliability Engineer - Inference

Jobright.ai, San Francisco, CA, United States
17 hours ago
Job type
  • Full-time
Job description

Jobright is an AI-powered career platform that helps job seekers discover the top opportunities in the US. We are NOT a staffing agency. Jobright does not hire directly for these positions. We connect you with verified openings from employers you can trust.

Job Summary:

Lambda is the #1 GPU Cloud for ML/AI teams, providing tools for building, testing, and deploying AI products at scale. The Site Reliability Engineer - Inference will work on developing a large-scale platform for running AI models and building a high-throughput, low-latency API for distributed systems.

Responsibilities:

  • Work on our Inference service, helping us to develop our large-scale platform for running new, cutting-edge models across tens of thousands of GPUs
  • Help build a high-throughput, low-latency API and routing system running at geographically-distributed scale
  • Shape a highly reliable distributed system, with a focus on reducing operational overhead, deep observability, and capacity management
  • Work with the team and our internal ML researchers to adopt and improve new inference engines, models, and architectures across a variety of media (such as text, image, video, and audio)
  • Tackle global networking challenges to deliver the lowest possible latency to our users across all of Lambda’s available capacity
  • Help push Lambda forward into the state of the art, and be part of a team that is operating right at the edge of new developments in the industry.

Qualifications:

Required:

  • 8 or more years of experience as a software reliability engineer or software engineer working on large-scale, internet-facing production services
  • Highly skilled at writing Go and Python
  • Experience with bare-metal system installation and administration
  • Experience deploying applications and operators on Kubernetes
  • Product-focused, balancing operational needs and keeping overheads down with the need to ship features at a rapid pace
  • Proven track record of working in an environment with rapid deployment and the ability to stay on top of shifting priorities as the industry rapidly develops
  • Willingness to take ownership of projects and help drive them forwards through design, implementation, launch, and maintenance.
Preferred:

  • Experience working with machine learning models
  • Experience operating large-scale, geographically distributed systems
  • Experience developing Kubernetes operators and components
Company:

Lambda provides infrastructure, cloud services, and software for the training and inferencing of AI models. Founded in 2012 and headquartered in San Jose, California, USA, the company has 201-500 employees and is currently late stage. Lambda has a track record of offering H1B sponsorships.

Seniority level: Mid-Senior level

Employment type: Full-time

Industries: Software Development

Benefits (inferred from the description for this job):

  • Medical insurance
  • Vision insurance
  • 401(k)
