Talent.com
Senior Systems Engineer, Infrastructure & Platform Reliability

Lambda
San Francisco, CA, United States
4 hours ago
Job type
  • Full-time
Job description

Lambda, The Superintelligence Cloud, builds Gigawatt-scale AI Factories for Training and Inference. Lambda's mission is to make compute as ubiquitous as electricity and give every person access to artificial intelligence. One person, one GPU.

If you'd like to build the world's best deep learning cloud, join us.

Note: This position requires presence in our San Francisco or San Jose office 4 days per week; Lambda's designated work-from-home day is currently Tuesday.

Information Systems at Lambda is responsible for building and scaling the internal systems that power our business. We partner across the company (Finance, GTM, Engineering, and People) to implement tools, automate workflows, and ensure data flows securely and accurately. Our scope includes enterprise applications, integrations, data platform and analytics, compliance automation, and all things IT.

What You'll Do

  • Design, write, and deliver software and services to improve the availability, scalability, reliability, and efficiency of Lambda's internal IT systems and platforms.
  • Solve problems relating to mission-critical services and build automation to prevent recurrence, with the goal of automating the response to all non-exceptional events.
  • Work with Lambda Engineering and internal teams to influence and create new designs, architectures, standards, and methods for large-scale distributed systems.
  • Engage in service capacity planning and demand forecasting, software performance analysis, and system tuning.
  • Be an excellent communicator, producing documentation and related artifacts for the systems you are responsible for.
You

  • Have a keen interest in system design and in architecting for performance and scalability, along with experience on multiple cloud infrastructure platforms (AWS, GCP, Azure, etc.).
  • Think carefully about systems: edge cases, failure modes, behaviors, and specific implementations.
  • Know and prefer configuration management systems and toolchains (Chef, Ansible, Terraform, GitHub Actions, etc.).
  • Have solid programming skills: Python, Go, etc.
  • Have an urge to collaborate and communicate asynchronously, combined with a desire to record and document issues and solutions.
  • Have an enthusiastic, go-for-it attitude. When you see something broken, you can't help but fix it.
  • Have an urge to deliver quickly and effectively, and to iterate fast.

Nice to Have

  • Experience and interest in ML / AI workloads and compute
  • Practical experience implementing and managing paging, alerting, and on-call scheduling flows
  • A positive attitude, combined with a desire to learn and collaborate
Salary Range Information

The annual salary range for this position has been set based on market data and other factors. However, a salary higher or lower than this range may be appropriate for a candidate whose qualifications differ meaningfully from those listed in the job description.

About Lambda

  • Founded in 2012, ~400 employees (2025) and growing fast
  • We offer generous cash & equity compensation
  • Our investors include Andra Capital, SGW, Andrej Karpathy, ARK Invest, Fincadia Advisors, G Squared, In-Q-Tel (IQT), KHK & Partners, NVIDIA, Pegatron, Supermicro, Wistron, Wiwynn, US Innovative Technology, Gradient Ventures, Mercato Partners, SVB, 1517, Crescent Cove.
  • We are experiencing extremely high demand for our systems, with quarter-over-quarter, year-over-year profitability
  • Our research papers have been accepted into top machine learning and graphics conferences, including NeurIPS, ICCV, SIGGRAPH, and TOG
  • Health, dental, and vision coverage for you and your dependents
  • Wellness and commuter stipends for select roles
  • 401(k) plan with 2% company match (USA employees)
  • Flexible Paid Time Off Plan that we all actually use
A Final Note

You do not need to match all of the listed expectations to apply for this position. We are committed to building a team with a variety of backgrounds, experiences, and skills.

Equal Opportunity Employer

Lambda is an Equal Opportunity employer. Applicants are considered without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation and identity, genetic information, veteran status, citizenship, or any other factors prohibited by local, state, or federal law.
