Site Reliability Engineer
Runloop AI, Inc • San Francisco, CA, United States
2 days ago
Job type
  • Full-time
Job description

About Runloop

Runloop is building the foundational infrastructure for the next generation of AI development. We provide AI engineers and data scientists with lightning-fast, secure, and reproducible code sandboxes. Our platform enables teams to experiment, iterate, and deploy their projects without the friction of environment setup and dependencies. We are a small but mighty team dedicated to building a rock-solid platform that empowers innovation.

The Role

We're looking for a skilled and passionate Site Reliability Engineer to join our team. As an SRE, you'll be responsible for the reliability, observability, performance, and security of our core platform, the very foundation on which our users build their futures. You'll work closely with our engineering team to develop and maintain the systems that power our code sandboxes, ensuring a seamless and stable experience for our customers. This is a critical role that blends a deep understanding of operations with a software engineering mindset.

Responsibilities

  • Design and maintain our production infrastructure on cloud platforms like AWS, GCP, or Azure.
  • Monitor and respond to system alerts and incidents using Grafana and Prometheus, ensuring high availability and a secure environment for our users' code.
  • Collaborate with developers to ensure new features and services are designed with scalability and reliability in mind.
  • Troubleshoot and resolve complex issues related to our infrastructure, networking, and the sandbox environment.
  • Participate in an on-call rotation to support our production systems.
  • Define and track SLIs/SLOs, manage error budgets, and proactively monitor distributed systems with logging and tracing.
  • Automate deployments, scaling, provisioning, and recovery tasks to reduce toil and build self-healing systems.
  • Lead incident response, conduct root-cause analysis, and facilitate blameless post-mortems to drive continual improvement.
  • Collaborate cross-functionally with product, engineering, and developer relations to ensure reliable releases and an outstanding developer experience.
  • Plan for capacity growth, forecast system usage, and contribute to safe release and change management processes.
  • Mentor and support front-end developers in building reliable distributed front-end systems (CDNs, caching, client-side observability).

Qualifications

  • Proven experience as an SRE, DevOps Engineer, or similar role.
  • Strong programming skills in languages like Python or Go.
  • Deep expertise in containerization technologies such as Docker and Kubernetes.
  • Experience with cloud infrastructure and tools like Terraform and/or Pulumi.
  • Familiarity with monitoring and alerting tools like Prometheus, Grafana, or Datadog.
  • A solid understanding of networking, security, and Linux systems administration.
  • Experience designing, scaling, and maintaining distributed systems (backend platforms, APIs, or front-end infrastructure).
  • Proficiency in implementing observability frameworks (metrics, logging, tracing) and aligning reliability goals with developer velocity.
  • Hands-on experience managing incidents, running on-call operations, and producing actionable post-mortems.
  • Ability to mentor engineers and influence reliability practices across teams, especially for front-end infrastructure and performance.

Bonus Points

  • Experience with chaos engineering techniques, front-end observability tools (e.g., Sentry, RUM, synthetic monitoring), or building CI/CD pipelines for front-end delivery.

Benefits

  • Competitive salary and equity.
  • Comprehensive health, dental, and vision insurance for you and your dependents.
  • Opportunity to work on cutting-edge AI technology and make a real impact on the future of software engineering.
  • Free lunch and snacks.

Location

  • In office 4 days a week in San Francisco, optional 1 day a week WFH.

Join Us

If you're excited about shaping the future of AI-driven software engineering and empowering developers to build the next generation of coding tools, we want to hear from you. Join Runloop and be at the forefront of the AI revolution in software development.

Runloop is an Equal Opportunity Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, national origin, disability status, protected veteran status, sexual orientation, gender identity or any other characteristic protected by law.
