Talent.com
Site Reliability Engineer
DevOps projects • San Francisco, CA, United States
6 days ago
Job type
  • Full-time
Job description

About Runloop

Runloop is building the foundational infrastructure for the next generation of AI development. We provide AI engineers and data scientists with lightning-fast, secure, and reproducible code sandboxes. Our platform enables teams to experiment, iterate, and deploy their projects without the friction of environment setup and dependencies. We are a small but mighty team dedicated to building a rock-solid platform that empowers innovation.

The Role

We're looking for a skilled and passionate Site Reliability Engineer to join our team. As an SRE, you'll be responsible for the reliability, observability, performance, and security of our core platform—the very foundation on which our users build their futures. You'll work closely with our engineering team to develop and maintain the systems that power our code sandboxes, ensuring a seamless and stable experience for our customers. This is a critical role that blends a deep understanding of operations with a software engineering mindset.

Responsibilities

  • Design and maintain our production infrastructure on cloud platforms like AWS, GCP, or Azure.
  • Monitor and respond to system alerts and incidents using Grafana and Prometheus, ensuring high availability and a secure environment for our users' code.
  • Collaborate with developers to ensure new features and services are designed with scalability and reliability in mind.
  • Troubleshoot and resolve complex issues related to our infrastructure, networking, and the sandbox environment.
  • Participate in an on-call rotation to support our production systems.
  • Define and track SLIs/SLOs, manage error budgets, and proactively monitor distributed systems with logging and tracing.
  • Automate deployments, scaling, provisioning, and recovery tasks to reduce toil and build self-healing systems.
  • Lead incident response, conduct root‑cause analysis, and facilitate blameless post‑mortems to drive continual improvement.
  • Collaborate cross‑functionally with product, engineering, and developer relations to ensure reliable releases and an outstanding developer experience.
  • Plan for capacity growth, forecast system usage, and contribute to safe release and change management processes.
  • Mentor and support front‑end developers in building reliable distributed front‑end systems (CDNs, caching, client‑side observability).

Qualifications

  • Proven experience as an SRE, DevOps Engineer, or in a similar role.
  • Strong programming skills in languages like Python or Go.
  • Deep expertise in containerization technologies such as Docker and Kubernetes.
  • Experience with cloud infrastructure and tools like Terraform and/or Pulumi.
  • Familiarity with monitoring and alerting tools like Prometheus, Grafana, or Datadog.
  • A solid understanding of networking, security, and Linux systems administration.
  • Experience designing, scaling, and maintaining distributed systems (backend platforms, APIs, or front‑end infrastructure).
  • Proficiency in implementing observability frameworks (metrics, logging, tracing) and aligning reliability goals with developer velocity.
  • Hands‑on experience managing incidents, running on‑call operations, and producing actionable post‑mortems.
  • Ability to mentor engineers and influence reliability practices across teams, especially for front‑end infrastructure and performance.

Bonus Points

  • Experience with chaos engineering techniques, front‑end observability tools (e.g., Sentry, RUM, synthetic monitoring), or building CI/CD pipelines for front‑end delivery.

Benefits

  • Competitive salary and equity.
  • Comprehensive health, dental, and vision insurance for you and your dependents.
  • Opportunity to work on cutting‑edge AI technology and make a real impact on the future of software engineering.
  • Free lunch and snacks.

Location

In office 4 days a week in San Francisco, optional 1 day a week WFH.

Join Us

If you're excited about shaping the future of AI-driven software engineering and empowering developers to build the next generation of coding tools, we want to hear from you. Join Runloop and be at the forefront of the AI revolution in software development.

Runloop is an Equal Opportunity Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, national origin, disability status, protected veteran status, sexual orientation, gender identity or any other characteristic protected by law.
