Talent.com
Staff Site Reliability Engineer
Topstep • United States
20 days ago
Job type
  • Full-time
Job description

Summary

Are you a systems-minded engineer who thrives on building resilient infrastructure, driving operational excellence, and enabling teams to move fast with confidence? As a Staff Site Reliability Engineer at Topstep, you'll play a foundational role in shaping how we approach reliability, observability, and infrastructure at scale. You'll be instrumental in building out our SRE practice, defining our incident response culture, closing observability gaps, and optimizing our AWS infrastructure for both performance and cost. This role is ideal for someone who brings both deep technical expertise and a builder's mindset. Someone who's excited to establish best practices from the ground up, embed reliability into engineering culture, and create the foundations that let teams ship with speed and confidence. Join us and help define what operational excellence looks like at Topstep.

Key Responsibilities

  • Set technical direction for reliability and observability across the entire engineering organization, influencing architectural decisions
  • Build and mature our SRE practice by defining SLOs, incident response protocols, and on-call standards
  • Own the observability stack using DataDog (primary platform for metrics, APM, logging) and CloudWatch (AWS-native monitoring), instrumenting distributed tracing and closing gaps that currently prevent diagnosis of production issues
  • Partner with engineering teams to embed reliability principles early in the design process and improve system resilience
  • Lead incident response and blameless post-mortems, turning outages into opportunities for systematic improvement
  • Mentor engineers across the organization on reliability practices, operational thinking, and production ownership
  • Champion a culture of transparency, continuous improvement, and shared ownership of production systems

Required Qualifications and Key Competencies

  • 7+ years of professional experience in SRE, infrastructure, or platform engineering, with demonstrated impact building practices that scaled across multiple teams
  • Proven track record either starting an SRE function from scratch or scaling an existing practice with measurable improvements to MTTR, MTTD, change failure rate, or availability
  • Strong proficiency with DataDog for end-to-end observability (metrics, APM, logs, distributed tracing) and building alerting that catches real issues without causing fatigue
  • Deep expertise with AWS infrastructure (EKS, ECS, EC2, and RDS) running production services at scale, and hands-on experience optimizing for both reliability and cost
  • Solid foundation in distributed systems, networking, database performance, and debugging complex system failures across service boundaries
  • Comfortable reading code, writing automation scripts, and contributing to infrastructure tooling when needed
  • Proficiency with infrastructure as code (Terraform) and GitOps practices
  • Track record of influencing engineering culture through documentation, tooling, mentorship, and technical leadership
  • Excellent communication skills with the ability to explain complex system behavior and trade-offs to varied audiences
  • Comfortable making pragmatic trade-offs between long-term platform vision and immediate business needs

Company Culture & Perks

  • Topstep offers an engaging work environment that ranges from fully remote to hybrid. We foster a culture of collaboration with cameras on during meetings and a robust Slack environment for communication.
  • 10 company-paid holidays and generous family leave. Paid time off is accrued monthly.
  • Competitive 401(k) matching and health, dental, and vision insurance are offered to full-time employees.
  • Vacations are encouraged, with a bonus for taking 5 consecutive days off. Employee referrals earn a bonus. Topstep offers a food and groceries budget and contributes toward health and wellness.

New Hire Base Salary Range

  • $200,000-$250,000
  • Bonus: This position is eligible for a performance-based bonus as provided by the plan terms and governing documents.
  • The compensation offered will take into account internal compensation structure and may vary depending on the candidate's geographic region, job-related knowledge, skills, and experience among other factors.

Equal Opportunity Employer

Topstep is an Equal Opportunity Employer. We are committed to fostering an inclusive environment where all employees and applicants are valued. All qualified candidates will receive consideration for employment without regard to race, color, religion, gender, gender identity or expression, sexual orientation, national origin, age, disability, or veteran status, in compliance with applicable federal, state, and local laws.

Interested in the role? Apply today with your resume and cover letter!

At this time, immigration sponsorship is not available for this position (including H-1B, STEM OPT training plans, etc.).
