Talent.com
Staff Site Reliability Engineer

Gradle Technologies • New York, NY, US
30+ days ago
Job type
  • Full-time
Job description

Who We Are

AI is changing how software gets built. Code production is becoming a commodity. The focus is shifting from writing code to orchestrating, verifying, and governing change – and the toolchain is the new constraint.

Gradle is at the center of this shift. We build Develocity, a toolchain observability and intelligence platform used by some of the world's leading software organizations – Netflix, Airbnb, Spotify, SAP, major global banks, and hundreds more. Develocity helps software teams achieve delivery excellence through deep observability, build and test acceleration, and AI-powered intelligence across the entire toolchain – with current support for Gradle Build Tool, Apache Maven™, sbt, npm, and Python.

We are an AI-native company. AI is not a feature we're bolting on – it's central to how we work, how we think about our product, and where we're heading. We're investing deeply in making Develocity's unique data and decades of domain expertise accessible to both humans and AI agents, with trust, evidence, and explainability at the core of everything we build.

We have partnered with the Apache Software Foundation, the Commonhaus Foundation, the Micronaut Foundation, and other OSS projects such as Spring, Quarkus, Kotlin, JUnit, AndroidX, and many more to bring the value of Develocity to the OSS community as well.

Our Values

Seek to Understand: Everything starts with listening and understanding; we strive to understand diverse viewpoints, problems, and motivations. Before we take action, we ensure we truly grasp the challenges, perspectives, and goals.

Know the Why: We approach our work with a clear sense of purpose, ensuring every step is deliberate and focused. We take meaningful action with urgency, but never at the expense of thoughtful consideration.

Innovate & Iterate: We embrace challenges and are not afraid to try new things, even if they might fail. With a deep understanding and a clear purpose, we can develop creative, bold solutions to tackle challenges.

Own the Outcome: We are empowered to take initiative, and we maintain transparency in our work and its outcomes. When we execute, we take responsibility for our decisions, measure the success of our innovations, and learn from the results.

Who You Are

We're building a new SRE team and looking for founding members to help shape how we operate. As a Staff SRE, you'll be a technical and operational leader for reliability across Develocity. You'll help define our SRE vision, set standards for how we operate production services, and mentor other SREs as the team grows. This is a hands-on role with broad influence across engineering, cloud platform, and customer-facing teams.

The SRE team will be responsible for the reliability, performance, and availability of Develocity instances serving paying customers, open-source projects, and public-facing services, plus supporting infrastructure like artifact registries.

You'll work on our internally built Cloud Application Platform (Kubernetes on AWS) and develop deep expertise in it. When incidents happen, you'll troubleshoot issues across the stack, from application to infrastructure. You'll collaborate with the Cloud Platform team to improve the tooling you depend on, and with engineering teams to build reliability into how we ship software. If you like automating things and hate doing the same task twice, you'll fit in well.

You'll be part of a distributed, remote-first team that values asynchronous communication and written documentation. Strong self-direction and clear communication across time zones are essential.

Responsibilities
  • Operate and maintain all Develocity instances and supporting services in production.
  • Define and evolve SRE standards, practices, and operating models, including on-call, incident response, postmortems, and SLOs.
  • Participate in a follow-the-sun on-call rotation, acting as a technical escalation point for complex or high-severity incidents.
  • Lead incident response and blameless retrospectives, ensuring learnings result in measurable reliability improvements.
  • Set reliability priorities using risk, customer impact, business goals, SLOs, and error budgets.
  • Identify systemic reliability risks and continuously evolve Develocity's SaaS operations as the platform and customer base grow.
  • Lead and influence architectural and design reviews to ensure reliability, scalability, and operability.
  • Drive automation across deployment, upgrades, monitoring, self-healing, recovery, and operational workflows.
  • Build and maintain comprehensive observability for all managed services, including logging, metrics, tracing, and alerting.
  • Own disaster recovery, backups, and business continuity planning and execution.
  • Partner with engineering leadership to balance feature delivery with reliability and operational excellence.
  • Mentor and coach SREs, supporting technical growth and strong operational practices.
  • Help onboard new SREs and contribute to hiring by defining and assessing SRE excellence at Develocity.
  • Communicate clearly with customers during incidents and maintenance windows.
  • Optimize performance, resource utilization, and operational costs.
Minimum qualifications
  • 7+ years in SRE, DevOps, or an equivalent role operating production services at scale.
  • Experience leading reliability initiatives across multiple teams or services.
  • Demonstrated ability to influence technical direction without direct authority.
  • Experience designing and operating systems with SLOs and error budgets, and exercising strong judgment in balancing reliability, velocity, and cost.
  • Strong Kubernetes experience in production environments.
  • Cloud infrastructure expertise, preferably AWS (EKS, RDS, S3, EC2).
  • Proficiency with observability tools (Prometheus, Grafana) and Infrastructure as Code (Terraform).
  • Track record of incident management and response in a 24/7 on-call environment.
  • Scripting proficiency (Python, Bash) for automation.
  • Strong written and verbal English communication skills.
Preferred qualifications
  • Experience as a founding or early SRE establishing practices in a growing SaaS organization.
  • Familiarity with Develocity.
  • JVM language experience (Java, Kotlin).
  • Experience with customer-facing and executive-level incident communications.
What We Offer
  • A ground-floor role in a new SRE team - you'll shape how we do things, not inherit someone else's decisions.
  • Real ownership of production systems used by engineers at companies you've heard of.
  • Direct interaction with customers when things go wrong (and when they go right).
  • A culture that values automation over heroics.
  • In-person meetings, such as our annual company offsite and team meetings.
  • Work from home in a remote-first environment.
  • Competitive salaries and equity grants.
Compensation

The US salary range for this position is $180k-$220k, which reflects the target ranges for all US locations. Within this range, individual pay is determined by geographic location and additional factors including but not limited to experience, relevant skills, qualifications, seniority, performance, and travel requirements. Our recruiting team can share more information about the specific salary range for your location during the hiring process.

Location
  • Remote from anywhere in the EST time zone.
  • While our team works remotely and is spread across the globe, we deeply value daily interactions and collaboration.
