Site Reliability Engineer - Technical Lead

ZipRecruiter • San Francisco, CA, United States

3 days ago

Job type

Full-time

Job description

Overview

Veryon is a leading software and technology company that enables aviation teams around the world to improve efficiency and safety. Our products maximize uptime for aircraft maintenance teams through customer-driven innovation and world-class service.

With over 7,500 customers across 137 countries, we serve general and business aviation, military/defense, commercial aviation, and OEMs. Our values (Fueled by Customers, Win Together, Make It Happen, Innovate to Elevate) are the foundation of everything we do.

Role

As a hands-on Technical Lead in Site Reliability Engineering, you will be directly responsible for designing, building, and implementing modern reliability practices to ensure uptime, resilience, and production excellence across Veryon’s systems. You’ll work closely with Engineering, DevOps, and Support teams to streamline software delivery to both internal and client environments, troubleshoot production issues, and build observability using Datadog, Dynatrace, and AWS tools. You will also mentor engineers on best practices and be a key contributor to reliability-focused architecture and deployment design.

What You’ll Accomplish – Your Performance Objectives

Objective #1 – First 30 Days

Complete onboarding and gain a deep understanding of Veryon’s systems, release processes, and deployment environment on AWS.

Review existing application architecture, CI/CD flows, and monitoring implementations.

Begin implementing improvements to observability using Datadog and Dynatrace.

Collaborate with engineers and DevOps to identify bottlenecks in production releases and issue resolution.

Objective #2 – First 90 Days

Build or enhance monitoring dashboards and alerts for critical infrastructure and applications.

Define and begin implementing Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets.

Own and improve release workflows and ensure reliable software delivery to customer environments.

Take ownership of investigating production issues, ensuring timely resolution and coordination across teams.

Begin documenting Root Cause Analyses (RCAs) for production incidents and drive preventive improvements.

Partner with DevOps to optimize and automate CI/CD pipelines using GitLab or equivalent.

Objective #3 – First 12 Months

Deliver measurable improvements in system uptime, MTTR, and deployment success rate.

Build self-healing automation and rollback mechanisms for high-risk services.

Standardize and own the RCA process for production incidents to ensure continuous learning.

Implement robust controls and metrics to monitor software delivery health.

Support production readiness of new services through performance baselining and fault testing.

Establish and track health KPIs that inform operational decisions and product improvements.

Key Qualifications

Responsibilities

Implement and manage observability, alerting, and dashboards using Datadog, Dynatrace, and AWS tools.

Take ownership of production deployments, ensuring successful delivery to client environments with minimal disruption.

Troubleshoot and resolve production issues across the stack (infrastructure, application, integration).

Lead Root Cause Analysis (RCA) documentation, follow-ups, and remediation planning.

Define and maintain service SLOs, SLIs, and error budgets with product and engineering teams.

Build automation for deployment, monitoring, incident response, and recovery.

Design CI/CD workflows that support safe and reliable delivery across distributed environments.

Partner with developers to ensure observability and reliability are part of the application design.

Mentor engineers in SRE principles, monitoring strategy, and scalable operations.

Experience and Skills We Seek

6+ years of experience in SRE, DevOps, or platform engineering roles.

Strong hands-on experience with AWS services (e.g., EC2, ECS/EKS, RDS, IAM, CloudWatch, Route 53, ELB) is required.

Deep familiarity with CI/CD pipelines and deployment strategies using GitLab CI, Jenkins, or equivalent.

Expertise in observability tools such as Datadog and Dynatrace for APM, logging, and alerting.

Solid experience troubleshooting distributed systems in production environments.

Proficiency in scripting and infrastructure as code (e.g., Python, Bash, Terraform, Ansible).

Working knowledge of containers and orchestration (Docker, Kubernetes).

Understanding of SRE principles (SLIs, SLOs, MTTR, incident response, etc.).

Excellent communication and documentation skills, especially for RCA and runbook creation.

Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.

How We Work – The Core Values That We Live By

Fueled By Customers – Everything we do is to help our customers increase uptime. Transparent communication and customer empathy drive our decisions.

Win Together – Collaboration across teams is our core strength. We believe every person is vital to our success.

Make It Happen – We take initiative, follow through, and adapt as needed. We take ownership and tackle tough challenges.

Innovate to Elevate – We embrace change, experiment boldly, and continuously improve. We lead by setting a high bar for ourselves and our industry.
