Talent.com
Sr. Site Reliability Engineer - Incident Response

Cox Automotive • Peachtree Corners, GA, United States
30+ days ago
Job type
  • Full-time
Job description

The Site Reliability Engineer - Incident Response is a critical enterprise-level role responsible for accelerating incident resolution and enhancing the overall incident management process. This individual partners with engineering teams during active incidents to troubleshoot issues using monitoring and logging tools, and post-incident, delivers executive-level summaries that clearly communicate impact, root cause, and resolution. The SRE - Incident Response also plays a key role in analyzing incident response effectiveness and identifying opportunities for systemic improvements.

Core Competencies and Qualifications
  • Bachelor's degree in a related discipline and 4 years' experience in a related field. The right candidate could also have a different combination, such as a master's degree and 2 years' experience; a Ph.D. and up to 1 year of experience; or 16 years' experience in a related field.
  • Applicants must currently be authorized to work in the United States for any employer without current or future sponsorship. No OPT, CPT, STEM / OPT or visa sponsorship now or in future.
  • Engineering / Tooling: Demonstrates the ability to design, build, and maintain engineering solutions and tools that enhance reliability, automate incident response, and reduce operational toil.
  • Incident Troubleshooting: Skilled in interpreting logs, metrics, and traces to assist in identifying root causes during live incidents.
  • Monitoring & Observability: Proficient in tools such as Datadog, Splunk, New Relic, or similar platforms.
  • Strong programming background in Python, Java, or C#, with experience building, maintaining, and troubleshooting production-grade services and automation tools.
  • Proven ability to design and implement reliable, scalable, and highly available systems, leveraging software engineering best practices to improve system resilience and operational efficiency.
  • Experience developing automation and tooling to reduce toil, improve incident response, and support continuous improvement across monitoring, deployment, and recovery processes.
  • Ability to collaborate closely with software engineering teams to influence architecture and operational readiness, ensuring reliability is built into the system from design through production.
  • AI Centric Engineering: Effectively leverages artificial intelligence (AI) and machine learning (ML) tools to automate, optimize, and enhance daily engineering and incident response tasks.
  • Analytical Rigor: Strong attention to detail in validating incident data and identifying trends or gaps in response.
  • DevOps & Architecture Knowledge: Understanding of full-stack systems, CI/CD pipelines, caching, scaling, and cloud-native infrastructure.
  • Metrics & Reporting: Capable of calculating and interpreting key metrics like MTTA (Mean Time to Acknowledge) and MTTR (Mean Time to Resolve).

Here are the responsibilities of this role when not tied to active on-call:

Post-Incident Review Development
  • Draft and deliver executive summaries post-incident.
  • Develop and coach teams on blameless postmortems.
  • Create templates, train facilitators, and help guide root cause analysis (e.g., 5 Whys, fishbone diagrams).
  • Maintain a central library of learnings and cross-cutting themes.

Incident Process Improvement
  • Actively support engineering teams during incidents by helping diagnose and resolve issues quickly.
  • Navigate and analyze data from observability platforms to make informed inferences about root causes.
  • Analyze the effectiveness of incident response to identify systemic reliability gaps.
  • Standardize incident response workflows (incident roles, comms, escalation paths).
  • Create or refine runbooks, incident command frameworks, and severity classification guides.

Metrics and Insights
  • Build dashboards around incident frequency, MTTR, MTTA, and recurrence rates.
  • Use incident data to drive reliability OKRs or engineering investments.

Tooling & AI Solutions
  • Partner with engineering teams to identify repetitive or high-impact tasks suitable for automation.
  • Develop, implement, and continuously improve custom scripts, bots, and AI-driven workflows for monitoring, alerting, and incident triage.
  • Evaluate and integrate emerging AI/ML technologies to optimize detection, root cause analysis, and reporting.
  • Ensure all tools and automations are secure, maintainable, and aligned with organizational standards and SRE best practices.
  • Document and socialize new tools and AI solutions, enabling adoption and knowledge sharing across teams.

Cross-Team Collaboration
  • Collaborate with Engineering Managers and Incident Commanders to gather and validate incident data.
  • Partner with product teams, infra, and leadership to socialize reliability best practices.
  • Act as a reliability "consultant" to squads that have impactful incidents.
  • Recommend enhancements to monitoring, alerting, and response processes to reduce future incident impact.

USD 101,500.00 - 169,100.00 per year

Compensation: Compensation includes a base salary of $101,500.00 - $169,100.00. The base salary may vary within the anticipated base pay range based on factors such as the ultimate location of the position and the selected candidate's knowledge, skills, and abilities. Position may be eligible for additional compensation that may include an incentive program.

Benefits: The Company offers eligible employees the flexibility to take as much vacation with pay as they deem consistent with their duties, the company's needs, and its obligations; seven paid holidays throughout the calendar year; and up to 160 hours of paid wellness annually for their own wellness or that of family members.
Employees are also eligible for additional paid time off in the form of bereavement leave, time off to vote, jury duty leave, volunteer time off, military leave, and parental leave.
