Job Description
We are looking for a Senior Site Reliability Engineer (SRE) with deep experience in AWS infrastructure, automation, observability, and production support. As an SRE, you will ensure our cloud-native systems are resilient, scalable, and efficient, driving reliability through code, not just processes.
Requirements
Key Responsibilities:
Design, implement, and maintain scalable, secure, and highly available infrastructure on AWS
Develop and improve CI/CD pipelines and Infrastructure as Code (IaC) using Terraform and Harness
Own and implement monitoring, alerting, logging, and distributed tracing with tools like Dynatrace or Datadog
Troubleshoot production incidents, conduct blameless postmortems, and improve incident response processes
Optimize systems for cost, performance, and reliability
Drive chaos engineering and resilience testing
Collaborate with development teams to embed SRE practices like SLAs, SLOs, and error budgets
Mentor junior SREs and promote DevOps/SRE culture across the organization
Basic Qualifications:
Strong experience in SRE, DevOps, or Cloud Engineering
Expertise in AWS core services (EC2, ECS/EKS, Lambda, S3, VPC, RDS, IAM, CloudFront, etc.)
Hands-on experience with Terraform, Ansible, or other IaC tools
Strong scripting/coding skills (Python, Go, Shell, etc.)
Experience with Kubernetes, containerization, and orchestration
Deep knowledge of Linux systems and networking
Preferred Qualifications:
Experience with Service Meshes (e.g., Istio, App Mesh)
Familiarity with AWS Well-Architected Framework
Experience building self-healing systems and automated remediation
Background in security, compliance, or multi-account/multi-region AWS architectures
Certifications (Optional/Preferred):
AWS Certified DevOps Engineer – Professional
AWS Certified Solutions Architect – Professional
Reliability Engineer • Chicago, IL, US