Talent.com
Site Reliability Engineer

Addison Group
Falls Church, VA, US
30+ days ago
Job type
  • Full-time
Job description

Title: Site Reliability Engineer

Location: Falls Church, VA

Salary: $110,000 - $130,000 / Year

Job Type: Full-Time | Exempt

No sponsorship available

BENEFITS

  • Health, Dental, Vision Insurance
  • 401(k) with immediate vesting
  • Tuition Assistance
  • Public Service Loan Forgiveness (PSLF) eligibility
  • Generous Paid Time Off
  • Dog-friendly office
  • Onsite gym
  • Health Savings Account (HSA) / Flexible Spending Account (FSA)
  • Employee Assistance Program (EAP)
  • Life and Disability Insurance
  • Pet Insurance
  • Trade Publication / Subscription Reimbursement
  • Paid Holidays, Vacation, and Sick Leave
  • Parental Leave

Job Description

We are seeking a Site Reliability Engineer (SRE) to help establish and shape a reliability engineering practice from the ground up. This is a unique opportunity to join a mission-driven environment and play a key role in ensuring the reliability, scalability, and performance of AWS-hosted business applications.

As part of a cross-functional engineering team, you will work to improve observability, automate operational processes, and lead incident response and continuous improvement efforts. This role is ideal for a mid-level engineer with cloud and software engineering experience who is eager to deepen their expertise in site reliability engineering, learn from senior staff, and help build a culture of reliability.

ESSENTIAL DUTIES AND RESPONSIBILITIES

  • Define and implement service-level indicators (SLIs) and service-level objectives (SLOs) for cloud-based applications.
  • Build, configure, and maintain monitoring, alerting, and dashboarding solutions using AWS CloudWatch, X-Ray, and third-party tools such as DataDome.
  • Leverage advanced AWS observability tools (e.g., CloudWatch Synthetics, Contributor Insights) to proactively monitor system health.
  • Contribute to the development and implementation of a structured on-call support process.
  • Implement, monitor, and maintain site protection and bot mitigation solutions to defend against automated attacks and ensure application availability.
  • Investigate incidents, security events, and operational anomalies, perform root cause analysis, and lead postmortem processes.
  • Identify operational inefficiencies (“toil”) and automate workflows using AWS Lambda and CloudFormation.
  • Assist in maintaining and enhancing CI/CD pipelines and deployment processes.
  • Collaborate with development, QA, cloud, and DevOps teams to ensure reliability, scalability, and security are embedded into system designs.
  • Document systems, processes, incident findings, compliance activities, and reliability best practices.
  • Stay current with AWS, SRE, and observability trends and recommend improvements.
  • Evaluate and support the rollout of new AWS services and features.
  • Perform other related duties as assigned.

KNOWLEDGE & SKILLS

  • Strong analytical, troubleshooting, and problem-solving abilities.
  • Hands-on experience with AWS CloudWatch (metrics, logs, dashboards, alarms).
  • Familiarity with AWS X-Ray for distributed tracing.
  • Experience with CloudWatch Synthetics and Contributor Insights for proactive testing and analysis.
  • Knowledge of AWS CloudTrail for auditing and investigations.
  • Experience using AWS Athena for log analysis.
  • Proficiency with AWS CloudFormation.
  • Experience automating workflows with AWS Lambda or similar tools.
  • Understanding of AWS services such as API Gateway, CloudFront, and Elastic Load Balancing (ELB).
  • Experience with site protection or bot mitigation tools (e.g., DataDome, Cloudflare).
  • Scripting or programming experience in Python, Bash, or Node.js.
  • Excellent communication and documentation skills.
  • Growth-oriented and eager to adopt emerging tools and practices.

REQUIREMENTS

  • Bachelor’s degree in computer science, engineering, or related field (or equivalent experience).
  • 3+ years of experience in cloud engineering, DevOps, infrastructure, or observability (AWS required).
  • Experience applying SRE principles (prior SRE experience preferred).
  • Background in monitoring, incident response, or reliability in production environments.
  • Experience working in Agile, cross-functional teams.
  • Passion for building and improving reliability practices.