Talent.com
Senior DevOps and SRE Engineer

Black Rock Groups • Washington, DC, United States
7 days ago
Job type
  • Full-time
Job description

Randstad is seeking a highly experienced and technically proficient Senior DevOps and Site Reliability Engineer (SRE) to join our client in the DC Metro area. This critical, senior-level role is responsible for driving the reliability, performance, security, and scalability of high-availability production environments on AWS. The ideal candidate is a hands-on technical leader who blends deep expertise in software development, infrastructure-as-code, and observability to automate operational toil, lead capacity planning, and serve as a primary on-call responder for critical incidents. This role demands a strong focus on applying SRE principles (SLIs/SLOs/Error Budgets), mentoring team members, and proactively influencing cross-functional teams to achieve world-class operational excellence.

Responsibilities

Deployment & Automation Engineering

  • Implement, maintain, and optimize robust CI/CD pipelines utilizing tools such as GitHub Actions, AWS CodePipeline, and Jenkins.
  • Automate infrastructure provisioning and configuration management using Infrastructure-as-Code (IaC) tools like Terraform, CloudFormation, or AWS CDK.
  • Design and develop automation scripts and self-service tools to significantly enhance development and operational efficiency.
  • Apply proficiency in multiple programming languages (Python, Go, Java) to develop automation and troubleshoot applications.

Site Reliability & Observability

  • Serve as a production on-call responder, leading incident management and orchestrating the response to critical service outages and disaster recovery failover activities.
  • Facilitate detailed post-mortem meetings and drive systemic improvement patterns across teams.
  • Define, monitor, and enforce Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Error Budgets.
  • Expertly leverage observability tools (Dynatrace, AppDynamics, ELK Stack; Dynatrace strongly preferred) for proactive monitoring and troubleshooting.
  • Utilize distributed tracing and context propagation to identify performance bottlenecks and root causes of failures.
  • Design and implement custom dashboards and anomaly detectors to generate actionable insights.

Capacity, Performance & Cost Management

  • Develop sophisticated capacity models and forecasting systems to ensure service scalability.
  • Lead cost optimization initiatives, identifying and implementing efficiency gains across cloud services.
  • Design and execute comprehensive Resiliency and Performance testing frameworks.
  • Configure and maintain dynamic auto-scaling policies and thresholds for optimal resource utilization.

Security & Governance

  • Lead security incident investigations and execute swift remediation plans.
  • Design and implement automated compliance validation and security automation frameworks.
  • Drive the implementation of zero-trust architecture patterns within the cloud environment.
  • Proficiently apply ITIL framework principles, preferably leveraging ITSM tools such as ServiceNow.

Qualifications

Education & Experience

  • Bachelor's degree in Computer Science, Engineering, or a related technical field.
  • 5 to 8 years of progressive experience in DevOps, Site Reliability Engineering (SRE), or Platform Engineering.
  • 3+ years of experience maintaining and optimizing high-availability production environments.
  • Proven track record of leading complex technical initiatives from conception to completion.

Technical Expertise

  • Expert-level knowledge of at least one major cloud platform, with AWS strongly preferred.
  • Deep expertise in cloud architecture, networking, and core services.
  • High proficiency in IaC tools such as Terraform, CloudFormation, or AWS CDK.
  • Expert-level experience with observability and APM tools, with a strong preference for Dynatrace.
  • Proficiency in modern programming languages like Python, Go, or Java.
  • Knowledge of relational, cloud-native, and NoSQL database technologies.

Professional & Leadership Skills

  • Strong leadership and mentoring capabilities, with the ability to elevate the technical skills of the team.
  • Exceptional ability to influence without direct authority across engineering and product teams.
  • Excellent technical writing and documentation skills (e.g., RCA development, Knowledge articles).
  • Ability to maintain flexible availability for on-call duties and to work outside of standard business hours as required for incident response.

This is a high-priority, proactive requisition.

Background Check: No

Drug Screen: No
