Senior DevOps and SRE Engineer

Black Rock Groups • Washington, DC, United States

7 days ago

Job type

Full-time

Job description

Randstad is seeking a highly experienced and technically proficient Senior DevOps and Site Reliability Engineer (SRE) to join our client in the DC Metro area. This critical, senior-level role is responsible for driving the reliability, performance, security, and scalability of high-availability production environments on AWS. The ideal candidate is a hands-on technical leader who blends deep expertise in software development, infrastructure-as-code, and observability to automate operational toil, lead capacity planning, and serve as a primary on-call responder for critical incidents. This role demands a strong focus on applying SRE principles (SLIs / SLOs / Error Budgets), mentoring team members, and proactively influencing cross-functional teams to achieve world-class operational excellence.

Responsibilities

Deployment & Automation Engineering

Implement, maintain, and optimize robust CI / CD pipelines utilizing tools such as GitHub Actions, AWS CodePipeline, and Jenkins.
Automate infrastructure provisioning and configuration management using Infrastructure-as-Code (IaC) tools like Terraform, CloudFormation, or AWS CDK.
Design and develop automation scripts and self-service tools to significantly enhance development and operational efficiency.
Apply proficiency in multiple programming languages (Python, Go, Java) to develop automation and troubleshoot applications.

Site Reliability & Observability

Serve as a production on-call responder, leading incident management and orchestrating the response to critical service outages and disaster recovery failover activities.

Facilitate detailed post-mortem meetings and drive systemic improvement patterns across teams.

Define, monitor, and enforce Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Error Budgets.

Expertly leverage observability tools (Dynatrace, AppDynamics, ELK Stack; Dynatrace strongly preferred) for proactive monitoring and troubleshooting.

Utilize distributed tracing and context propagation to identify performance bottlenecks and root causes of failures.

Design and implement custom dashboards and anomaly detectors to generate actionable insights.

Capacity, Performance & Cost Management

Develop sophisticated capacity models and forecasting systems to ensure service scalability.

Lead cost optimization initiatives, identifying and implementing efficiency gains across cloud services.

Design and execute comprehensive Resiliency and Performance testing frameworks.

Configure and maintain dynamic auto-scaling policies and thresholds for optimal resource utilization.

Security & Governance

Lead security incident investigations and execute swift remediation plans.

Design and implement automated compliance validation and security automation frameworks.

Drive the implementation of zero-trust architecture patterns within the cloud environment.

Proficiently apply ITIL framework principles, preferably leveraging ITSM tools such as ServiceNow.

Qualifications

Education & Experience

Bachelor's degree in Computer Science, Engineering, or a related technical field.

5 to 8 years of progressive experience in DevOps, Site Reliability Engineering (SRE), or Platform Engineering.

3+ years of experience maintaining and optimizing high-availability production environments.

Proven track record of leading complex technical initiatives from conception to completion.

Technical Expertise

Expert-level knowledge of at least one major cloud platform, with AWS strongly preferred.

Deep expertise in cloud architecture, networking, and core services.

High proficiency in IaC tools such as Terraform, CloudFormation, or AWS CDK.

Expert-level experience with observability and APM tools, with a strong preference for Dynatrace.

Proficiency in modern programming languages like Python, Go, or Java.

Knowledge of relational, cloud-native, and NoSQL database technologies.

Professional & Leadership Skills

Strong leadership and mentoring capabilities, with the ability to elevate the technical skills of the team.

Exceptional ability to influence without direct authority across engineering and product teams.

Excellent technical writing and documentation skills (e.g., RCA development, Knowledge articles).

Ability to maintain flexible availability for on-call duties and to work outside of standard business hours as required for incident response.

This is a high-priority, proactive requisition.

Background Check : No

Drug Screen : No
