Job Title : Principal Site Reliability Engineer
Location : Washington, D.C.
Employment Type : Contract
About Us :
DMV IT Service LLC, founded in 2020, is a trusted IT consulting firm specializing in IT infrastructure optimization, cybersecurity, networking, and staffing solutions. We partner with clients to achieve technology goals through expert guidance, workforce support, and innovative solutions. With a client-focused approach, we also provide online training and job placements, ensuring long-term IT success.
Job Purpose :
We are seeking a highly skilled Principal Site Reliability Engineer to lead and elevate the reliability, scalability, and security of critical infrastructure systems. This position requires a seasoned technical professional with deep expertise in infrastructure automation (IaC), CI/CD architecture, and cloud security, combined with hands-on experience in Site Reliability Engineering (SRE) principles such as SLOs, error budgets, and incident management. The ideal candidate will provide technical leadership, mentor cross-functional teams, and ensure systems are built for performance, resilience, and efficiency.
Requirements
Key Responsibilities :
- Reliability & Operations : Establish and manage Service Level Objectives (SLOs) and Service Level Indicators (SLIs); oversee incident response, root cause analysis, and continuous service improvement initiatives.
- Infrastructure Automation : Architect and manage scalable and secure cloud infrastructures using Infrastructure-as-Code (IaC) tools such as Terraform, Ansible, and CloudFormation.
- CI/CD Optimization : Build and optimize secure CI/CD pipelines (e.g., GitHub Actions, Jenkins) with automated rollbacks, canary and blue-green deployments, and artifact validation processes.
- Observability & Monitoring : Develop advanced observability systems by creating dashboards, configuring alerts, and implementing synthetic checks for complete system visibility.
- Security Integration : Embed security testing and compliance tools (SAST, DAST, SBOM, secret scanning) into deployment workflows and enforce security policies as code.
- Cost & Capacity Management : Track and optimize cloud costs, manage capacity planning, and ensure efficient infrastructure utilization and uptime.
- Platform Enablement : Develop self-service tools and shared frameworks that enhance developer efficiency and maintain delivery consistency.
- Leadership & Mentorship : Act as a technical leader, mentor engineering teams, and champion best practices in reliability, automation, and secure delivery.
Required Skills & Experience :
- Bachelor’s degree in Computer Science, Engineering, or related field.
- At least 5 years of experience in SRE, DevOps, or Platform Engineering, with leadership in reliability and automation.
- Minimum 3 years managing production-grade cloud systems using modern security and observability tools.
- Strong expertise in AWS, Azure, or GCP, especially in Compute, Networking, and IAM.
- Hands-on proficiency with Terraform, CloudFormation, Kubernetes, and Docker.
- Solid background in Linux systems, shell scripting, and programming in Python, Go, or Bash.
- Proficient with observability tools such as Prometheus, Grafana, ELK, Datadog, or CloudWatch.
- Proven experience designing and managing secure CI/CD pipelines and GitOps workflows.
- Deep understanding of SRE practices, including chaos engineering, SLO/SLA management, and capacity modeling.
- Strong documentation, communication, and leadership skills with a record of improving operational standards.