Site Reliability Engineer (SRE)

ShiftPixy Resources Inc • District of Columbia, WA, United States
13 hours ago
Job type
  • Full-time
Job description

Responsibilities

Deployment & Automation

  • Implement and maintain CI/CD pipelines using tools such as GitHub Actions, AWS CodePipeline, and Jenkins.
  • Automate infrastructure provisioning and management using Infrastructure-as-Code (IaC) with Terraform, CloudFormation, or AWS CDK.
  • Develop robust automation scripts and self-service tooling to minimize toil and enhance operational efficiency.

Capacity, Performance & Cost Optimization

  • Lead and implement operational cost optimization initiatives across cloud infrastructure and data platforms.
  • Configure, maintain, and tune auto-scaling policies and performance thresholds.
  • Develop and execute resiliency test plans and provide critical support for performance testing efforts.

Incident Management & SRE Principles

  • Serve as a production on-call responder, employing strong troubleshooting skills to quickly resolve complex incidents.
  • Apply ITIL framework concepts and use ITSM tools (e.g., ServiceNow) for incident and change management.
  • Develop high-quality Root Cause Analysis (RCA) documentation and knowledge articles to prevent recurrence.
  • Implement and enforce SRE principles, including the definition and tracking of Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Error Budgets.

Observability & Monitoring

  • Manage and leverage advanced observability platforms (Dynatrace preferred; also AppDynamics, ELK, etc.).
  • Implement distributed tracing with accurate context propagation across data services and applications.
  • Optimize monitoring queries and configure actionable dashboards, alerts, and anomaly detectors using tools like Dynatrace and Kibana.

Data Analytics Platform Reliability

  • Ensure reliability, performance tuning, and access control for Databricks clusters and data pipelines.
  • Maintain Informatica workflow orchestration, connector reliability, and error handling for critical data flows.
  • Manage Power BI gateway health and access control, and ensure reliable data refresh processes.

Security & Compliance

  • Manage service accounts, access permissions, and roles following the principle of least privilege.
  • Create, deploy, and manage digital certificates and TLS/SSL configurations.
  • Execute effective remediation tasks and respond to security incidents as part of the operational team.

Qualifications

Education & Experience

  • Bachelor's degree in Computer Science, Engineering, or a related technical field.
  • 2 to 4 years of hands-on experience in a DevOps, Site Reliability Engineering (SRE), or Cloud Infrastructure role.
  • Practical, working experience with major cloud platforms, specifically AWS and Azure.

Technical Skills

  • Mid-level proficiency in Python or another scripting or automation language (e.g., Bash, Go) for automation tasks.
  • Mid-level proficiency with Configuration Management tools, including Ansible.
  • Strong knowledge of containerization technologies (Docker, Kubernetes/ECS).
  • Solid understanding of Linux systems and networking fundamentals (TCP/IP, DNS, Load Balancing).
  • Working knowledge of relational, cloud-native (e.g., AWS RDS), and NoSQL database technologies.
  • Direct hands-on experience supporting and maintaining data platforms like Databricks, Informatica, or Power BI is highly desirable.

Professional Attributes

  • Excellent written and verbal communication skills, with a proven ability to document complex systems.
  • Demonstrated ability to work independently, manage shifting priorities, and drive initiatives to completion.
  • Availability for on-call duties and for work outside standard business hours as required to support a 24/7 production environment.