Talent.com
Senior Site Reliability Engineer

Zealogics.com • Alpharetta, GA, US
2 days ago
Job type
  • Full-time
Job description

Key Responsibilities

Infrastructure & Automation

  • Design, deploy, and manage cloud infrastructure across AWS and Azure using Terraform and infrastructure-as-code principles
  • Architect, deploy, and maintain production-grade Kubernetes clusters with a focus on reliability, security, and performance
  • Serve as the subject matter expert on Kubernetes, providing guidance and best practices to engineering teams
  • Build and maintain automated provisioning pipelines to ensure consistent, repeatable deployments
  • Implement and maintain HashiCorp Vault on AWS for secrets management and security, including Vault integration with Kubernetes
  • Design and implement automated High Availability and Disaster Recovery (HA/DR) capabilities through CI/CD pipelines
  • Optimize cloud resources and Kubernetes workloads for performance, cost efficiency, and reliability

Observability & Monitoring

  • Architect and implement comprehensive observability solutions using Datadog for cloud-native applications and Kubernetes infrastructure
  • Build monitoring, logging, and alerting frameworks for containerized workloads that provide actionable insights into system health
  • Implement Kubernetes-native monitoring patterns and troubleshoot complex container orchestration issues
  • Integrate Datadog with PagerDuty and other incident management platforms
  • Define and track SLIs, SLOs, and error budgets to drive reliability improvements
  • Create custom dashboards and monitors to track infrastructure, application, and Kubernetes cluster performance

CI/CD & Pipeline Management

  • Design, build, and maintain robust CI/CD pipelines that enable rapid, safe deployments to Kubernetes
  • Implement GitOps workflows and automated deployment strategies for containerized applications
  • Implement automated testing, security scanning, and quality gates within pipelines
  • Drive solutions through test, QA, and production environments with appropriate controls and safeguards
  • Automate deployment strategies including blue-green, canary, and rolling deployments in Kubernetes

Security & Vulnerability Management

  • Identify, assess, and remediate security vulnerabilities in infrastructure, applications, and Kubernetes clusters
  • Implement Kubernetes security best practices including RBAC, pod security policies/standards, and network policies
  • Collaborate with security teams to implement and maintain security best practices
  • Manage and maintain HashiCorp Vault infrastructure for secure secrets management
  • Ensure compliance with security policies and industry standards across all environments

Incident Management & Response

  • Participate in 24/7 on-call rotation to respond to critical production incidents
  • Serve as Incident Commander, coordinating cross-functional response teams during major outages
  • Lead post-incident reviews and drive thorough root cause analysis across engineering teams
  • Troubleshoot complex Kubernetes and distributed systems issues under pressure
  • Develop and refine incident response procedures and runbooks

Collaboration & Leadership

  • Partner with engineering teams to improve system reliability and performance
  • Mentor junior SREs and promote SRE best practices across the organization
  • Lead Kubernetes adoption efforts and educate teams on container orchestration best practices
  • Drive initiatives to reduce toil through automation and process improvement
  • Contribute to architectural decisions with a reliability and operability lens

Required Qualifications

  • 5+ years of experience in Site Reliability Engineering, DevOps, or similar roles
  • Expert-level knowledge of Kubernetes, including architecture, operations, and troubleshooting in production environments
  • Proven track record as a go-to Kubernetes resource and technical authority
  • Deep understanding of container technologies (Docker, containerd) and orchestration patterns
  • Strong hands-on experience with AWS and Azure cloud platforms
  • Proficiency in Terraform for infrastructure automation and management
  • Expert-level knowledge of Datadog for monitoring, logging, and observability
  • Experience with HashiCorp Vault, including deployment and management on AWS and Kubernetes integration
  • Deep understanding of CI/CD pipelines, including design, implementation, and optimization for containerized workloads
  • Proven ability to implement automated HA/DR solutions through CI/CD workflows
  • Strong programming skills in Python for automation, tooling, and analysis
  • Proven experience building observability solutions for distributed cloud applications
  • Experience configuring monitoring and alerting systems and integrating with paging platforms like PagerDuty
  • Demonstrated experience identifying and remediating security vulnerabilities
  • Experience driving deployments through multiple environments (test/QA/production) with proper gates and controls
  • Demonstrated experience participating in on-call rotations and responding to production incidents
  • Experience serving as Incident Commander or leading incident response efforts
  • Track record of conducting root cause analysis and driving systemic improvements
  • Strong understanding of networking, security, and cloud architecture principles
  • Excellent communication skills with ability to work across multiple teams and explain complex Kubernetes concepts

Preferred Qualifications

  • Experience with Google Cloud Platform (GCP) and GKE
  • Certified Kubernetes Administrator (CKA) or Certified Kubernetes Security Specialist (CKS)
  • Experience with service mesh technologies (Istio, Linkerd, Consul)
  • Knowledge of Helm, Kustomize, and other Kubernetes tooling
  • Experience with GitOps tools (ArgoCD, Flux)
  • Familiarity with additional CI/CD tools (Jenkins, GitLab CI, GitHub Actions, CircleCI)
  • Experience with configuration management tools (Ansible, Chef, Puppet)
  • Background in software engineering or systems programming
  • Understanding of chaos engineering and reliability testing methodologies
  • Experience with cost optimization strategies in cloud and Kubernetes environments
  • Security certifications (AWS Security Specialty, CISSP, CKS, etc.)
  • Experience with compliance frameworks (SOC 2, ISO 27001, etc.)
  • Contributions to open-source Kubernetes projects or active participation in the Kubernetes community

Rate range: $50-$55