Clientmind Recruiting is searching for a Site Reliability Engineer for a growing tech company based in the Bethesda, MD area. The role is onsite one day per week (Tuesday).
This role centers on maintaining the “common” IaC constructs (Python-based abstractions in AWS CDK and CDK8s) that define the company's platform. These include networking, EKS configuration, data stores, observability, autoscaling patterns, and deployment primitives. You’ll work closely with backend engineers to make infrastructure safe, consistent, and easy to adopt.
Responsibilities
- Design, implement, and evolve shared CDK and CDK8s constructs used by multiple services and teams.
- Maintain base infrastructure components: VPC, EKS, node groups, RDS, OpenSearch, and MSK.
- Operate and extend Kubernetes cluster add-ons: ingress controllers, cert‑manager, autoscaler, and monitoring/logging stacks.
- Ensure high reliability through well‑structured alerting (Prometheus, CloudWatch), autoscaling, and recovery patterns.
- Manage and publish baseline templates, configuration schemas, and documentation for infrastructure usage.
- Own the CI/CD processes for IaC codebases and platform component releases.
- Collaborate with engineering teams to diagnose infrastructure issues and propose robust solutions.
- Apply SRE principles (SLIs/SLOs, observability, fault tolerance) to all shared platform services.
- Support IAM roles, secrets management, and tenant isolation patterns.
Required Experience
- 5+ years of infrastructure or SRE experience, including AWS (VPC, IAM, RDS, MSK, S3) and Kubernetes (Helm, RBAC, ServiceAccounts).
- Fluency in Python and experience with Infrastructure-as-Code using AWS CDK, CDK8s, or equivalent frameworks.
- Strong understanding of Prometheus, Grafana, and alert routing practices.
- Experience designing reusable infrastructure patterns or internal developer platforms.
- Proven ability to improve reliability through automation, monitoring, and operational best practices.
Nice to Have
- Experience supporting Spark on Kubernetes, Argo, or Kafka‑based batch pipelines.
- Awareness of cost‑efficiency strategies across EC2, storage, and autoscaling.