Job Summary
We are seeking an experienced Site Reliability Engineer (SRE) to join the Applied AI and Data Science program. This role focuses on deploying, monitoring, and optimizing cloud-based applications and infrastructure to ensure high availability and performance. The ideal candidate will have strong expertise in AWS, containerized microservices, infrastructure automation, and monitoring tools.
Key Responsibilities:
- Release Management: Build and deploy application, service, and infrastructure releases; validate system integrity post-deployment; document release notes.
- Production Support: Maintain 99.999% availability of critical systems; monitor infrastructure and applications; perform root cause analysis for outages; respond to incidents.
- Monitoring & Alerting: Implement monitoring policies; build dashboards; track system efficiency and resource consumption; alert stakeholders to SLA deviations.
- Optimization: Manage resource scaling; optimize system performance and resource utilization.
- Team Collaboration: Assist with user support; coordinate with onshore/offshore teams; develop bug fixes; become an expert in system architecture and deployment pipelines.
Required Qualifications:
- 6+ years of DevOps or SRE experience in large, complex environments.
- Strong background in software development (OOP) and ability to read/debug code.
- Expertise in AWS services (EKS, S3, DocumentDB) and Terraform for Infrastructure as Code.
- Experience with Kubernetes, containerized microservices, and cloud deployments.
- Proficiency with GitLab or similar CI/CD tools for pipeline management.
- Hands-on experience with monitoring tools such as Datadog or Splunk.
- Bachelor's degree in a related field or equivalent experience.
Preferred Qualifications:
- Familiarity with Python, Node.js, React, TypeScript, and GraphQL.
- Exposure to relational (SQL) and NoSQL databases.
- Experience with Docker, Redis, and ORM frameworks.
- Knowledge of experimentation, statistical testing, and data analysis.
- Master's degree in a related field is a plus.