Job Description
About the Role
We are seeking a highly skilled Senior Site Reliability Engineer to join our team. In this role, your responsibilities will include designing and implementing infrastructure automation, building continuous integration and delivery pipelines, and monitoring and scaling the infrastructure that powers our healthcare AI platform. You will work closely with software engineers, research scientists, and other cross-functional teams to develop and maintain reliable, scalable infrastructure that enables rapid iteration and deployment of our products.
Key Responsibilities
- Design and implement infrastructure automation and deployment pipelines using tools such as Terraform
- Implement and maintain monitoring and logging systems to ensure the reliability and performance of our healthcare AI platform
- Work closely with software engineers to design and deploy scalable, fault-tolerant, and secure production systems on cloud platforms such as AWS, GCP, or Azure
- Develop and maintain security and compliance policies and procedures for our healthcare AI platform
- Collaborate with cross-functional teams to troubleshoot and resolve complex issues related to infrastructure, deployment, and operations
- Implement and maintain disaster recovery and business continuity plans
- Develop and maintain documentation related to infrastructure, deployment, and operations
- Mentor and provide technical guidance to junior engineers
Qualifications
- Bachelor's or Master's degree in Computer Science, Computer Engineering, or a related field
- At least 5 years of professional experience as an SRE
- Strong skills in building cloud infrastructure orchestration systems (Operators) using Python or Go
- Expertise in infrastructure automation and deployment tools such as Terraform or GitLab CI/CD
- Experience with cloud platforms such as AWS, GCP, or Azure
- Strong knowledge of containerization technologies such as Docker and Kubernetes
- Experience with monitoring and logging tools such as ELK, Grafana, or Datadog
- Familiarity with security and compliance best practices and tools such as HashiCorp Vault, AWS KMS, or Azure Key Vault
- Strong problem-solving skills and ability to work independently and collaboratively in a team environment
- Excellent communication and interpersonal skills
- Experience implementing HIPAA and SOC2 compliance is a plus
- Experience working in an HPC environment is a plus