We are seeking a highly skilled Site Reliability Engineer (SRE) to join a dynamic team at a rapidly growing technology company. As an SRE, you will be responsible for ensuring the reliability, scalability, and performance of mission-critical systems, while implementing automation and optimizing cloud infrastructure. This role offers the opportunity to work with cutting-edge AI/ML technologies, leveraging them to solve complex challenges in cloud infrastructure management and performance optimization.
Key Responsibilities :
System Reliability & Performance : Design, implement, and maintain scalable systems, ensuring high availability, performance, and disaster recovery across production environments.
Automation & Tool Development : Develop automation tools to streamline operations, improve system reliability, and reduce manual interventions.
Cloud Infrastructure Management : Create and manage cloud environments (e.g., dev, staging, production) on AWS, GCP, or Azure, optimizing infrastructure performance and cost.
Integration of AI/ML Models : Collaborate with engineering teams to integrate machine learning models into production environments, ensuring that these models scale efficiently and perform optimally.
Incident Management : Respond to and resolve incidents, minimizing downtime and ensuring quick recovery. Lead post-incident reviews and implement preventive measures.
Continuous Improvement : Identify areas of improvement and drive initiatives to enhance system reliability, performance, and security.
Security & Compliance : Ensure that infrastructure and applications adhere to security best practices and compliance standards.
Qualifications :
Educational Background : Bachelor’s degree in Computer Science, Engineering, or a related field (or equivalent experience).
Experience : Proven experience as a Site Reliability Engineer or in a similar role within a SaaS environment, including managing and optimizing cloud infrastructure (preferably AWS, GCP, or Azure), plus familiarity with integrating AI and machine learning technologies.
Technical Skills :
Proficiency in programming and scripting languages such as Python, Go, or Bash.
Experience with containerization and orchestration tools like Docker and Kubernetes.
Solid understanding of networking, security, and performance optimization practices.
Knowledge of CI/CD pipelines and DevOps practices to ensure smooth development and deployment cycles.
Problem-Solving : Strong analytical and problem-solving skills with attention to detail.
Collaboration & Communication : Excellent interpersonal skills, with the ability to work collaboratively in cross-functional teams and communicate technical concepts clearly.
Benefits :
Competitive Salary : Attractive compensation package, including equity options.
Health & Wellness : Comprehensive health, dental, and vision insurance, along with other benefits.
Work Environment : A collaborative and innovative work environment within a growing company.
Growth Opportunities : Opportunities for career growth, professional development, and a chance to shape the future of the company’s technology and infrastructure.