Manage the client's on-prem infrastructure. Maintain uptime, reliability, and readiness of the on-prem engineering cloud spread across multiple data centers.
Uphold service level agreements (SLAs) for critical engineering services. Implement monitoring, alerting, and incident response procedures to ensure adherence to defined performance targets. Perform root cause analysis and post-mortems for incidents that breach defined thresholds.
Set up and manage monitoring and logging tools such as Prometheus, Grafana, or the ELK Stack to oversee system health and performance. Maintain KPI pipelines using Jenkins, Python, and ELK.
Improve monitoring systems by adding custom alerts based on business needs.
Support capacity planning, resource optimization, and efforts to improve utilization.
Create and maintain documentation for operational procedures, configurations, and troubleshooting guides.
Skills:
Hands-on experience with on-prem SRE and infrastructure operations
Strong in monitoring and observability using Prometheus, Grafana, and ELK, with KPI pipeline integration via Jenkins and Python
Proficient in automation and scripting using Jenkins, Python, Go, and Bash