Role Summary
We are seeking an experienced Monitoring Lead with strong expertise in observability tools such as Datadog, Grafana, and Kibana. The candidate will lead the design, implementation, and management of monitoring and observability solutions to ensure system reliability, performance, and availability across enterprise applications and infrastructure.
Key Responsibilities
- Lead the observability and monitoring strategy across infrastructure, applications, and services.
- Configure, maintain, and optimize monitoring tools (Datadog, Grafana, Kibana) for real-time visibility.
- Develop dashboards, alerts, and metrics to support proactive incident detection and resolution.
- Collaborate with DevOps, Cloud, and Application teams to define SLAs, SLOs, and SLIs and ensure service reliability.
- Implement log aggregation, tracing, and metrics collection to improve end-to-end observability.
- Troubleshoot performance bottlenecks and identify root causes using monitoring insights.
- Provide leadership, guidance, and training to teams on best practices for observability and monitoring.
- Stay updated on emerging monitoring technologies and recommend adoption where relevant.
Required Skills & Experience
- 8-12 years of experience in IT operations / DevOps, including at least 3 years in monitoring and observability leadership roles.
- Strong hands-on experience with Datadog, Grafana, and Kibana.
- Knowledge of observability practices (metrics, logs, traces).
- Experience with cloud platforms (AWS, Azure, GCP) and containerized environments (Kubernetes, Docker).
- Proficiency in scripting (Python, Shell, PowerShell, etc.) for automation of monitoring tasks.
- Excellent troubleshooting, analytical, and communication skills.
Good to Have
- Experience with Prometheus, Elastic Stack, Splunk, or New Relic.
- Familiarity with Site Reliability Engineering (SRE) practices.
- Exposure to infrastructure-as-code tools (Terraform, Ansible).