Our client is looking for a remote SRE!
What You'll Do:
- Configure and maintain cloud infrastructure automation using Terraform, focusing on CDN optimization and content delivery performance
- Develop capacity planning strategies and performance optimization initiatives for high-volume spatial content delivery
- Instrument services to understand system health. Create and optimize monitoring dashboards and alerting systems that provide actionable insights into streaming performance and user experience
- Design and implement comprehensive observability strategies for spatial streaming services, including SLI/SLO definition and error budget management
- Help define escalation policies and participate in on-call rotations to ensure the 24/7 health of their pipelines
- Lead incident response efforts and conduct thorough post-mortems to drive systemic improvements across web services
- Establish reliability engineering practices within the Web Services team, including code review processes and deployment safety measures
- Mentor DevOps engineers on operational best practices, reliability patterns, and production readiness standards
What You'll Bring:
- 7+ years of SRE or DevOps experience with a proven track record of improving system reliability and operational practices
- Strong expertise in cloud platforms (AWS Fargate, CoreWeave), including infrastructure automation with Terraform and container orchestration with Kubernetes
- Deep understanding of multi-tenant architecture security, data protection principles, and threat modeling for customer data handling systems
- Experience with observability principles and hands-on experience in monitoring tools (Prometheus, Grafana) with reporting capabilities
- Experience implementing automated compliance monitoring (SOC 2, GDPR, ISO 27001, FOSS licensing)
- Excellent mentoring and technical leadership skills with the ability to influence engineering teams and drive adoption of security-first reliability practices
No C2C, 3rd parties, or sponsorship