Job Description
Site Reliability Engineer
Onsite - Bay Area, CA
Relevant Skills and Experience
What You’ll Do (Day-to-Day)
Own and manage our cloud infrastructure (GCP or AWS, plus on-prem).
Build, maintain, and optimize Kubernetes clusters (including GPU-backed clusters).
Implement and improve CI/CD pipelines (GitHub Actions).
Write and maintain Infrastructure as Code (Terraform).
Monitor system health and performance using Grafana and other observability tools.
Ensure high availability, reliability, and uptime across platforms.
Handle infrastructure maintenance, upgrades, and scaling.
Administer and improve our platform architecture and apply general security best practices across the stack.
Note: This is an internal-facing role with no customer interaction.
Must-Have:
4+ years in SRE, DevOps, or Infrastructure Engineering
Solid experience with GCP or AWS (hybrid/on-prem a plus)
Experience with Kubernetes cluster management (GPU experience a bonus)
Hands-on experience with Terraform and CI/CD (GitHub Actions)
Experience with monitoring/observability (Grafana, etc.)
Strong understanding of high availability and infrastructure reliability
Familiarity with platform/cluster architecture and administration
Security mindset and ability to apply best practices
Nice-to-Have:
Startup experience (you enjoy building, not just maintaining)
Experience with scalable GPU infrastructure for AI/ML
Site Reliability Engineer • Mountain View, CA, United States