Senior Site Reliability Engineer

Gridware • San Francisco, CA, US

16 days ago

Job type

Full-time

Job description

About Gridware

Gridware is a San Francisco-based technology company dedicated to protecting and enhancing the electrical grid. We pioneered a groundbreaking new class of grid management called active grid response (AGR), focused on monitoring the electrical, physical, and environmental aspects of the grid that affect reliability and safety. Gridware’s advanced Active Grid Response platform uses high-precision sensors to detect potential issues early, enabling proactive maintenance and fault mitigation. This comprehensive approach helps improve safety, reduce outages, and ensure the grid operates efficiently. The company is backed by climate-tech and Silicon Valley investors. For more information, please visit www.Gridware.io.

Role Description

We are seeking a Senior Site Reliability Engineer to design, build, and maintain the infrastructure powering our modern, cloud-native applications. In this role, you will design and implement scalable and secure platforms on AWS, leveraging Kubernetes (EKS) and ArgoCD for GitOps-driven deployments. You’ll be responsible for building and optimizing CI/CD pipelines with GitHub Actions, managing event streaming with Amazon MSK, and maintaining reliable relational databases on RDS. You will own our Infrastructure as Code strategy with Terraform and drive best practices around security, identity management (IdP integrations), and cost optimization.

You will also play a key role in observability and platform reliability, building and maintaining monitoring and logging solutions with tools like Grafana, Loki, and Prometheus to ensure system performance and resilience. The successful candidate will work closely with our Cloud Security Engineer to enforce security standards, implement best practices, and ensure compliance across the infrastructure stack. This is a highly collaborative position where you’ll partner with engineering teams to deliver reliable environments, automate deployments, and improve developer velocity while staying ahead of modern DevOps and cloud-native practices.

What You’ll Do

Design, build, and maintain scalable, secure, and highly available infrastructure on AWS (EKS, EC2, RDS, MSK, S3, VPC …).
Manage and optimize Kubernetes clusters (EKS) and deploy applications using ArgoCD with GitOps best practices.
Implement and maintain CI/CD pipelines using GitHub Actions (GHA), ensuring fast, reliable, and automated software delivery.
Build and support Kafka-based event streaming platforms using Amazon MSK for high-throughput, low-latency data pipelines.
Manage identity and access across platforms with IdP integration (Okta, Auth0, or similar).
Define and manage Infrastructure as Code with Terraform.
Monitor, troubleshoot, and optimize system performance, cost, and reliability using observability tools like Grafana and Loki.

What We’re Looking For

5+ years in DevOps / SRE / Platform Engineering, with production experience in AWS infrastructure management.

Deep knowledge of Kubernetes administration and GitOps tools like ArgoCD.

Proficiency in Infrastructure as Code with Terraform.

Hands-on experience with CI/CD automation and pipelines (preferably GitHub Actions).

Expertise in running and maintaining distributed systems such as Kafka on MSK and relational databases (RDS).

Strong understanding of networking, security best practices, and IdP-driven access control.

Experience with monitoring and logging solutions (Grafana, Loki, Prometheus, or similar).

Ability to debug complex production issues across infrastructure, deployment, and networking layers.

Bonus Points

Familiarity with Databricks or MLOps pipelines for data and model deployment.

Experience with Terragrunt.

Knowledge of multi-cloud or hybrid cloud environments and container security tools.

This describes the ideal candidate; many of us have picked up this expertise along the way. Even if you meet only part of this list, we encourage you to apply!

Benefits

Health, Dental & Vision (Gold and Platinum, with some providers' plans fully covered)

Paid parental leave

Alternating day off (every other Monday)

“Off the Grid”, a two-week paid break each year for all employees.

Commuter allowance

Company-paid training
