Senior Site Reliability Engineer

Eliassen Group | Concord, CA, US

30+ days ago

Job type

Full-time

Job description

Hybrid | Concord, CA

We are seeking a Senior Site Reliability Engineer (SRE) to join our Digital Platform Engineering team and play a critical role in ensuring the reliability, scalability, and performance of our infrastructure and applications. This role supports our 24x7 production environments and is instrumental in driving zero-downtime operations across both containerized and VM-based workloads.

In addition to supporting day-to-day operations, this engineer will be a key contributor to two major transformation initiatives :

A large-scale migration from Tanzu Application Service (TAS) to Red Hat OpenShift, requiring deep expertise in container orchestration, traffic management, and workload optimization.

A major migration from legacy datacenters to next-generation datacenter environments, involving modernization of infrastructure, deployment strategies, and operational readiness.

The ideal candidate is a seasoned engineer with deep expertise in Java application performance, Kubernetes, distributed systems, and observability, and is comfortable operating in complex, hybrid environments.

Due to a client requirement, applicants must be willing and able to work on a W2 basis. For our W2 consultants, we offer a benefits package that includes medical, dental, and vision benefits, 401(k) with company matching, and life insurance.

Rate : $70 - $80 / hr. W2

Responsibilities :

Production Support & Escalation : Serve as a senior escalation point for Platform Engineers, providing expert troubleshooting for complex production issues.

Java Application Performance : Diagnose and resolve JVM-related issues including heap sizing, garbage collection tuning, thread management, and performance optimization.

Container Orchestration : Manage and optimize high-volume, enterprise-grade Red Hat OpenShift or Kubernetes clusters for high availability, scalability, and fault tolerance. Ensure production readiness and operational excellence across complex, multi-tenant environments.

VM-Based Environments : Support applications running on Red Hat Enterprise Linux (RHEL) hosted on virtual machines, ensuring seamless integration with containerized workloads.

Cluster Resource Management : Configure and monitor Kubernetes namespace quotas, Horizontal Pod Autoscalers (HPA), health probes, and overall cluster capacity. Apply a strong understanding of FinOps principles to optimize resource usage and manage infrastructure costs effectively.

Traffic Management : Configure and support load balancing technologies such as F5 and AVI Networks, including Global Traffic Management (GTM / GSLB) and Local Traffic Management (LTM).

Service Mesh : Implement and manage Istio or similar service mesh technologies, including gateway configuration, traffic routing, and observability.

Monitoring & Observability : Design and implement robust monitoring and alerting solutions using tools like AppDynamics, Elastic, Kiali, Splunk, Prometheus, and Grafana.

Distributed Tracing : Use distributed tracing tools such as Splunk Observability or Elastic APM to troubleshoot performance bottlenecks and latency issues across microservices.

Dashboard Creation : Build and maintain dashboards that provide actionable insights into system health, performance, and reliability using tools like Grafana and Splunk.

Migration Support : Play a key role in the migration from Tanzu Application Service (TAS) to Red Hat OpenShift, ensuring continuity, performance, and reliability throughout the transition.

Datacenter Modernization : Support the migration of workloads from legacy datacenters to next-generation datacenter environments, contributing to architecture design, deployment strategies, and operational readiness.

OpenShift Onboarding Acceleration : Identify and eliminate friction points in the onboarding process for Red Hat OpenShift, automate repetitive tasks, and implement efficiency improvements to accelerate team adoption and reduce time-to-production.

Performance Testing & Tuning : Design and execute performance tests to validate application behavior under load. Analyze results to ensure applications are properly sized and tuned before deployment to production environments.

Incident Response : Lead incident response efforts, conduct root cause analysis, and implement long-term fixes to prevent recurrence.

Training & Knowledge Sharing : Conduct training sessions and facilitate knowledge transfer on troubleshooting techniques, operational processes, and best practices to upskill team members and improve overall system reliability.

Standards & Risk Mitigation : Help define, report on, and enforce operational standards that promote system reliability, reduce risk, and ensure consistency across environments. Collaborate with teams to drive adoption of best practices and improve overall platform resilience.

Collaboration & Documentation : Partner with Platform Engineers, Developers, and other stakeholders to ensure smooth deployment and operation of Java-based applications. Maintain clear documentation for systems and troubleshooting procedures.

Experience Requirements :

Experience : 5+ years in Site Reliability Engineering, DevOps, or Infrastructure roles.

Container Platforms : Hands-on experience with Red Hat OpenShift or Kubernetes in large-scale, highly available enterprise environments. Experience must go beyond lab or small-scale setups and include real-world production deployments supporting mission-critical workloads.

VM Environments : Experience supporting workloads on RHEL running on virtual machines.

Java Expertise : Strong understanding of JVM internals, garbage collection strategies, and performance tuning.

Traffic Management : Experience with F5, AVI Networks, GTM / GSLB, and LTM configurations.

Service Mesh : Hands-on experience with Istio or similar technologies.

Monitoring Tools : Proficiency with AppDynamics, Splunk Cloud, Splunk Observability, Prometheus, Grafana, or similar.

Distributed Tracing : Experience using tools like Splunk Observability or Elastic APM for troubleshooting distributed systems.

High Availability : Proven experience supporting highly available, distributed systems with zero-downtime requirements.

Cloud Platforms : Experience with AWS, Azure, or GCP is a plus.

Communication : Strong analytical and communication skills with the ability to work effectively across teams.

Skills, experience, and other compensable factors will be considered when determining pay rate. The pay range provided in this posting reflects a W2 hourly rate; other employment options may be available that may result in pay outside of the provided range.

W2 employees of Eliassen Group who are regularly scheduled to work 30 or more hours per week are eligible for the following benefits : medical (choice of 3 plans), dental, vision, pre-tax accounts, other voluntary benefits including life and disability insurance, 401(k) with match, and sick time if required by law in the worked-in state / locality.

Please be advised : If anyone reaches out to you about an open position connected with Eliassen Group, please confirm that they have an Eliassen.com email address, and never provide personal or financial information to anyone who is not clearly associated with Eliassen Group. If you see any indication of fraudulent activity, please contact InfoSec@eliassen.com.

About Eliassen Group :

Eliassen Group is a leading strategic consulting company for human-powered solutions. For over 30 years, Eliassen has helped thousands of companies reach further and achieve more with their technology solutions; financial, risk & compliance, and advisory solutions; and clinical solutions. With offices from coast to coast and throughout Europe, Eliassen provides a local community presence, balanced with international reach. Eliassen Group strives to positively impact the lives of its employees, clients, consultants, and the communities in which it operates.

Eliassen Group is an Equal Opportunity / Affirmative Action Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, pregnancy, sexual orientation, gender identity, national origin, age, protected veteran status, or disability status.

Don’t miss out on our referral program! If we hire a candidate you refer to us, you may be eligible for a $1,000 referral check!
