Site Reliability Engineer

Bay Systems Consulting

Berkeley, CA, United States

1 day ago

Job type

Temporary

Job description

Overview

Site Reliability Engineer (SRE) role at Bay Systems Consulting.

Location: Berkeley, CA (Onsite at Lawrence Berkeley National Laboratory)
Employment Type: 5–6 Month Contract (Extension Possible)
Pay Rate: $80/hr + Full Benefits (Medical, Dental, Vision, 401k)
Employer: Bay Systems Consulting

About the Role

Bay Systems Consulting is seeking a Site Reliability Engineer (SRE) to support the National Energy Research Scientific Computing Center (NERSC) at Lawrence Berkeley National Laboratory. NERSC’s mission is to accelerate scientific discovery through high-performance computing and data analysis for the U.S. Department of Energy’s Office of Science. As an SRE in the Operations Group, you will help ensure the accessibility, reliability, security, and availability of world-class HPC systems that support over 10,000 scientific users. You will work with state-of-the-art monitoring systems (such as OMNI), respond to real-time alerts, automate processes, and improve reliability for mission-critical infrastructure.

Responsibilities

Monitor and support NERSC’s HPC facility as part of a 24x7 operations team (including some overnight “OWL” shifts).
Respond to alerts from computer systems, storage, networks, and data center infrastructure by triaging issues or engaging on-call staff.
Develop automation to handle routine service conditions and improve system efficiency.
Maintain and enhance monitoring tools, pipelines, and alerting systems.
Create and maintain scripts and software to integrate HPC system APIs into monitoring pipelines.
Collaborate with cross-functional NERSC groups to coordinate maintenance activities and manage diagnostic software.
Document and track outages, incidents, and maintenance in the ticketing system.
Troubleshoot and resolve diverse technical issues involving HPC, networking, and infrastructure.

Qualifications

Required (Level 2):

Bachelor’s degree in Computer Science, Engineering, or related field (or equivalent work experience).
5+ years of related experience (or 3+ years with a Master’s).
Strong Linux/Unix administration and command-line skills.
Proficiency with programming/scripting languages (Python, C/C++, Perl, Java, or similar).
Experience supporting highly available systems in large-scale data centers.
Familiarity with networking, firewalls, ACLs, and network protocols.
Knowledge of automation and monitoring tools (e.g., Kubernetes, Prometheus, Alertmanager).
Strong troubleshooting and communication skills.

Preferred (Level 3):

8+ years of relevant experience (or 6+ years with a Master’s).
Expertise in software development and monitoring pipeline design.
Experience leading technical projects and mentoring junior staff.
Advanced knowledge of data center management technologies.
