Overview
Site Reliability Engineer (SRE) role at Bay Systems Consulting.
- Location: Berkeley, CA (onsite at Lawrence Berkeley National Laboratory)
- Employment Type: 5–6 month contract (extension possible)
- Pay Rate: $80/hr + full benefits (medical, dental, vision, 401k)
- Employer: Bay Systems Consulting
About the Role: Bay Systems Consulting is seeking a Site Reliability Engineer (SRE) to support the National Energy Research Scientific Computing Center (NERSC) at Lawrence Berkeley National Laboratory. NERSC’s mission is to accelerate scientific discovery through high-performance computing and data analysis for the U.S. Department of Energy’s Office of Science. As an SRE in the Operations Group, you will help ensure the accessibility, reliability, security, and availability of world-class HPC systems that support over 10,000 scientific users. You will work with state-of-the-art monitoring systems (such as OMNI), respond to real-time alerts, automate processes, and improve reliability for mission-critical infrastructure.
Responsibilities
- Monitor and support NERSC’s HPC facility as part of a 24x7 operations team (including some overnight “OWL” shifts).
- Respond to alerts from computer systems, storage, networks, and data center infrastructure by triaging issues or engaging on-call staff.
- Develop automation to handle routine service conditions and improve system efficiency.
- Maintain and enhance monitoring tools, pipelines, and alerting systems.
- Create and maintain scripts and software to integrate HPC system APIs into monitoring pipelines.
- Collaborate with cross-functional NERSC groups to coordinate maintenance activities and manage diagnostic software.
- Document and track outages, incidents, and maintenance in the ticketing system.
- Troubleshoot and resolve diverse technical issues involving HPC, networking, and infrastructure.
Qualifications
Required (Level 2):
- Bachelor’s degree in Computer Science, Engineering, or related field (or equivalent work experience).
- 5+ years of related experience (or 3+ years with a Master’s).
- Strong Linux/Unix administration and command-line skills.
- Proficiency with programming/scripting languages (Python, C/C++, Perl, Java, or similar).
- Experience supporting highly available systems in large-scale data centers.
- Familiarity with networking, firewalls, ACLs, and network protocols.
- Knowledge of automation and monitoring tools (e.g., Kubernetes, Prometheus, Alertmanager).
- Strong troubleshooting and communication skills.
Preferred (Level 3):
- 8+ years of relevant experience (or 6+ with a Master’s).
- Expertise in software development and monitoring pipeline design.
- Experience leading technical projects and mentoring junior staff.
- Advanced knowledge of data center management technologies.