Talent.com
Site Reliability Engineer

Bay Systems Consulting, Berkeley, CA, United States
1 day ago
Job type
  • Temporary
Job description

Overview

Site Reliability Engineer (SRE) role at Bay Systems Consulting.

  • Location: Berkeley, CA (onsite at Lawrence Berkeley National Laboratory)
  • Employment type: 5–6 month contract (extension possible)
  • Pay rate: $80/hr plus full benefits (medical, dental, vision, 401k)

About the Role

Bay Systems Consulting is seeking a Site Reliability Engineer (SRE) to support the National Energy Research Scientific Computing Center (NERSC) at Lawrence Berkeley National Laboratory. NERSC’s mission is to accelerate scientific discovery through high-performance computing and data analysis for the U.S. Department of Energy’s Office of Science. As an SRE in the Operations Group, you will help ensure the accessibility, reliability, security, and availability of world-class HPC systems that support over 10,000 scientific users. You will work with state-of-the-art monitoring systems (such as OMNI), respond to real-time alerts, automate processes, and improve reliability for mission-critical infrastructure.

Responsibilities

  • Monitor and support NERSC’s HPC facility as part of a 24x7 operations team (including some overnight “OWL” shifts).
  • Respond to alerts from computer systems, storage, networks, and data center infrastructure by triaging issues or engaging on-call staff.
  • Develop automation to handle routine service conditions and improve system efficiency.
  • Maintain and enhance monitoring tools, pipelines, and alerting systems.
  • Create and maintain scripts and software to integrate HPC system APIs into monitoring pipelines.
  • Collaborate with cross-functional NERSC groups to coordinate maintenance activities and manage diagnostic software.
  • Document and track outages, incidents, and maintenance in the ticketing system.
  • Troubleshoot and resolve diverse technical issues involving HPC, networking, and infrastructure.

Qualifications

Required (Level 2)

  • Bachelor’s degree in Computer Science, Engineering, or related field (or equivalent work experience).
  • 5+ years of related experience (or 3+ years with a Master’s).
  • Strong Linux/Unix administration and command-line skills.
  • Proficiency with programming/scripting languages (Python, C/C++, Perl, Java, or similar).
  • Experience supporting highly available systems in large-scale data centers.
  • Familiarity with networking, firewalls, ACLs, and network protocols.
  • Knowledge of automation and monitoring tools (e.g., Kubernetes, Prometheus, Alertmanager).
  • Strong troubleshooting and communication skills.

Preferred (Level 3)

  • 8+ years of relevant experience (or 6+ years with a Master’s).
  • Expertise in software development and monitoring pipeline design.
  • Experience leading technical projects and mentoring junior staff.
  • Advanced knowledge of data center management technologies.