Talent.com
Site Reliability Engineer

Bay Systems Consulting Inc., Berkeley, California, United States, 94720
24 days ago
Job type
  • Temporary
Job description

Location: Berkeley, CA (Onsite at Lawrence Berkeley National Laboratory)

Employment Type: 5–6 Month Contract (Extension Possible)

Pay Rate: $80/hr + Full Benefits (Medical, Dental, Vision, 401k)

Employer: Bay Systems Consulting

About the Role

Bay Systems Consulting is seeking a Site Reliability Engineer (SRE) to support the National Energy Research Scientific Computing Center (NERSC) at Lawrence Berkeley National Laboratory. NERSC’s mission is to accelerate scientific discovery through high-performance computing and data analysis for the U.S. Department of Energy’s Office of Science.

As an SRE in the Operations Group, you will help ensure the accessibility, reliability, security, and availability of world-class HPC systems that support over 10,000 scientific users. You will work with state-of-the-art monitoring systems (such as OMNI), responding to real-time alerts, automating processes, and improving reliability for mission-critical infrastructure.

Key Responsibilities

  • Monitor and support NERSC’s HPC facility as part of a 24x7 operations team (including some overnight “OWL” shifts).
  • Respond to alerts from computer systems, storage, networks, and data center infrastructure by triaging issues or engaging on-call staff.
  • Develop automation to handle routine service conditions and improve system efficiency.
  • Maintain and enhance monitoring tools, pipelines, and alerting systems.
  • Create and maintain scripts and software to integrate HPC system APIs into monitoring pipelines.
  • Collaborate with cross-functional NERSC groups to coordinate maintenance activities and manage diagnostic software.
  • Document and track outages, incidents, and maintenance in the ticketing system.
  • Troubleshoot and resolve diverse technical issues involving HPC, networking, and infrastructure.

Qualifications

Required (Level 2):

  • Bachelor’s degree in Computer Science, Engineering, or related field (or equivalent work experience).
  • 5+ years of related experience (or 3+ years with a Master’s).
  • Strong Linux/Unix administration and command-line skills.
  • Proficiency with programming/scripting languages (Python, C/C++, Perl, Java, or similar).
  • Experience supporting highly available systems in large-scale data centers.
  • Familiarity with networking, firewalls, ACLs, and network protocols.
  • Knowledge of automation and monitoring tools (e.g., Kubernetes, Prometheus, Alertmanager).
  • Strong troubleshooting and communication skills.

Preferred (Level 3):

  • 8+ years of relevant experience (or 6+ with a Master’s).
  • Expertise in software development and monitoring pipeline design.
  • Experience leading technical projects and mentoring junior staff.
  • Advanced knowledge of data center management technologies.
