Talent.com
Site Reliability Engineer
No longer accepting applications
LTD Global, LLC • Berkeley, CA, United States
30+ days ago
Job type
  • Full-time
Job description

Position overview:

We are seeking a Site Reliability Engineer to join our Operations Group. This role plays a key part in advancing scientific discovery by supporting high-performance computing (HPC) and data analysis for the organization.

Our center provides essential HPC and data systems to more than 10,000 researchers working in areas such as alternative energy, climate science, energy efficiency, environmental science, and other missions.

As a Site Reliability Engineer, you will be part of a 24/7 operations team that ensures our systems are accessible, reliable, secure, and available to the scientific community. You will work with a state-of-the-art data collection and monitoring system to maintain and optimize performance across complex HPC and data environments.

What You Will Do at Level 2

Work five shifts per week monitoring a large HPC facility, including 2–3 overnight shifts (midnight–8 a.m.) per week.

Split time between on-site and off-site shifts depending on staffing needs.

Review and respond to alerts from computing systems, storage, networks, and other data center/facility systems by triaging or escalating to on-call staff.

Develop solutions to improve processes, prevent recurrence of issues, and automate responses to routine service conditions.

Identify areas for improved monitoring and automation; propose and implement solutions.

Respond to monitoring alerts to ensure continuous 24/7 data collection for real-time diagnoses.

Develop and maintain tools within the monitoring pipeline in collaboration with the Operations Team.

Create software programs to provide alerts and notifications from HPC system APIs into the monitoring pipeline.

Configure software and solve technical issues to ensure programs scale reliably with increasing data volume.

Collaborate with other groups to ensure workflows are understood and maintained.

Assign technical tasks to other monitoring team members as needed.

Coordinate system maintenance activities and manage diagnostic and notification software during outages.

Provide accurate documentation in ticketing systems for outages, updates, and incidents.

Work on and resolve problems of diverse scope where analysis requires evaluation of identifiable factors.

Additional Responsibilities (For Level 3 only)

Provide leadership in developing monitoring and alerting pipelines, documentation, and software.

Contribute to the design and deployment of the monitoring cluster.

Partner with other technical groups to improve monitoring experiences.

Tackle complex problems requiring in-depth evaluation of variable factors.

Determine methods and procedures for new assignments; may coordinate the activities of other team members.

Required Qualifications (Level 2)

Typically requires 5+ years of related experience with a Bachelor’s degree, or 3+ years with a Master’s degree, or equivalent work experience.

Strong hands-on knowledge of Linux shell and command-line environments.

Experience developing tools using languages such as C, C++, Perl, Java, or Python.

Knowledge of IT infrastructure and large data communication networks supporting highly available systems.

Ability to learn and work with data center management technologies (e.g., Kubernetes, Prometheus, alerting/monitoring tools, building management software, cooling/power systems).

Strong communication skills and ability to collaborate across multiple technical teams.

Experience working in a 24/7 operations team managing large data centers or installations.

Knowledge of network security, ACLs, firewalls, and protocols.

Relevant certifications in system administration or related areas.

Required Qualifications (Level 3)

Typically requires 8+ years of related experience with a Bachelor’s degree, or 6+ years with a Master’s degree, or equivalent work experience.

Advanced expertise in one or more programming languages such as C, C++, Perl, Java, or Python.

Demonstrated excellence with monitoring and automation tools.

Experience leading technical projects.

Strong ability to respond proactively to complex issues.

Additional Details

Shift: Includes overnight “Owl” shifts (12 a.m. – 8 a.m.), primarily on-site.

This is a full-time, exempt position (paid monthly).

A background check is required. Convictions are reviewed in relation to job responsibilities and do not automatically disqualify applicants.

This position requires substantial on-site presence, but hybrid schedules may be available depending on business needs. Candidates must reside within 150 miles of the work site.
