Staff Site Reliability Engineer

Crusoe

San Francisco, California, US

1 day ago

Job type

Full-time

Job description

Crusoe is building the World’s Favorite AI-first Cloud infrastructure company. We’re pioneering vertically integrated, purpose-built AI infrastructure solutions trusted by Fortune 500 companies to power their most advanced AI applications. Crusoe is redefining AI cloud infrastructure, with a mission to align the future of computing with the future of the climate. Our AI platform is recognized as the "gold standard" for reliability and performance. Our data centers are optimized for AI workloads and are powered by clean, renewable energy.

Be part of the AI revolution with sustainable technology at Crusoe. Here, you'll drive meaningful innovation, make a tangible impact, and join a team that’s setting the pace for responsible, transformative cloud infrastructure.

About This Role:

At Crusoe Energy Systems, our Site Reliability Engineering (SRE) team plays a pivotal role in ensuring the reliability and performance of our infrastructure. SRE at Crusoe is dedicated to detecting, analyzing, and preventing issues so that we meet our Service Level Agreements (SLAs), which we track through Service Level Indicators (SLIs) and Service Level Objectives (SLOs). Through automation and proactive remediation, our SREs not only resolve common errors automatically but also advise engineering teams on building resilient code. We prioritize anticipating and resolving issues before they impact our customers, conducting thorough post-mortems, and driving continuous improvement. Our customer-centric approach ensures that clients always have access to the virtual machines they depend on. Join us to help build and maintain the robust systems that power Crusoe's innovative solutions.

A Day in the Life:

As a Site Reliability Engineer at Crusoe Energy Systems, your day begins with a review of overnight alerts and system performance metrics to ensure everything is running smoothly. You will collaborate with your team in a morning stand-up meeting to discuss ongoing projects, recent incidents, and priorities for the day. Your tasks might include automating routine processes, analyzing system logs, and developing tools to enhance our monitoring capabilities. You'll spend part of your day working closely with software engineers, advising on best practices for resilient code and reviewing changes before deployment. Regularly, you will engage in incident response drills, post-mortems, and root cause analysis sessions to learn from past issues and prevent future ones. Throughout the day, you will stay focused on maintaining high SLIs and SLOs, ensuring that our infrastructure remains robust and reliable for our customers. By day's end, you will document your work, share insights with your team, and plan for the next day's challenges, always with a customer-centric mindset.

You Will Thrive In This Role If:

8+ years of professional SRE experience

8+ years of experience contributing to architecture and design (architecture, design patterns, reliability and scaling) of new and current systems

Bachelor's Degree in Computer Science or related field, or 10+ years relevant work experience

Solid understanding of infrastructure design, including the operational trade-offs of various designs

Experience writing high quality code with at least one programming language (Python, Go, or similar)

Experience building with modern infrastructure tools such as Docker, Kubernetes, Ansible, CloudFormation, and Terraform

Experience building with modern CI/CD practices and build systems, such as GitLab CI/CD, CircleCI, and GitHub Actions

Experience with logging, monitoring and alerting systems and tools

Experience with Unix/Linux environments

Experience with TCP/IP and network programming

Experience with information security best practices

Excellent communication skills

Must be able to pass a background check

Embody the Company values

Benefits:

Hybrid work schedule

Industry competitive pay

Restricted Stock Units in a fast-growing, well-funded technology company

Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents

Employer contributions to HSA accounts

Paid Parental Leave

Paid life insurance, short-term and long-term disability

Teladoc

401(k) with a 100% match up to 4% of salary

Generous paid time off and holiday schedule

Cell phone reimbursement

Tuition reimbursement

Subscription to the Calm app

MetLife Legal

Company-paid commuter benefit: $50 per pay period

Compensation Range:

Base salary of up to $250,000. Restricted Stock Units are included in all offers. Compensation will be determined by the applicant’s education, experience, knowledge, skills, and abilities, as well as internal equity and alignment with market data.

Crusoe is an Equal Opportunity Employer. Employment decisions are made without regard to race, color, religion, disability, genetic information, pregnancy, citizenship, marital status, sex/gender, sexual preference/orientation, gender identity, age, veteran status, national origin, or any other status protected by law or regulation.
