Staff Site Reliability Engineer

Crusoe

San Francisco, CA, United States

1 day ago

Job type

Full-time

Job description

Crusoe is building the World's Favorite AI-first Cloud infrastructure company. We're pioneering vertically integrated, purpose-built AI infrastructure solutions trusted by Fortune 500 companies to power their most advanced AI applications. Crusoe is redefining AI cloud infrastructure, with a mission to align the future of computing with the future of the climate. Our AI platform is recognized as the "gold standard" for reliability and performance. Our data centers are optimized for AI workloads and are powered by clean, renewable energy.

Be part of the AI revolution with sustainable technology at Crusoe. Here, you'll drive meaningful innovation, make a tangible impact, and join a team that's setting the pace for responsible, transformative cloud infrastructure.

About This Role:

At Crusoe Energy Systems, our Site Reliability Engineering (SRE) team plays a pivotal role in ensuring the reliability and performance of our infrastructure. SRE at Crusoe is dedicated to detecting, analyzing, and preventing issues so that we meet our Service Level Agreements (SLAs), tracked through Service Level Indicators (SLIs) and Service Level Objectives (SLOs). Through automation and proactive remediation, our SREs not only resolve common errors automatically but also advise engineering teams on building resilient code. We prioritize anticipating and resolving issues before they impact our customers, conducting thorough post-mortems, and driving continuous improvement. Our customer-centric approach ensures that clients always have access to the virtual machines they depend on. Join us to help build and maintain the robust systems that power Crusoe's innovative solutions.

A Day in the Life:

As a Site Reliability Engineer at Crusoe Energy Systems, your day begins with a review of overnight alerts and system performance metrics to ensure everything is running smoothly. You will collaborate with your team in a morning stand-up meeting to discuss ongoing projects, recent incidents, and priorities for the day. Your tasks might include automating routine processes, analyzing system logs, and developing tools to enhance our monitoring capabilities. You'll spend part of your day working closely with software engineers, advising on best practices for resilient code and reviewing changes before deployment. Regularly, you will engage in incident response drills, post-mortems, and root cause analysis sessions to learn from past issues and prevent future ones. Throughout the day, you will stay focused on maintaining high SLIs and SLOs, ensuring that our infrastructure remains robust and reliable for our customers. By day's end, you will document your work, share insights with your team, and plan for the next day's challenges, always with a customer-centric mindset.

You Will Thrive In This Role If:

8+ years of professional SRE experience

8+ years of experience contributing to the architecture and design (design patterns, reliability, and scaling) of new and existing systems

Bachelor's Degree in Computer Science or related field, or 10+ years relevant work experience

Solid understanding of infrastructure design, including the operational trade-offs of various designs

Experience writing high-quality code in at least one programming language (Python, Go, or similar)

Experience building with modern infrastructure tools such as Docker, Kubernetes, Ansible, CloudFormation, and Terraform

Experience building with modern CI/CD practices and build systems, such as GitLab CI/CD, CircleCI, and GitHub Actions

Experience with logging, monitoring, and alerting systems and tools

Experience with Unix/Linux environments

Experience with TCP/IP and network programming

Experience with information security best practices

Excellent communication skills

Must be able to pass a background check

Embody the Company values

Benefits:

Hybrid work schedule

Industry-competitive pay

Restricted Stock Units in a fast-growing, well-funded technology company

Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents

Employer contributions to HSA accounts

Paid Parental Leave

Paid life insurance, short-term and long-term disability

Teladoc

401(k) with a 100% match up to 4% of salary

Generous paid time off and holiday schedule

Cell phone reimbursement

Tuition reimbursement

Subscription to the Calm app

MetLife Legal

Company-paid commuter benefit: $50 per pay period

Compensation Range:

Base salary will be up to $250,000. Restricted Stock Units are included in all offers. Compensation will be determined by the applicant's education, experience, knowledge, skills, and abilities, as well as internal equity and alignment with market data.

Crusoe is an Equal Opportunity Employer. Employment decisions are made without regard to race, color, religion, disability, genetic information, pregnancy, citizenship, marital status, sex/gender, sexual preference/orientation, gender identity, age, veteran status, national origin, or any other status protected by law or regulation.
