Staff Site Reliability Engineer

Crusoe

San Francisco, California, United States

30+ days ago

Job type

Full-time

Job description

Crusoe is building the world’s favorite AI-first cloud infrastructure company. We’re pioneering vertically integrated, purpose-built AI infrastructure solutions trusted by Fortune 500 companies to power their most advanced AI applications. Crusoe is redefining AI cloud infrastructure, with a mission to align the future of computing with the future of the climate. Our AI platform is recognized as the "gold standard" for reliability and performance. Our data centers are optimized for AI workloads and are powered by clean, renewable energy.

Be part of the AI revolution with sustainable technology at Crusoe. Here, you'll drive meaningful innovation, make a tangible impact, and join a team that’s setting the pace for responsible, transformative cloud infrastructure.

About This Role:

At Crusoe Energy Systems, our Site Reliability Engineering (SRE) team plays a pivotal role in ensuring the reliability and performance of our infrastructure. SRE at Crusoe is dedicated to detecting, analyzing, and preventing issues so that we meet our Service Level Agreements (SLAs), which we track through Service Level Indicators (SLIs) and Service Level Objectives (SLOs). Through automation and proactive remediation, our SREs not only resolve common errors automatically but also advise engineering teams on building resilient code. We prioritize anticipating and resolving issues before they impact our customers, conducting thorough post-mortems, and driving continuous improvement. Our customer-centric approach ensures that clients always have access to the virtual machines they depend on. Join us to help build and maintain the robust systems that power Crusoe's innovative solutions.

A Day in the Life:

As a Site Reliability Engineer at Crusoe Energy Systems, your day begins with a review of overnight alerts and system performance metrics to ensure everything is running smoothly. You will collaborate with your team in a morning stand-up meeting to discuss ongoing projects, recent incidents, and priorities for the day. Your tasks might include automating routine processes, analyzing system logs, and developing tools to enhance our monitoring capabilities. You'll spend part of your day working closely with software engineers, advising on best practices for resilient code and reviewing changes before deployment. You will regularly take part in incident response drills, post-mortems, and root cause analysis sessions to learn from past issues and prevent future ones. Throughout the day, you will stay focused on meeting our SLOs and keeping our SLIs healthy, ensuring that our infrastructure remains robust and reliable for our customers. By day's end, you will document your work, share insights with your team, and plan for the next day's challenges, always with a customer-centric mindset.

You Will Thrive In This Role If:

8+ years of professional SRE experience

8+ years of experience contributing to architecture and design (architecture, design patterns, reliability and scaling) of new and current systems

Bachelor's Degree in Computer Science or related field, or 10+ years relevant work experience

Solid understanding of infrastructure design, including the operational trade-offs of various designs

Experience writing high-quality code in at least one programming language (Python, Go, or similar)

Experience building with modern infrastructure tools such as Docker, Kubernetes, Ansible, CloudFormation, Terraform

Experience building with modern CI/CD practices and build systems, such as GitLab CI/CD, CircleCI, GitHub Actions

Experience with logging, monitoring and alerting systems and tools

Experience with Unix/Linux environments

Experience with TCP/IP and network programming

Experience with information security best practices

Excellent communication skills

Must be able to pass a background check

Embody the Company values

Benefits:

Hybrid work schedule

Industry-competitive pay

Restricted Stock Units in a fast-growing, well-funded technology company

Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents

Employer contributions to HSA accounts

Paid Parental Leave

Paid life insurance, short-term and long-term disability

Teladoc

401(k) with a 100% match up to 4% of salary

Generous paid time off and holiday schedule

Cell phone reimbursement

Tuition reimbursement

Subscription to the Calm app

MetLife Legal

Company-paid commuter benefit: $50 per pay period

Compensation Range:

Base salary for this role is up to $250,000. Restricted Stock Units are included in all offers. Compensation is determined by the applicant’s education, experience, knowledge, skills, and abilities, as well as internal equity and alignment with market data.

Crusoe is an Equal Opportunity Employer. Employment decisions are made without regard to race, color, religion, disability, genetic information, pregnancy, citizenship, marital status, sex/gender, sexual preference/orientation, gender identity, age, veteran status, national origin, or any other status protected by law or regulation.
