Senior / Staff Site Reliability Engineer, Storage

Fluidstack, San Francisco, CA, United States
30+ days ago
Job type
  • Full-time
Job description

About Fluidstack

Fluidstack is building GPU supercomputers for top AI labs, governments, and enterprises. Our customers include Mistral, Poolside, Black Forest Labs, Meta, and more.

Our team is small, highly motivated, and focused on providing a world-class supercomputing experience. We put our customers first in everything we do, working hard not just to win the sale, but to win repeat business and customer referrals.

We hold ourselves and each other to high standards. We expect you to care deeply about the work you do, the products you build, and the experience our customers have in every interaction with us.

You must work hard, take ownership from inception to delivery, and approach every problem with an open mind and a positive attitude. We value effectiveness, competence, and a growth mindset.

About the Role

Our Senior / Staff Site Reliability Engineer (Storage) is the connective tissue of Fluidstack’s platform. As part of a small, senior team you’ll own the availability, performance and cost-efficiency of our storage, compute and networking layers. You’ll combine software engineering, systems thinking and a relentless customer focus to keep our SLIs and SLOs razor-sharp — and to raise the bar every quarter.

Focus

Automate everything. Replace repetitive ops with Python/Go tooling, Kubernetes operators and GitOps workflows.

Tune the stack. Profile and optimise storage I/O paths, hypervisors and kernel parameters to crush tail latency.

Harden for scale. Design failure-tolerant architectures, run game-days and embed chaos engineering to validate them.

Own incidents. Lead 24×7 on-call rotations, drive blameless post-mortems and turn lessons into lasting fixes.

Partner with engineers. Review designs, instrument new services and evangelise reliability patterns across product teams.

Measure what matters. Define SLIs/SLOs that map directly to customer experience and build dashboards and alerts to track them.

Drive continuous improvement. Identify tech debt, propose roadmap items and mentor engineers on reliability best practice.

About you

10+ years of professional SRE/production-engineering experience, including large-scale architecture and design.

Proficiency in Python, Go or similar; able to write clean, tested, maintainable code.

Deep hands-on knowledge of Docker, Kubernetes, Terraform/Ansible, and modern CI/CD (GitLab, GitHub Actions, etc.).

Expertise in observability stacks (Prometheus, Grafana, OpenTelemetry) and incident-management workflows.

Strong grasp of Linux internals, TCP/IP networking and security best practices.

Excellent written & verbal communication; comfortable leading cross-functional deep-dives.

Benefits

Competitive total compensation package (cash + equity).

Retirement or pension plan, in line with local norms.

Health, dental, and vision insurance.

Generous PTO policy, in line with local norms.
