Staff / Principal Site Reliability Engineer

Veza

San Francisco, CA, United States

11 hours ago

Job type

Full-time

Job description

We are seeking an exceptional Staff / Principal Site Reliability Engineer to lead critical infrastructure initiatives and drive innovation across our organization. You’ll architect scalable solutions, navigate complex technical challenges independently, and deliver results under tight deadlines in a fast‑paced environment. You will work cross‑functionally alongside builders who have helped shape the success of companies such as Google, Okta, AWS, and Snowflake.

Strategic Leadership & Technical Execution

Lead enterprise‑wide reliability and infrastructure projects across multiple teams with high autonomy
Navigate ambiguous problem spaces and deliver innovative solutions under tight deadlines
Architect and deploy solutions for Cloud Prem and SaaS customers at scale
Drive technical innovation and establish SRE best practices across the organization
Respond to critical incidents, lead root cause analysis, and implement long‑term resolutions
Develop automation solutions to streamline operations and reduce manual workload
Participate in on‑call rotation and ensure effective incident handoff and documentation

Cross‑Functional Collaboration & Communication

Partner with Engineering, Product, and Customer Success teams to align reliability goals with business objectives

Communicate complex technical concepts effectively to technical and non‑technical audiences, including executives

Influence technical decisions across teams through thought leadership and demonstrated expertise

Build consensus and drive adoption of new tools, processes, and architectural patterns

Customer‑Facing Technical Leadership

Provide Tier 2/3 technical support to enterprise customers for complex troubleshooting

Work directly with customer technical teams to resolve deployment, configuration, and integration challenges

Conduct technical onboarding and provide expert guidance on platform architecture and best practices

Create customer‑facing documentation, troubleshooting guides, and runbooks

Lead customer calls and technical discussions as a trusted advisor

Team Development

Mentor SRE and engineering team members, elevating technical capabilities

Foster a culture of reliability, operational excellence, and continuous improvement

You have: Required Experience

BS degree in Computer Science or related field (or equivalent practical experience)

7+ years in Site Reliability Engineering, DevOps, or Infrastructure Engineering

Proven track record leading large‑scale, cross‑team infrastructure projects from conception to production

Demonstrated ability to work autonomously on ambiguous projects with tight deadlines

Technical Expertise

5+ years with AWS (VPC, EC2, RDS, EKS, CloudFormation) and cloud automation

Expert‑level experience with Kubernetes, Helm, Linux, and Terraform

Strong experience with the GitOps model, distributed version control, and CI/CD pipelines

Proficiency with monitoring tools (Prometheus, Grafana, Datadog)

Strong programming/scripting skills (Python, Go, Bash) for automation

Deep understanding of distributed systems, microservices, and reliability patterns

Experience with Bazel and CueLang a plus

Leadership & Communication

Exceptional ability to articulate complex technical concepts to diverse audiences

Track record of driving technical change across organizational boundaries

Successfully delivered multiple complex projects under tight deadlines

Strong customer service orientation with patience and empathy

Work Style

Thrives in ambiguous environments and makes progress without perfect information

Hands‑on, "can do" attitude with bias for action

Low ego and high intellectual curiosity

Comfortable working across time zones

Self‑motivated with strong ownership mentality

Compensation Disclosure

$184,000–$240,000 USD

Compensation depends on skills, qualifications, experience, and work location. Variable compensation such as commission is not included.

Our Culture

Ownership Mindset

Act with Integrity

Guardians of our Customers

Opinionated Humility

Build Trust, Earn Trust

Veza is proud to be an equal opportunity employer. We are committed to equal employment opportunities regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, or other applicable legally protected characteristics. We also consider qualified applicants according to applicable federal, state, and local laws. If a candidate with a disability requires an accommodation during the recruitment process, please email recruiting@veza.com.
