Talent.com
Senior Site Reliability Engineer

Mango, Inc. • Los Angeles, CA, United States
1 day ago
Job type
  • Full-time
Job description

We are seeking a Senior Site Reliability Engineer to own and evolve the infrastructure that supports our on-premise instruments, data systems, and machine learning pipelines. This role combines systems-level engineering with software craftsmanship and requires a deep understanding of how compute, storage, and networking layers interact under real workloads.

You will be the go-to expert for diagnosing performance issues in our on-prem system, from kernel-level I/O bottlenecks to distributed service latency. You will also build robust automation that keeps our systems consistent and observable.

Key Responsibilities
  • Infrastructure Design & Reliability: Design, deploy, and maintain our on-premise and hybrid infrastructure, which includes Dell PowerEdge and PowerVault servers, prosumer NAS units, and high-throughput data processing clusters. Implement fault-tolerant systems with reproducible deployments and clear observability.
  • Performance & Systems Analysis: Investigate complex performance issues across hardware, OS, and software boundaries. You will use Linux tooling in addition to in-house application-level metrics to uncover root causes in filesystems, caching layers, or I/O scheduling.
  • Automation & Tooling: Build automation for system provisioning, configuration management, and software deployment using Python, Go, Ansible, or similar frameworks. Develop lightweight services and tools that make reliability visible and maintainable.
  • Collaboration: Work closely with our software and hardware teams to co-design systems that meet the needs of high-resolution imaging and ML inference workloads. Translate hardware realities into software reliability guarantees.
  • Observability & Incident Response: Develop and maintain monitoring, alerting, and logging systems to ensure early detection of issues. Lead incident response and post-mortem efforts with a focus on learning and prevention.
  • Documentation & Communication: Produce clear documentation, from network topology diagrams to kernel tuning rationales, and communicate findings effectively to the broader team.

General Qualifications
  • Deep understanding of Linux systems and performance (I/O schedulers, RAID, caching, NUMA, kernel parameters).
  • Hands-on experience designing and managing on-premise servers, storage arrays, or HPC clusters.
  • Comfort with automation and software development (Python, Go, Bash, or similar).
  • Strong diagnostic and analytical skills: ability to decompose performance problems across multiple layers.
  • Proven track record of improving system reliability, throughput, and maintainability in a fast-paced environment.
  • Excellent written and verbal communication skills for cross-disciplinary collaboration.
  • Self-driven, curious, and motivated by understanding systems deeply rather than just maintaining them.

Bonus Qualities (Not Required)
  • 5-10 years of relevant industry experience in systems engineering, SRE, or infrastructure software roles.
  • Experience tuning Linux filesystems (ext4, btrfs) and software RAID (mdadm).
  • Familiarity with containerization and orchestration (Docker, Compose, Kubernetes).
  • Knowledge of networking fundamentals (VLANs, bonding, LACP, 10 GbE / 40 GbE).
  • Experience supporting data-heavy scientific or ML workloads.
  • Demonstrated technical leadership: mentoring others in debugging, reliability, or performance analysis.

Related jobs
Senior Site Reliability Engineer (SRE)

StubHub • Los Angeles, CA, United States
Full-time
StubHub is on a mission to redefine the live event experience on a global scale. Whether someone is looking to attend their first event or their hundredth, we're here to delight them all the way fro...
Last updated: 2 days ago • Promoted
Site Reliability Engineer

Diverse Lynx • Los Angeles, CA, United States
Full-time
Must Have Technical / Functional Skills. Experience in Cloud platforms (AWS, Azure, Google Cloud) and hybrid environments. Proficiency in container technologies (Docker, Container, Podman). Strong knowl...
Last updated: 30+ days ago • Promoted
Lead Site Reliability Engineer - Federal Team in Los Angeles

Energy Jobline ZR • Los Angeles, CA, United States
Full-time
Energy Jobline is the largest and fastest growing global Energy Job Board and Energy Hub. We have an audience reach of over 7 million energy professionals, 400,000+ monthly advertised global energy ...
Last updated: 2 days ago • Promoted
Senior Site Reliability Engineer / Los Angeles, CA / Hybrid

Motion Recruitment • Los Angeles, CA, United States
Full-time
A large gaming company is looking for a Senior Site Reliability Engineer to come join their team based in Los Angeles! This person will be a part of a team of Site Reliability Engineers that leverag...
Last updated: 2 days ago • Promoted
Lead Site Reliability Engineer (SRE)

EPAM Systems Inc • Los Angeles, CA, United States
Full-time
At EPAM, we're not just building software - we're engineering excellence. Lead Site Reliability Engineer (SRE). This role is ideal for someone who thrives in fast-paced financial systems, has a passi...
Last updated: 2 days ago • Promoted
Senior Site Reliability Engineer

K2 Space • Los Angeles, CA, United States
Permanent
K2 Space is building large, high-powered spacecraft for the next generation of space development. Backed by Lightspeed Venture Partners, Altimeter Capital, and many others ($200M raised to date), we...
Last updated: 30+ days ago • Promoted
Lead Site Reliability Engineer

Disqo • Los Angeles, CA, United States
Full-time
DISQO's mission is to build the world's most trusted ad measurement platform that fuels brand growth. The world's largest brands, agencies, and media companies trust DISQO for expert insight and AI-...
Last updated: 30+ days ago • Promoted
Site Reliability Engineer in Los Angeles

Energy Jobline ZR • Los Angeles, CA, United States
Full-time
Energy Jobline is the largest and fastest growing global Energy Job Board and Energy Hub. We have an audience reach of over 7 million energy professionals, 400,000+ monthly advertised global energy ...
Last updated: 2 days ago • Promoted
Site Reliability Engineer II

AEG • Los Angeles, CA, United States
Full-time
In order to be considered for this role, after clicking "Apply Now" above and being redirected, you must fully complete the application process on the follow-up screen. AXS connects fans with the ar...
Last updated: 2 days ago • Promoted
Senior Site Reliability Engineer

anduril • Costa Mesa, CA, United States
Full-time
Senior Site Reliability Engineer. Anduril Industries is a defense technology company with a mission to transform U.By bringing the expertise, technology, and business model of the 21st century's mos...
Last updated: 1 day ago • Promoted
Senior Site Reliability Engineer (Remote)

Experian • Costa Mesa, CA, United States
Remote
Full-time
Experian is a global data and technology company, powering opportunities for people and businesses around the world. We help to redefine lending practices, uncover and prevent fraud, simplify health...
Last updated: 30+ days ago • Promoted
Senior Site Reliability Engineer - Developer, Connected Warfare

Anduril Industries • Costa Mesa, CA, United States
Full-time
Anduril Industries is a defense technology company with a mission to transform U.By bringing the expertise, technology, and business model of the 21st century's most innovative companies to the def...
Last updated: 2 days ago • Promoted
DevOps / Site Reliability Engineer (SRE) US

Channelwill • Pasadena, CA, United States
Full-time
Pasadena, California (Remote or Hybrid). SaaS company based in Pasadena, California, providing innovative post-purchase solutions for eCommerce brands. Our products help merchants improve customer ex...
Last updated: 2 days ago • Promoted
Senior Site Reliability Engineer

Mango • Los Angeles, CA, United States
Full-time
We are seeking a Senior Site Reliability Engineer to own and evolve the infrastructure that supports our on-premise instruments, data systems, and machine learning pipelines. This role combines syst...
Last updated: 14 days ago • Promoted
Site Reliability Engineer, GNC (Falcon)

SpaceX • Inglewood, CA, United States
Full-time
Site Reliability Engineer, GNC (Falcon). SpaceX was founded under the belief that a future where humanity is out exploring the stars is fundamentally more exciting than one where we are not. Today Sp...
Last updated: 1 day ago • Promoted
Site Reliability Engineer (SRE)

Diverse Lynx • Newport Coast, CA, United States
Full-time
DevOps Engineer With Strong Site Reliability Engineering Capabilities. Experienced DevOps Engineers with strong Site Reliability Engineering (SRE) capabilities who can work independently, think crit...
Last updated: 2 days ago • Promoted
Lead Site Reliability Engineer - Federal Team

Saviynt • Los Angeles, CA, United States
Full-time
Lead Site Reliability Engineer - Federal Team. Saviynt is an identity authority platform built to power and protect the world at work. In a world of digital transformation, where organizations are fa...
Last updated: 1 day ago • Promoted
Sr. Site Reliability Engineer

Kesta IT • Culver City, CA, United States
Full-time +1
Come build, innovate, disrupt, and thrive! Site Reliability Engineer for an immediate full-time opportunity with our industry leading client. Are you on the lookout for a unique career opportunity t...
Last updated: 2 days ago • Promoted