Talent.com
Lead Site Reliability Engineer
No longer accepting applications
DISQO • Los Angeles, CA, US
15 days ago
Job type
  • Full-time
Job description

DISQO’s mission is to build the world’s most trusted ad measurement platform that fuels brand growth. The world’s largest brands, agencies, and media companies trust DISQO for expert insight and AI-driven intelligence about their advertising performance across all platforms. We capture people’s sentiments and journeys, connecting them with the brands they value and the media they consume. With this identity-based approach, brands gain more accurate and authentic insight so they can create more meaningful interactions.

When you join DISQO Nation, you join a community that values trust, transparency and innovation. We invest in our employees and apply a bottom-up management approach, rooted in the concept of servant leadership. We approach each day eager to learn, grow, and make a lasting impact. Best of all, we have fun while doing it!

About the Role :

We are seeking an experienced Lead Site Reliability Engineer to join our engineering team and drive the reliability, scalability, and performance of our production systems through innovative use of AI and automation. In this role, you will lead SRE initiatives, mentor team members, and leverage AI technologies to enhance operational excellence, predictive maintenance, and intelligent automation across our infrastructure.

Key Responsibilities :

Technical Leadership :

  • Design and implement comprehensive monitoring, alerting, and observability solutions, leveraging AI for intelligent anomaly detection and root cause analysis
  • Lead incident response efforts using AI-assisted diagnostics and automated remediation, conduct post-mortems, and drive systemic improvements
  • Develop and maintain service level objectives (SLOs) and error budgets with AI-powered predictive analytics to forecast reliability risks
  • Architect and implement intelligent automation solutions for deployment, scaling, and infrastructure management using machine learning models
  • Drive capacity planning and performance optimization using AI forecasting models and predictive analytics

AI-Enhanced SRE Leadership :

  • Implement and maintain AI-powered incident prediction and prevention systems
  • Design intelligent alerting systems that reduce noise and provide contextual insights using natural language processing and machine learning
  • Develop AI-driven capacity planning models that predict resource needs and optimize cost efficiency
  • Build and maintain chatbots and AI assistants for operational tasks, documentation search, and incident triage
  • Implement automated root cause analysis using AI correlation engines and log analysis

Team Leadership & Collaboration :

  • Mentor junior SREs on integrating AI tools and practices into traditional SRE workflows
  • Partner with engineering teams to embed AI-enhanced reliability principles into the software development lifecycle
  • Lead cross-functional initiatives to implement AI-driven operational improvements
  • Collaborate with data science teams to develop custom AI models for operational use cases
  • Participate in on-call rotations while developing AI systems to minimize toil and improve response efficiency

Strategic Initiatives :

  • Develop and execute an SRE roadmap aligned with business objectives and technological advancement
  • Evaluate and implement new AI tools and technologies to improve system reliability, security and operational efficiency
  • Drive adoption of AI-powered engineering and predictive failure testing
  • Establish metrics and reporting using AI analytics to demonstrate the business value of intelligent reliability investments

Required Qualifications :

  • 6+ years of experience in Site Reliability Engineering, DevOps, or similar infrastructure-focused roles
  • 2+ years of experience leading technical teams or initiatives
  • Strong experience with AI / ML tools and frameworks applied to operational use cases (anomaly detection, predictive analytics, NLP)
  • Hands-on experience implementing AI-powered monitoring, alerting, and automation solutions
  • Strong programming skills in Python with experience in AI / ML libraries
  • Extensive experience with cloud platforms (AWS, GCP) and their AI / ML services
  • Knowledge of prompt engineering, LLM integration, and building AI-powered operational tools
  • Proficiency with infrastructure as code and configuration management with AI-enhanced workflows
  • Experience with time series analysis, statistical modeling, and predictive analytics for infrastructure metrics
  • Deep understanding of monitoring and observability tools enhanced with AI capabilities
  • Experience with CI / CD pipelines incorporating AI-driven quality gates and automated decision making
  • Strong knowledge of networking, distributed systems, and database technologies
  • Expert-level knowledge in the following domains : AWS (core services, networking, compute, databases, storage, etc.), Terraform, Kubernetes / Karpenter / Helm
  • Strong experience building in-house observability platforms, including : OpenTelemetry, Loki, Grafana, Prometheus, AWS CloudWatch, AWS X-Ray or Jaeger
  • Experience with Argo CD / Argo Workflows is a big plus
  • Bachelor’s degree in Computer Science, Engineering, Data Science, or equivalent practical experience

Preferred Qualifications :

  • Advanced experience with large language models (LLMs) for operational documentation, code generation, and incident response
  • Experience with automated incident response systems using AI decision engines
  • Experience with microservices architecture and intelligent service mesh management
  • Familiarity with AI-powered security tools and anomaly detection for infrastructure protection
  • Experience building and maintaining AI-driven dashboards and reporting systems
  • Experience with AI-powered cost optimization and resource right-sizing tools
  • Certification in relevant cloud platforms

This is a structured hybrid role based out of our Glendale, CA office. Your pay will be determined by your experience, work location, and other applicable factors.

At DISQO, we pride ourselves on having a positive, performance-oriented workplace that includes a flexible hybrid approach, competitive medical benefits, and an amazing vacation policy. Read more about our culture on Glassdoor.

You can learn more about what’s happening at DISQO by visiting the DISQO Developer Blog or the DISQO Company Blog.

Perks & Benefits :

  • 100% covered Medical / Dental / Vision for employee, competitive dependent coverage
  • Equity
  • 401K
  • Generous PTO policy
  • Flexible workplace policy
  • Team offsites, social events & happy hours
  • Life Insurance
  • Health FSA
  • Commuter FSA (for hybrid employees)
  • Catered lunch and fully stocked kitchen
  • Paid Maternity / Paternity leave
  • Disability Insurance
  • Travel Assistance Program
  • 24 / 7 Counseling Services offered to Employees

Note : The benefits noted above are for full-time, US-based employees only.

DISQO is an equal opportunity employer. Discovery, innovation, and growth are possible when we open ourselves to new possibilities, perspectives, and approaches. That’s why, at DISQO, we welcome, support, and empower individuals from diverse backgrounds. Exceptional teams are rooted in extraordinary people, each with a unique story and a compelling set of skills. DISQO does not discriminate against employees based on race, color, religion, sex, national origin, gender identity or expression, age, disability, pregnancy (including childbirth, breastfeeding, or related medical condition), genetic information, protected military or veteran status, sexual orientation, or any other characteristic protected by applicable federal, state or local laws.

  • Recruiting firms that submit resumes to DISQO without first entering into a written contract will not be entitled to any compensation on candidates referred by that firm.
  • We may use artificial intelligence (AI) tools to support parts of the hiring process, such as reviewing applications, analyzing resumes, or assessing responses. These tools assist our recruitment team but do not replace human judgment. Final hiring decisions are ultimately made by humans. If you would like more information about how your data is processed, please contact us.
