Senior Site Reliability Engineer

Corelight
San Francisco, CA, United States
4 days ago
Job type
  • Full-time
Job description

Senior Site Reliability Engineer role at Corelight

We are looking for a Senior Site Reliability Engineer to design, automate, and scale cloud and hybrid platforms that power AI/ML workloads and SaaS services. You'll collaborate with engineering teams to build reliable, secure, and observable infrastructure, manage Kubernetes environments, and enable CI/CD pipelines for continuous delivery of AI models and applications at scale. Your expertise in cloud, DevOps, and MLOps will drive performance, uptime, and innovation across production systems.

Responsibilities

  • Design, deploy, and scale AI/ML/LLM infrastructure across cloud platforms (AWS, Azure, or GCP), ensuring high reliability and performance.
  • Manage and optimize Kubernetes environments (EKS, AKS, GKE) for AI services, data pipelines, and model operations.
  • Build and automate end-to-end data and model pipelines for fine-tuning, inference, and RAG workloads using Terraform, Python, and CI/CD tooling.
  • Use GitOps practices, CI/CD pipelines, and containerization technologies (Docker, Kubernetes) to streamline ML/LLM tasks across the large language model lifecycle.
  • Implement monitoring, observability, and reliability best practices using Prometheus, Grafana, ELK/EFK, Langfuse, and SLI/SLO/SLA frameworks.
  • Lead incident response, performance tuning, and cost optimization across AI infrastructure and production workloads.

Minimum Qualifications

  • Bachelor's or Master's degree in Computer Science, Engineering, or a related field, or equivalent experience.
  • 6+ years in SRE, DevOps, Platform Engineering, MLOps, or Cloud Infrastructure roles.
  • 3+ years building software infrastructure in a distributed systems architecture environment.
  • 3+ years of production experience with Kubernetes (EKS, GKE, AKS) and containerization tools like Docker.
  • Strong programming skills in Python and proficiency in Bash, Go, or PowerShell.
  • Proficiency with Infrastructure-as-Code tools (Terraform, CloudFormation).
  • Experience with Kubernetes Operators, Helm, GitOps (ArgoCD, Flux), or Service Mesh (Istio, Linkerd).
  • Exposure to serverless compute (AWS Lambda, Azure Functions).
  • Experience building or automating data and model pipelines for AI/ML/LLM workloads (e.g., RAG, fine-tuning, inference).
  • Strong understanding of observability and monitoring using Prometheus, Grafana, ELK/EFK, Langfuse, or similar platforms.
  • Familiarity with SLI/SLO/SLA practices, incident response, and reliability engineering in production environments.

Nice to Have

  • Cloud certifications (AWS, Azure, or GCP – e.g., Solutions Architect, DevOps Engineer).
  • Experience with agentic AI frameworks (CrewAI, LangGraph, AutoGen).
  • Experience with vector databases and RAG frameworks (Pinecone, Weaviate, Chroma).
  • Background in hybrid or on-prem AI deployments, including OpenShift or Rancher.
  • Familiarity with configuration management (Ansible, Chef, Puppet).
  • Contributions to open-source AI/ML, DevOps, or platform tooling.
  • Experience with multimodal AI or model observability platforms (RAGAS, AgentOps, Langtrace), distributed tracing, and OpenTelemetry.
  • Knowledge of performance tuning, cost efficiency, or capacity planning for AI/LLM infrastructure.
  • Understanding of security controls and FedRAMP compliance for cloud and various workloads.

Compensation information: The compensation for this position may vary depending on factors such as location, skills, and experience. Equity and additional benefits will also be awarded. Compensation range: $142,000 to $176,000 USD

We are looking forward to connecting with you. For more information, visit www.corelight.com.

