Talent.com
Software Engineer Lead - Cloud Engineering
Kumo • Mountain View, CA, United States
30+ days ago
Job type
  • Full-time
Job description

The Cloud Infrastructure team at Kumo is responsible for managing and scaling our Kubernetes-based, cloud-native AI platform across multiple cloud providers. They set service level objectives, optimize resource allocation, enforce security compliance, and drive cost efficiency for the Multi-Cloud Platform.

As a key team member, you will architect and operate a highly scalable, resilient Kubernetes infrastructure to support massive Big Data and AI workloads. You'll design and implement advanced cluster management strategies, scale fleet capacity, optimize workload scheduling, and enhance observability at scale. Your expertise in Kubernetes internals, networking, and performance tuning will be critical in ensuring high availability and seamless scaling.

Joining early, you'll play a pivotal role in shaping platform reliability, automating infrastructure, and enabling ML engineers with efficient commit-to-production automation, continuous provisioning, CI/CD, MLOps, and deployment orchestration and workflows. You'll collaborate with ML scientists, product engineers, and leadership to influence scaling strategies, develop self-service tooling, and drive multi-cloud resilience. Engineers at Kumo take ownership of core system design, building infrastructure that powers the next generation of AI applications.

Key Responsibilities

  • Design, build, and scale Kubernetes-based infrastructure to support Kumo's multi-cloud AI platform, ensuring high availability, resilience, and performance.
  • Architect and optimize large-scale Kubernetes clusters, improving scheduling, networking (CNI), and workload orchestration for production environments.
  • Develop and extend Kubernetes controllers and operators to automate cluster management, lifecycle operations, and scaling strategies.
  • Enhance observability, diagnostics, and monitoring by building tools for real-time cluster health tracking, alerting, and performance tuning.
  • Lead efforts to automate fleet management, optimizing node pools, autoscaling, and multi-cluster deployments across AWS, GCP, and Azure.
  • Define and implement Kubernetes security policies, RBAC models, and best practices to ensure compliance and platform integrity.
  • Collaborate with ML engineers and platform teams to optimize Kubernetes for machine learning workloads, ensuring seamless resource allocation for AI/ML models.
  • Drive commit-to-production automation, cloud connectivity, and deployment orchestration, ensuring seamless application rollouts, zero-downtime upgrades, and global infrastructure reliability.

Required Skills and Experience

  • Kubernetes Mastery: 8-10+ years of experience managing large-scale Kubernetes clusters (EKS, GKE, AKS, or open source) in production. Deep expertise in Kubernetes internals, including controllers, operators, scheduling, networking (CNI), and security policies.
  • Cloud-Native Infrastructure: 8-10+ years of experience building cloud-native Kubernetes-based infrastructure across AWS, Azure, and GCP.
  • Platform Engineering: 8-10+ years of experience building Kubernetes service meshes (Istio/Envoy, Traefik), networking policies (Calico/Tigera), and distributed ingress/egress control.
  • Fleet Management & Scaling: Proven experience in optimizing, scaling, and maintaining Kubernetes clusters across multi-cloud environments, ensuring high availability and performance.
  • Software Development: 8-10+ years of experience writing production-grade controllers and operators in Python, Go, or Rust to extend Kubernetes functionality.
  • Infrastructure-as-Code & Automation: Hands-on experience with Terraform, CloudFormation, Ansible, Bash, and Make scripting to automate Kubernetes cluster provisioning and management.
  • Distributed Systems & SaaS: Expertise in building and operating large-scale distributed systems for cloud-native B2B SaaS applications running on Kubernetes.
  • Cloud Application Deployment: Deep expertise in building container orchestration, workload scheduling, and runtime optimizations using Kubernetes, Argo, or Flux.
  • Education: BS/MS in Computer Science or a related field (PhD preferred)

Nice to Have

  • Proficiency with cloud platforms such as AWS, GCP, or Azure.
  • Familiarity with chaos engineering tools and practices for testing system resilience.
  • Strong understanding of security best practices and compliance standards (GDPR, SOC2, ISO27001, vulnerability assessments, GRC, risk management).
  • Contributions to open-source projects, particularly in the Kubernetes or cloud-native ecosystem.
  • Expertise in Docker, Kubernetes, Jenkins, Flux, Argo, and Terraform in a Linux environment.
  • Hands-on experience with monitoring and observability tools such as Prometheus and Grafana.
  • Ability to develop customer-facing web frontends or public APIs/SDKs for platform services.

Benefits

  • Competitive salary and equity options.
  • Comprehensive medical and dental insurance.
  • An inclusive, diverse work environment where all employees are valued and supported.

$175,000 - $250,000 a year

We are an equal opportunity employer and value diversity at our company. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.
