Talent.com
Senior Product Manager - Observability and Resilience

NVIDIA, Santa Clara, CA, US
30+ days ago
Job type
  • Full-time
Job description

Product Manager For Resiliency And Observability

NVIDIA has become the platform upon which every new AI-powered application is built. From healthcare research applications to autonomous vehicles and voice-recognition systems, there is a need to simplify and deliver predictability for AI applications and workflows, and NVIDIA is right in the center of this revolution. Resiliency and observability are key to delivering customer value and an exhilarating customer experience. This product manager will lead the development of foundational tools dedicated to ensuring the resiliency and observability of large-scale accelerated computing platforms. By creating essential tools for system diagnostics, performance monitoring, and automated recovery, they will empower customers to confidently operate both complex AI training and demanding inference workloads with maximum uptime and efficiency.

What You Will Be Doing:

  • Be a subject-matter expert on resiliency and observability. Deeply understand failure modes across the GPU hardware, network, and software stack, along with the telemetry signals that reveal them and how they correlate to workload health and SLOs. Master modern reliability architectures. Keep up to date with industry trends.
  • Build for everyone who wants to use these tools. Drive joint project planning. Define concrete milestones, tasks, and work items for resiliency and observability initiatives with external partners.
  • Fuel innovation in reliability tooling. Lead ideation sessions to propose novel approaches and shape new proof-of-concepts.
  • Bridge development, SRE, and partner teams. Facilitate clear communication, triage emergent issues rapidly, and ensure feedback loops between engineering and customer operations remain tight.
  • Coordinate execution across different functions. Work with engineering, design, operations, sales, and marketing to embed resiliency and observability requirements into every product launch, capacity expansion, and lifecycle transition.

What We Need To See:

  • BS or MS in Computer Science, Computer Engineering, or a related field (or equivalent experience) and 12+ years of product-management experience in enterprise technology.
  • Experience with GPU observability (DCGM, NVML, etc.) and integration into large-scale telemetry systems.
  • Deep knowledge of AI / ML infrastructure, high-performance computing (HPC), networking, and cloud technologies (IaaS, PaaS) including containerization, Kubernetes, and automation tools.
  • Familiarity with modern observability stacks: metrics, logs, traces, OpenTelemetry, Prometheus / Grafana, ELK / OpenSearch.
  • Experience building, and preferably a deep understanding of, secure, compliance-focused telemetry pipelines (SOC2, FedRAMP).
  • Ability to articulate trade-offs among latency, throughput, cost, and reliability to both engineering and executive audiences.
  • Data-driven approach: defines SLIs / SLOs, manages error budgets, and develops value models.
  • Strong cross-functional execution : writes clear specs and PRDs, produces GTM collateral, and leads agile processes.

Ways To Stand Out From The Crowd:

  • Master's / PhD or expertise in distributed systems, performance modeling, or fault-tolerant computing.
  • Experience with MLOps and LLMOps ecosystems and integrating with enterprise platforms; deployments at modern data-center scale; delivered ML / AI observability solutions for LLMOps, predictive incident detection, or anomaly classification.
  • Startup or 0 -> 1 experience building cloud-native observability or resilience tools; proven success bringing open-source observability products to market and shaping GTM strategy.
  • Familiarity with MLOps toolchains and integrations with monitoring platforms such as Splunk, Datadog, and Grafana Cloud.
  • Expertise with containerization technologies like Docker and Kubernetes, plus virtualization. Proficiency in network architecture and high-performance interconnects (InfiniBand, Ethernet, RoCE).

We have some of the most forward-thinking and hardworking people in the world working for us and, due to outstanding growth, our elite engineering teams are growing fast. NVIDIA is widely considered to be one of the industry's most desirable employers. NVIDIA is at the center of Deep Learning, Artificial Intelligence, and Autonomous Vehicles. If you're looking for a challenge, thrive in an ambiguous environment, and share our passion for technology, we want to hear from you. We are looking for great people to help us accelerate the next wave of artificial intelligence.

    Applications for this job will be accepted at least until August 21, 2025. NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.
