Senior Engineer - AI and HPC Observability

NVIDIA • Santa Clara, CA, United States

1 day ago

Job type

Full-time

Job description

NVIDIA is a pioneer in accelerated computing, known for inventing the GPU and driving breakthroughs in gaming, computer graphics, high-performance computing, and artificial intelligence. Our technology powers everything from generative AI to autonomous systems, and we continue to shape the future of computing through innovation and collaboration. Within this mission, our team, Managed AI Superclusters (MARS), builds and scales the infrastructure, platforms, and tools that enable researchers and engineers to develop the next generation of AI/ML systems. By joining us, you’ll help design solutions that power some of the world’s most advanced computing workloads.

Observability is at the heart of this transformation. We are looking for a Senior AI & HPC Observability Engineer to design and build the next-generation observability platform for large-scale AI workloads, GPU clusters, and high-performance computing environments. This role blends deep technical engineering with large-scale data systems work: developing scalable telemetry pipelines, AI-driven insights, and intelligent monitoring across NVIDIA’s world-class GPU infrastructure.

What You Will Be Doing:

Design and implement full-stack observability systems covering metrics, logs, traces, and events for GPU-powered AI and HPC workloads.

Build large-scale telemetry data pipelines leveraging OpenTelemetry, Kafka, Prometheus, and other distributed systems to ingest, process, and analyze massive data streams.

Develop analytics and anomaly detection frameworks to enable real-time visibility, performance optimization, and predictive insights across multi-tenant environments.

Architect and tune high-throughput data stores (e.g., TSDBs, columnar databases, OLAP systems) for large-scale observability data.

Drive self-service analytics capabilities through APIs, dashboards, and recommendation engines that empower developers and operators with actionable insights.

Collaborate with AI platform, GPU, and cloud infrastructure teams to optimize observability for model training, inference workloads, and HPC performance.

Leverage machine learning and statistical techniques for correlation, anomaly detection, and intelligent alerting.

Contribute to performance tuning, scalability, and reliability of observability services across on-prem and cloud environments.

What We Need To See:

BS or equivalent experience in Computer Science, Computer Engineering, or a related technical field.

8+ years of experience in large-scale observability, data engineering, or performance monitoring systems.

Proven expertise in building and scaling observability stacks (metrics, logs, traces, events) using OpenTelemetry, Prometheus, Grafana, or Thanos.

Deep understanding of data collection, transformation, and storage at scale; experience with streaming frameworks (Kafka, Flink, Spark) preferred.

Hands-on experience with Python, Go, and/or Java for backend development and automation.

Strong knowledge of API design, data modeling, SQL/NoSQL, and data pipeline architecture.

Experience working with PromQL, time-series databases, and large-scale monitoring systems.

Familiarity with AI/ML pipelines, GPU-based workloads, and HPC environments.

Experience with anomaly detection, log analytics, and recommendation systems using ML or statistical techniques.

Excellent problem-solving, debugging, and performance-tuning skills in distributed systems.

Ways To Stand Out from The Crowd:

Proven experience designing and scaling full-stack observability platforms for large-scale AI, GPU, or HPC environments.

Hands-on expertise with OpenTelemetry, Prometheus, Kafka, and distributed data pipelines handling high-volume telemetry streams.

Strong background in data engineering, performance tuning, and time-series data modeling for real-time analytics.

Demonstrated use of machine learning or statistical techniques for anomaly detection, correlation, or intelligent alerting.

Deep understanding of API design, self-service observability, and building platforms that empower internal developers and operators.

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 184,000 USD - 287,500 USD for Level 4, and 224,000 USD - 356,500 USD for Level 5.

You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until October 24, 2025.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.
