Talent.com
No longer accepting applications
Principal Engineer - AI Infrastructure Abstractions

Diversity Talent Scouts
San Jose, CA, United States
18 days ago
Contract type
  • Full-time
Job description

As a Principal AI Infrastructure Abstraction Engineer, you will design and implement the foundational systems that make shared AI compute environments scalable, secure, and developer-friendly. Your work will focus on creating abstractions that hide hardware complexity while providing predictable, cloud-native interfaces for AI workloads.

This position bridges infrastructure and applied AI, turning raw GPUs and accelerators into programmable, elastic, and multi-tenant resources for both internal developers and enterprise clients.

Key Responsibilities

  • Architect abstractions that map logical compute constructs (vGPUs, GPU pools, workload queues) to physical devices.
  • Build APIs, services, and control planes that expose GPU and accelerator resources with strong isolation and quality-of-service guarantees.
  • Develop mechanisms for secure GPU sharing, including time-slicing, partitioning, and namespace isolation.
  • Work with orchestration and scheduling systems to ensure intelligent mapping of resources based on utilization, priority, and network topology.
  • Define policies for quotas, fair allocation, and resource elasticity in shared environments.
  • Integrate with AI/ML frameworks (PyTorch, TensorFlow, Triton, etc.) to optimize model training and inference workflows.
  • Deliver observability and monitoring capabilities that trace resource usage from logical abstractions to hardware.
  • Partner with platform security teams to strengthen access controls, onboarding processes, and tenant isolation.
  • Support internal developer adoption of abstraction APIs while maintaining high performance and low overhead.
  • Contribute to long-term compute platform strategy with a focus on modularity, abstraction, and scale.

Minimum Qualifications

  • Bachelor's degree with 15+ years of experience, Master's with 12+ years, or PhD with 8+ years.
  • Proven track record building production-grade infrastructure systems, preferably in Go, Python, or C++.
  • Strong experience with containerization and orchestration platforms (Kubernetes, Docker, KubeVirt).
  • Background in designing logical abstractions for compute, storage, or networking in multi-tenant systems.
  • Experience integrating with machine learning platforms (e.g., PyTorch, TensorFlow, Triton, MLflow).

Preferred Qualifications

  • Hands-on experience with GPU sharing, scheduling, or isolation (MIG, MPS, vGPUs, time-slicing, or device plugin models).
  • Deep knowledge of resource management: quotas, prioritization, fairness, elasticity.
  • Strong ability to think across hardware/software boundaries and design abstractions that scale.