Location: San Jose, CA
Job Description: AI/ML Model Deployment Specialist / Production MLOps Engineer
Overview
We are seeking a highly skilled and specialized AI/ML Model Deployment Specialist / Production MLOps Engineer. The core focus of this role is to take existing machine learning models and optimize their deployment and serving infrastructure for high-performance, production-ready inference, with a strong emphasis on state-of-the-art model serving technologies.
Responsibilities
Model-Specific Optimization:
Analyze and understand the underlying logic and dependencies of various AI/ML models (primarily PyTorch and TensorFlow) to identify bottlenecks in the inference pipeline.
High-Performance Serving Implementation:
Design, implement, and manage high-performance inference serving solutions using specialized inference servers (e.g., vLLM) to achieve low latency and high throughput (a minimal sketch follows this list).
GPU Utilization Optimization:
Optimize model serving configurations specifically for GPU hardware to maximize resource efficiency and performance metrics in a production environment.
Containerization for Deployment:
Create minimal, secure, and production-ready Docker images for streamlined deployment of optimized models and inference servers across various environments.
Collaboration:
Work closely with core engineering and data science teams to ensure a smooth transition from model development to high-scale production deployment.
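For illustration, a minimal sketch of the serving work described above is shown below, using vLLM's offline engine API. The model name and sampling values are assumptions for the example, not project specifics.

    # Minimal sketch, assuming vLLM is installed and a GPU is available.
    # Model name and sampling values are illustrative assumptions.
    from vllm import LLM, SamplingParams

    # The engine handles continuous batching and PagedAttention internally;
    # callers simply submit prompts.
    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # hypothetical model choice

    params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=128)
    prompts = ["Summarize the benefits of continuous batching in one sentence."]

    for output in llm.generate(prompts, params):
        print(output.outputs[0].text)

In production, the same engine is typically exposed through vLLM's OpenAI-compatible HTTP server rather than called in-process.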
Required Skillsets
AI/ML Domain Expertise
Deep understanding of the AI/ML domain; the core effort of this role centers on model performance and serving rather than general infrastructure.
ML Frameworks
Expertise in PyTorch and TensorFlow:
Proven ability to work with and troubleshoot model-specific dependencies, logic, and graph structures within these major frameworks.
Inference Optimization
Production Inference Experience:
Expertise in designing and implementing high-throughput, low-latency model serving solutions.
Specialized Inference Servers:
Mandatory experience with high-performance inference servers, specifically vLLM or similar dedicated LLM serving frameworks.
GPU Optimization:
Demonstrated ability to optimize model serving parameters and infrastructure to maximize performance on NVIDIA or equivalent GPU hardware (see the sketch below).
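As a concrete, purely illustrative example of this kind of tuning, the sketch below shows vLLM engine arguments that are commonly adjusted against the available GPU hardware; every value is an assumption for the example, not a recommended setting.

    # Minimal sketch, assuming a two-GPU node; all values are illustrative.
    from vllm import LLM

    llm = LLM(
        model="meta-llama/Llama-3.1-8B-Instruct",  # hypothetical model
        tensor_parallel_size=2,       # shard weights across two GPUs
        gpu_memory_utilization=0.90,  # fraction of VRAM handed to the engine/KV cache
        max_num_seqs=256,             # cap on concurrently batched sequences
        dtype="bfloat16",             # reduced precision for higher throughput
    )

Settings like these trade memory headroom against batch size and latency, which is exactly the tuning loop this requirement describes.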
Deployment and Infrastructure
Containerization (Docker):
Proficiency in creating minimal, secure, and efficient Docker images for model and server deployment (see the Dockerfile sketch after this list).
Infrastructure Knowledge (Helpful but Secondary):
General knowledge of cloud platforms (AWS, GCP, Azure) and Kubernetes/orchestration is beneficial, but the primary focus remains on model serving and optimization.
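To illustrate the containerization requirement, a minimal Dockerfile sketch for a vLLM server follows. The base image tag, model name, and port are assumptions; a real image would pin versions and likely bake in or mount model weights.

    # Minimal sketch; base image, model, and port are illustrative assumptions.
    FROM nvidia/cuda:12.1.1-runtime-ubuntu22.04

    # Keep the image small: runtime-only CUDA base, no build toolchain, no pip cache.
    RUN apt-get update && apt-get install -y --no-install-recommends python3-pip \
        && rm -rf /var/lib/apt/lists/*
    RUN pip3 install --no-cache-dir vllm

    # Run as an unprivileged user to reduce the attack surface.
    RUN useradd --create-home serving
    USER serving

    EXPOSE 8000
    ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server", \
                "--model", "meta-llama/Llama-3.1-8B-Instruct", "--port", "8000"]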
Qualifications
Bachelor's or Master's degree in Computer Science, Engineering, or a related field, or equivalent practical experience.
8+ years of experience in MLOps, Machine Learning Engineering, or a specialized Inference Optimization role.
A portfolio or project experience demonstrating successful deployment of high-performance, containerized ML models to production scale.