Overview
ML Ops Engineer — Agentic AI Lab (Founding Team) at Fabrion
Location: San Francisco Bay Area
Type: Full-Time
Compensation: Competitive salary + meaningful equity (founding tier)
Backed by 8VC, we're building a world-class team to tackle one of the industry's most critical infrastructure problems. Our AI Lab is pioneering the future of intelligent infrastructure through open-source LLMs, agent-native pipelines, retrieval-augmented generation (RAG), and knowledge-graph-grounded models. We're hiring an ML Ops Engineer to be the glue between ML research and production systems, responsible for automating the model training, deployment, versioning, and observability pipelines that power our agents and AI data fabric.
You’ll work across compute orchestration, GPU infrastructure, fine-tuned model lifecycle management, model governance, and security.
Responsibilities
- Build and maintain secure, scalable, and automated pipelines for:
- LLM fine-tuning, SFT, LoRA, RLHF, DPO training
- RAG embedding pipelines with dynamic updates
- Model conversion, quantization, and inference rollout
- Manage hybrid compute infrastructure (cloud, on-prem, GPU clusters) for training and inference workloads using Kubernetes, Ray, and Terraform
- Containerize models and agents using Docker, with reproducible builds and CI/CD via GitHub Actions or ArgoCD
- Implement and enforce model governance: versioning, metadata, lineage, reproducibility, and evaluation capture
- Create and manage evaluation and benchmarking frameworks (e.g. OpenLLM-Evals, RAGAS, LangSmith)
- Integrate with security and access control layers (OPA, ABAC, Keycloak) to enforce model policies per tenant
- Instrument observability for model latency, token usage, performance metrics, error tracing, and drift detection
- Support deployment of agentic apps with LangGraph, LangChain, and custom inference backends (e.g. vLLM, TGI, Triton)
Qualifications
Model Infrastructure
- 4+ years in MLOps, ML platform engineering, or infra-focused ML roles
- Deep familiarity with model lifecycle management tools: MLflow, Weights & Biases, DVC, HuggingFace Hub
- Experience with large model deployments (open-source LLMs preferred): LLaMA, Mistral, Falcon, Mixtral
- Comfortable with tuning libraries (HuggingFace Trainer, DeepSpeed, FSDP, QLoRA)
- Familiarity with inference serving: vLLM, TGI, Ray Serve, Triton Inference Server

Automation + Infra
- Proficient with Terraform, Helm, Kubernetes, and container orchestration
- Experience with ML CI/CD (e.g. GitHub Actions + model checkpoints)
- Managed hybrid workloads across GPU clouds (Lambda, Modal, HuggingFace Inference, SageMaker)
- Familiar with cost optimization (spot instance scaling, batch prioritization, model sharding)

Agent + Data Pipeline Support
- Familiarity with LangChain, LangGraph, LlamaIndex, or similar RAG / agent orchestration tools
- Built embedding pipelines for multi-source documents (PDF, JSON, CSV, HTML)
- Integrated with vector databases (Weaviate, Qdrant, FAISS, Chroma)

Security & Governance
- Implemented model-level RBAC, usage tracking, and audit trails; API rate limits, tenant billing, and SLA observability
- Experience with policy-as-code systems (OPA, Rego) and access layers

Preferred Tech Stack
LLM Ops: HuggingFace, DeepSpeed, MLflow, Weights & Biases, DVC
Infra: Kubernetes (GKE / EKS), Ray, Terraform, Helm, GitHub Actions, ArgoCD
Serving: vLLM, TGI, Triton, Ray Serve
Pipelines: Prefect, Airflow, Dagster
Monitoring: Prometheus, Grafana, OpenTelemetry, LangSmith
Security: OPA (Rego), Keycloak, Vault
Languages: Python (primary), Bash; Rust or Go for tooling (optional)

Mindset & Culture Fit
- Builder's mindset with startup autonomy: you automate what slows you down
- Obsessive about reproducibility, observability, and traceability
- Comfortable with a hybrid team of AI researchers, DevOps, and backend engineers
- Interested in aligning ML systems to product delivery, not just papers
- Bonus: experience with SOC2, HIPAA, or GovCloud-grade model operations

What We're Looking For
- 5+ years as a full stack or backend engineer
- Experience owning and delivering production systems end-to-end
- Prior experience with modern frontend frameworks (React, Next.js)
- Familiarity with building APIs, databases, cloud infrastructure, or deployment workflows at scale
- Comfortable working in early-stage startups or autonomous roles; prior founder or founding engineer experience is a big plus

Why This Role Matters
Your work will enable models and agents to be trained, evaluated, deployed, and governed at scale across many tenants, models, and tasks. This is the backbone of a secure, reliable, and scalable AI-native enterprise system.
Seniority level: Mid-Senior level
Employment type: Full-time
Job function: Engineering and Information Technology
Industries: Technology, Information and Internet