Talent.com
No longer accepting applications
ML Ops Engineer — Agentic AI Lab (Founding Team)

Fabrion · San Francisco, CA, United States
2 days ago
Contract type
  • Full-time
Job description

Overview

ML Ops Engineer — Agentic AI Lab (Founding Team) at Fabrion

Location: San Francisco Bay Area

Type: Full-Time

Compensation: Competitive salary + meaningful equity (founding tier)

Backed by 8VC, we’re building a world-class team to tackle one of the industry’s most critical infrastructure problems. Our AI Lab is pioneering the future of intelligent infrastructure through open-source LLMs, agent-native pipelines, retrieval-augmented generation (RAG), and knowledge-graph-grounded models. We’re hiring an ML Ops Engineer to be the glue between ML research and production systems — responsible for automating the model training, deployment, versioning, and observability pipelines that power our agents and AI data fabric.

You’ll work across compute orchestration, GPU infrastructure, fine-tuned model lifecycle management, model governance, and security.

Responsibilities

  • Build and maintain secure, scalable, and automated pipelines for:
      • LLM fine-tuning (SFT, LoRA, RLHF, DPO)
      • RAG embedding pipelines with dynamic updates
      • Model conversion, quantization, and inference rollout
  • Manage hybrid compute infrastructure (cloud, on-prem, GPU clusters) for training and inference workloads using Kubernetes, Ray, and Terraform
  • Containerize models and agents using Docker, with reproducible builds and CI/CD via GitHub Actions or ArgoCD
  • Implement and enforce model governance: versioning, metadata, lineage, reproducibility, and evaluation capture
  • Create and manage evaluation and benchmarking frameworks (e.g. OpenLLM-Evals, RAGAS, LangSmith)
  • Integrate with security and access control layers (OPA, ABAC, Keycloak) to enforce model policies per tenant
  • Instrument observability for model latency, token usage, performance metrics, error tracing, and drift detection
  • Support deployment of agentic apps with LangGraph, LangChain, and custom inference backends (e.g. vLLM, TGI, Triton)

Qualifications

Model Infrastructure

  • 4+ years in MLOps, ML platform engineering, or infra-focused ML roles
  • Deep familiarity with model lifecycle management tools: MLflow, Weights & Biases, DVC, HuggingFace Hub
  • Experience with large model deployments (open-source LLMs preferred): LLaMA, Mistral, Falcon, Mixtral
  • Comfortable with tuning libraries (HuggingFace Trainer, DeepSpeed, FSDP, QLoRA)
  • Familiarity with inference serving : vLLM, TGI, Ray Serve, Triton Inference Server

Automation + Infra

  • Proficient with Terraform, Helm, Kubernetes, and container orchestration
  • Experience with ML CI/CD (e.g. GitHub Actions + model checkpoints)
  • Managed hybrid workloads across GPU cloud providers (Lambda, Modal, HuggingFace Inference, SageMaker)
  • Familiar with cost optimization (spot instance scaling, batch prioritization, model sharding)
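The cost-optimization bullet (spot scaling, batch prioritization) boils down to ordering contended GPU work. A toy sketch, with hypothetical job fields I've invented for illustration: rank queued jobs so higher-priority, cheaper batches run first.

```python
import heapq

def schedule(jobs: list[dict]) -> list[str]:
    """Return job names in run order: lower (priority, gpu_hours) tuples
    pop first, so urgent and cheap work preempts long low-priority sweeps."""
    heap = [(j["priority"], j["gpu_hours"], j["name"]) for j in jobs]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]
```

With a priority-0 production hotfix queued behind two priority-2 batch jobs, the hotfix schedules first and the two batch jobs are tie-broken by estimated GPU hours.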

Agent + Data Pipeline Support

  • Familiarity with LangChain, LangGraph, LlamaIndex, or similar RAG/agent orchestration tools
  • Built embedding pipelines for multi-source documents (PDF, JSON, CSV, HTML)
  • Integrated with vector databases (Weaviate, Qdrant, FAISS, Chroma)
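The embedding-pipeline and vector-database bullets above reduce, at query time, to nearest-neighbor search over embedding vectors. A minimal stdlib sketch of that retrieval step, using tiny hand-written vectors in place of a real embedding model or a store like Weaviate or Qdrant:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query: list[float], index: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the ids of the k chunks most similar to the query vector."""
    ranked = sorted(index, key=lambda doc_id: cosine(query, index[doc_id]), reverse=True)
    return ranked[:k]
```

Production stores replace the linear scan with approximate indexes (HNSW, IVF), but the interface — embed the query, rank stored chunks by similarity, return the top k for the RAG prompt — is the same.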

Security & Governance

  • Implemented model-level RBAC, usage tracking, audit trails; API rate limits, tenant billing, and SLA observability
  • Experience with policy-as-code systems (OPA, Rego) and access layers
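In production this role would express such policies in Rego and evaluate them with OPA; purely as an illustration of the decision shape (the rule format and attribute names here are invented), an ABAC check looks like:

```python
def allow(policy: dict, subject: dict, action: str, resource: dict) -> bool:
    """ABAC-style decision: the action must match a rule, the subject's
    tenant must match the resource's tenant, and every subject-attribute
    constraint in the rule must hold. Default-deny otherwise."""
    for rule in policy["rules"]:
        if action not in rule["actions"]:
            continue
        if subject["tenant"] != resource["tenant"]:
            continue
        if all(subject.get(k) == v for k, v in rule["subject"].items()):
            return True
    return False
```

The default-deny structure and per-tenant attribute matching mirror how per-tenant model policies are typically enforced, whatever the concrete policy engine.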

Preferred Tech Stack

  • LLM Ops: HuggingFace, DeepSpeed, MLflow, Weights & Biases, DVC
  • Infra: Kubernetes (GKE/EKS), Ray, Terraform, Helm, GitHub Actions, ArgoCD
  • Serving: vLLM, TGI, Triton, Ray Serve
  • Pipelines: Prefect, Airflow, Dagster
  • Monitoring: Prometheus, Grafana, OpenTelemetry, LangSmith
  • Security: OPA (Rego), Keycloak, Vault
  • Languages: Python (primary), Bash; Rust or Go for tooling (optional)

Mindset & Culture Fit

  • Builder’s mindset with startup autonomy: you automate what slows you down
  • Obsessive about reproducibility, observability, and traceability
  • Comfortable with a hybrid team of AI researchers, DevOps, and backend engineers
  • Interested in aligning ML systems to product delivery, not just papers
  • Bonus: experience with SOC2, HIPAA, or GovCloud-grade model operations

What We’re Looking For

  • 5+ years as a full stack or backend engineer
  • Experience owning and delivering production systems end-to-end
  • Prior experience with modern frontend frameworks (React, Next.js)
  • Familiarity with building APIs, databases, cloud infrastructure, or deployment workflows at scale
  • Comfortable working in early-stage startups or autonomous roles; prior founder or founding engineer experience is a big plus

Why This Role Matters

    Your work will enable models and agents to be trained, evaluated, deployed, and governed at scale — across many tenants, models, and tasks. This is the backbone of a secure, reliable, and scalable AI-native enterprise system.

Seniority level

  • Mid-Senior level

Employment type

  • Full-time

Job function

  • Engineering and Information Technology

Industries

  • Technology, Information and Internet
