Talent.com
Foundation Model DevOps Engineer
Institute of Foundation Models • Sunnyvale, CA, US
Applications are no longer being accepted
7 days ago
Job type
  • Full-time
Job Description
About the Institute of Foundation Models
We are a dedicated research lab for building, understanding, using, and risk-managing foundation models. Our mandate is to advance research, nurture the next generation of AI builders, and drive transformative contributions to a knowledge-driven economy.

As part of our team, you’ll have the opportunity to work on the core of cutting-edge foundation model training, alongside world-class researchers, data scientists, and engineers, tackling the most fundamental and impactful challenges in AI development. You will participate in the development of groundbreaking AI solutions that have the potential to reshape entire industries. Strategic and innovative problem-solving skills will be instrumental in establishing MBZUAI as a global hub for high-performance computing in deep learning, driving impactful discoveries that inspire the next generation of AI pioneers.




The Role
We are seeking a Foundation Model DevOps Engineer focused on Operational Stability to serve as the backbone of our AI research infrastructure.
You will design the friction-free environment in which our models are built. Your mandate is to build the tooling, release pipelines, and storage policies that remove drag on our research team. You will own the "foundational layer", ensuring that our researchers have immediate, secure, and reliable access to the tools, data, and compute they need.

Key Responsibilities

Model Release Engineering
· High-Fidelity Release Management: You own the standard of our public presence. You ensure that every release (weights, code, training logs, data) is reproducible, meticulously documented, and packaged with the polish of a top-tier open-source product.
· CI/CD for Research: Design and implement pipelines that automate the testing and packaging of complex model releases, moving us away from manual handovers to automated verification.
· Repo Administration: Administer the organization’s GitHub Enterprise account, ensuring branch protection and clean versioning practices are enforced across the lab.

Resource Management & Infrastructure Efficiency
· Compute Governance: Manage the efficiency of our large-scale GPU resources. You track utilization to identify idle nodes, "zombie jobs," or inefficient scheduling, ensuring we extract maximum value from our compute clusters.
· Storage Strategy & Hygiene: Manage the lifecycle of petabyte-scale datasets and checkpoint storage. You implement intelligent aging policies to solve the "disk full" bottleneck without risking critical data loss.
· Quota & Access Logic: Proactively manage storage and compute quotas across research teams to prevent resource contention before it blocks a training run.

Research Tooling & Orchestration
· Experiment Management Systems: Build and maintain the internal CLI tools and dashboards that allow researchers to launch, track, and organize jobs across thousands of GPUs.
· Resource Telemetry: Set up real-time monitoring for interconnect throughput, GPU memory, and file system latency to catch performance degradation instantly.
· Job Orchestration: Work closely with infrastructure teams to optimize how we run synthetic data pipelines and large-scale evaluations, ensuring our tooling scales with our compute.

Research Environment Provisioning
· Automated Workspace Setup: Build the scripts and tooling that instantly provision compute environments, permissions, and storage namespaces for researchers (automating away the manual work).
· Cluster Access Architecture: Streamline SSH and node access protocols to ensure friction-free entry to our massive-scale compute clusters while maintaining security boundaries.

Academic Qualifications
A bachelor’s degree in Computer Science, Information Technology, or a related field, or equivalent practical experience.

Professional Experience - Minimum (The Bar)
· 3+ years of experience in DevOps, Release Engineering, or ML engineering (MLE), specifically within AI/ML or HPC environments.
· Foundation Model Fluency: You understand the lifecycle of training large models (LLMs or diffusion models). You know what a checkpoint is, you understand the difference between pre-training and inference, and you are familiar with the artifacts required for a model release.
· Linux/Unix Fluency: You live in the command line. You have deep expertise in bash scripting, file system permissions, and SSH configuration.
· Version Control Admin: Expert-level administration of GitHub Enterprise (managing teams, API limits, and repository security).
· Scripting & Automation: Proficiency in Python or Bash to automate repetitive administrative tasks.

Professional Experience - Preferred (The Fit)
· "Gold Standard" Open Source: Experience contributing to or managing high-profile open-source releases (Hugging Face libraries, model families, datasets).
· HPC Schedulers: Deep understanding of Slurm job scheduling and troubleshooting.
· Cloud Storage: Familiarity with cloud storage buckets (e.g., Amazon S3, Google Cloud Storage) and efficient data transfer tools.

Benefits Include
· Comprehensive medical, dental, and vision benefits
· Bonus
· 401K Plan
· Generous paid time off, sick leave and holidays
· Paid Parental Leave
· Employee Assistance Program
· Life insurance and disability

Visa Sponsorship
This position is eligible for visa sponsorship.


