Talent.com
Foundation Model DevOps Engineer
No longer accepting applications

Institute of Foundation Models • Sunnyvale, CA, US
30+ days ago
Job type
  • Full-time
Job description

About the Institute of Foundation Models
We are a dedicated research lab for building, understanding, using, and risk-managing foundation models. Our mandate is to advance research, nurture the next generation of AI builders, and drive transformative contributions to a knowledge-driven economy.

As part of our team, you’ll have the opportunity to work on the core of cutting-edge foundation model training, alongside world-class researchers, data scientists, and engineers, tackling the most fundamental and impactful challenges in AI development. You will participate in the development of groundbreaking AI solutions that have the potential to reshape entire industries. Strategic and innovative problem-solving skills will be instrumental in establishing MBZUAI as a global hub for high-performance computing in deep learning, driving impactful discoveries that inspire the next generation of AI pioneers.




The Role
We are seeking a Foundation Model DevOps Engineer focused on Operational Stability to serve as the backbone of our AI research infrastructure.
You will be designing the friction-free environment that allows our models to be built. Your mandate is to build the tooling, release pipelines, and storage policies that remove drag on our research team. You will own the "foundational layer", ensuring that our researchers have immediate, secure, and reliable access to the tools, data, and compute they need.

Key Responsibilities

Model Release Engineering
· High-Fidelity Release Management: You own the standard of our public presence. You ensure that every release (weights, code, training logs, data) is reproducible, meticulously documented, and packaged with the polish of a top-tier open-source product.
· CI/CD for Research: Design and implement pipelines that automate the testing and packaging of complex model releases, moving us away from manual handovers to automated verification.
· Repo Administration: Administer the organization’s GitHub Enterprise account, ensuring branch protection and clean versioning practices are enforced across the lab.

Resource Management & Infrastructure Efficiency
· Compute Governance: Manage the efficiency of our large-scale GPU resources. You track utilization to identify idle nodes, "zombie jobs," or inefficient scheduling, ensuring we extract maximum value from our compute clusters.
· Storage Strategy & Hygiene: Manage the lifecycle of petabyte-scale datasets and checkpoint storage. You implement intelligent aging policies to solve the "disk full" bottleneck without risking critical data loss.
· Quota & Access Logic: Proactively manage storage and compute quotas across research teams to prevent resource contention before it blocks a training run.

Research Tooling & Orchestration
· Experiment Management Systems: Build and maintain the internal CLI tools and dashboards that allow researchers to launch, track, and organize jobs across thousands of GPUs.
· Resource Telemetry: Set up real-time monitoring for interconnect throughput, GPU memory, and file system latency to catch performance degradation instantly.
· Job Orchestration: Work closely with infrastructure teams to optimize how we run synthetic data pipelines and large-scale evaluations, ensuring our tooling scales with our compute.

Research Environment Provisioning
· Automated Workspace Setup: Build the scripts and tooling that instantly provision compute environments, permissions, and storage namespaces for researchers (automating away the manual work).
· Cluster Access Architecture: Streamline SSH and node access protocols to ensure friction-free entry to our massive-scale compute clusters while maintaining security boundaries.

Academic Qualifications
A bachelor’s degree in Computer Science, Information Technology, or a related field, or equivalent practical experience.

Professional Experience - Minimum (The Bar)
· 3+ years of experience in DevOps, Release Engineering, or MLE, specifically within AI/ML or HPC environments.
· Foundation Model Fluency: You understand the lifecycle of training large models (LLMs or Diffusion). You know what a checkpoint is, you understand the difference between pre-training and inference, and you are familiar with the artifacts required for a model release.
· Linux/Unix Fluency: You live in the command line. You have deep expertise in bash scripting, file system permissions, and SSH configuration.
· Version Control Admin: Expert-level administration of GitHub Enterprise (managing teams, API limits, and repository security).
· Scripting & Automation: Proficiency in Python or Bash to automate repetitive administrative tasks.

Professional Experience - Preferred (The Fit)
· "Gold Standard" Open Source: Experience contributing to or managing high-profile open-source releases (Hugging Face libraries, model families, datasets).
· HPC Schedulers: Deep understanding of Slurm job scheduling and troubleshooting.
· Cloud Storage: Familiarity with cloud storage buckets (S3/GCS) and efficient data transfer tools.

Benefits Include
· Comprehensive medical, dental, and vision benefits
· Bonus
· 401K Plan
· Generous paid time off, sick leave and holidays
· Paid Parental Leave
· Employee Assistance Program
· Life insurance and disability

Visa Sponsorship
This position is eligible for visa sponsorship.


