MLOps Engineer

Arrayo • Boston, MA, United States

4 days ago

Job type

Full-time

Job description

MLOps Engineer (Training Scalability & Workflow Optimization)

We are seeking an MLOps Engineer to lead the scaling of machine learning training pipelines and ensure the robustness and efficiency of our end-to-end ML workflows. This role focuses on using Flyte, Kubernetes (GPU optimization), Docker, and distributed training frameworks such as Ray to optimize and streamline our ML infrastructure.

Responsibilities

Workflow Orchestration: Develop and maintain ML workflows using Flyte to manage complex pipelines for training, testing, and deployment.
Training Scalability: Architect and scale ML training systems on GPU-backed Kubernetes clusters, including auto-scaling and performance tuning for multi-node/multi-GPU workloads.
Distributed Computing: Implement distributed model training pipelines using frameworks like Ray for parallelization and resource efficiency.
Containerization: Design, build, and optimize Docker images for ML workloads with a focus on reproducibility and security.
Resource Optimization: Debug and optimize GPU utilization, memory, and compute bottlenecks during training and inference.
Monitoring & Maintenance: Integrate monitoring for ML jobs, track resource consumption, and enforce cost-efficient resource utilization.
Collaboration: Work closely with data scientists and ML engineers to productize and scale ML experiments.

Qualifications

Strong proficiency with Kubernetes (GPU scheduling, Helm, cluster autoscaling).

Hands-on experience with Flyte or similar workflow orchestration tools (Airflow, Prefect).

Deep knowledge of distributed ML training (e.g., PyTorch DDP, Ray, Horovod).

Expertise in Docker and container lifecycle management.

Solid understanding of the GPU hardware/software stack (CUDA, NCCL).

Familiarity with CI/CD for ML (MLOps pipelines using tools like GitHub Actions, ArgoCD).

Bonus: Familiarity with observability tools for ML systems (Prometheus, Grafana).
