DevOps Engineer
ABOUT THIS FEATURED OPPORTUNITY
The DevOps Engineer will join the Channel Sales and Operations team to support the deployment, infrastructure, and scaling of an AI / ML data platform that powers both web applications and machine learning workloads. This role sits at the intersection of cloud infrastructure, Kubernetes, and emerging AI / ML systems, and will play a key role in ensuring reliability, performance, and scalability as ML usage grows.
THE OPPORTUNITY FOR YOU
- Design, deploy, and operate cloud infrastructure for an AI / ML data platform supporting web applications and machine learning workloads at scale.
- Manage Kubernetes environments (EKS preferred), including provisioning resources, installing software, and scheduling CPU- and GPU-based workloads with high memory requirements.
- Support AWS and GCP environments, with a focus on AWS for application infrastructure and GCP for AI / ML workloads.
- Enable AI / ML workflows by supporting model and artifact storage, retrieval, and scaling strategies in partnership with ML teams.
- Diagnose and triage networking, performance, and system interaction issues, clearly articulating testing approaches and remediation plans.
- Implement monitoring and observability solutions (Grafana preferred), build dashboards, understand log flows, and collaborate cross-functionally to resolve production issues.
KEY SUCCESS FACTORS
- Years of DevOps Engineering experience
- Understanding of networking fundamentals (routing, latency, service communication)
- Experience supporting microservices-based architectures at scale
- Experience configuring systems with high memory and specialized compute requirements
- Cloud experience with AWS and / or GCP
- Experience working with Kubernetes clusters, including provisioning and deprovisioning cluster resources, installing and managing platform software, and assigning and scheduling workloads across CPU and GPU nodes
- Experience with monitoring and observability tools (Grafana strongly preferred)
NICE TO HAVES
- MLOps experience, including model storage, versioning, and artifact management