SRE / MLOps Engineer

SAIC • Virginia, MN, US
3 days ago
Job type
  • Full-time
Job description

Description

We are seeking a versatile SRE/MLOps Engineer with DevSecOps expertise to design, automate, and operate secure, scalable, and repeatable model deployment workflows across the AI/ML Common Services environment. This role bridges infrastructure reliability, CI/CD automation, and model operations, enabling IRS mission teams to move from experimentation to production with confidence.

The engineer will support ML lifecycle operations (Databricks, MLflow, AWS SageMaker/Bedrock) and bring DevSecOps rigor so that compliance, monitoring, and infrastructure-as-code are embedded in every step. By partnering with Infrastructure, Security, and Architecture teams, this role ensures the AAP environment is resilient, automated, and compliance-ready at enterprise scale.

Key Responsibilities

  • Enable secure, scalable, and repeatable deployment workflows for both ML models and supporting infrastructure.
  • Build and maintain runtime environments, service accounts, and orchestration logic for Databricks, MLflow, and AWS AI services.
  • Implement and maintain CI/CD pipelines (Bitbucket, Bamboo, Jenkins, or equivalent) for code, data, and model deployments.
  • Apply DevSecOps practices by integrating security scans, compliance checks, and audit logging into deployment pipelines.
  • Collaborate with Infrastructure DSO and Solutions Architect to integrate Terraform-based IaC for consistent, automated provisioning.
  • Implement observability, alerting, and logging (CloudWatch, Datadog, Prometheus) to monitor both application and ML workloads.
  • Align infrastructure with ML lifecycle needs, including staging, promotion, rollback, retraining, and compliance-aware tracking.
  • Develop automation templates, reusable workflows, and guardrails to accelerate onboarding of mission team models while ensuring security.
  • Contribute to incident response, performance tuning, and reliability engineering across ML and non-ML workloads.

Qualifications

Required Qualifications

  • Bachelor's or master's degree in Computer Science, Data Engineering, or a related technical discipline.
  • 5+ years of experience in Site Reliability Engineering, DevOps, or MLOps with production-grade systems.
  • Must be a U.S. Citizen with the ability to obtain and maintain a Public Trust security clearance.
  • Hands-on experience with Databricks, MLflow, or AWS SageMaker/Bedrock for ML model lifecycle operations.
  • Strong proficiency in Terraform, CI/CD pipelines, and container orchestration (Docker, Kubernetes).
  • Experience implementing security automation (e.g., IaC scanning, container security, SAST/DAST tools) within CI/CD workflows.
  • Solid understanding of observability stacks (logs, metrics, tracing) and operational best practices.

Desired Skills

  • Active IRS clearance highly desired.
  • Experience in federal or regulated environments with security, audit, and compliance requirements (FedRAMP, NIST 800-53).
  • Knowledge of Trustworthy AI monitoring (bias detection, drift monitoring, explainability).
  • Familiarity with Unity Catalog, Delta Lake, and data pipeline orchestration in Databricks.
  • Hands-on experience with Zero Trust security models and secure boundary implementations.
  • Relevant certifications such as:
      • Databricks Certified Machine Learning Professional.
      • AWS DevOps Engineer – Professional.
      • Certified Kubernetes Administrator (CKA).
      • Security+ or equivalent security cert.

Target salary range: $120,001 - $160,000. The estimate displayed represents the typical salary range for this position based on experience and other factors.