Provide a summary of the job's primary function:
24/7 business continuity:
This role requires availability outside of traditional working hours on a rotating basis to ensure continuous operation of critical AI systems and data pipelines. Responsibilities include monitoring system health, responding to alerts, troubleshooting performance issues, and implementing emergency fixes as needed. The ideal candidate must be able to quickly diagnose and resolve AI system and data pipeline incidents, prioritize issues based on business impact, and coordinate with technical teams to restore service. A strong commitment to system reliability and service continuity is essential for success in this position.
Other duties as required:
This role requires flexibility in performing duties outside of the primary responsibilities to support the evolving AI ecosystem at the university. The ideal candidate must be adaptable and willing to take on additional tasks or projects as required, ensuring consistent and reliable AI and data pipeline operations. This may include assisting with knowledge management, documentation updates, user training, data preparation, or special projects related to AI system improvements. A problem-solving mindset and willingness to tackle emerging challenges are essential for thriving in this dynamic environment.
Hybrid work schedule:
This role is hybrid, with a minimum of three days a week in the office to facilitate collaboration with both technical teams and operations staff. In-office presence enables effective coordination with support teams, direct access to infrastructure, and hands-on troubleshooting of AI systems and data pipelines. It is particularly important for incident response, change management activities, and cross-functional problem-solving sessions that benefit from real-time, in-person communication.
1. Minimum Qualifications
2. Key Responsibilities & Accountabilities
Identify the most important job duties (maximum of 5) using no more than 3-4 concise sentences. Indicate the typical percent of time required for each job duty; the total percent of time must equal 100%. Begin with the most important duty.
Percent of Time
System Monitoring and Incident Management
Monitor AI system and data pipeline health, performance, and availability using established monitoring tools and dashboards. Detect, triage, and resolve incidents affecting AI systems and their data infrastructure, coordinating with technical teams as needed. Implement proactive measures to prevent recurring issues and minimize service disruptions.
35%
Operational Support and Maintenance
Perform routine operational tasks to maintain AI systems and data pipelines, including model updates, data refreshes, pipeline maintenance, and system patches. Implement scheduled maintenance activities with minimal service disruption. Manage user access and permissions for AI platforms according to security policies.
25%
Performance Analysis and Optimization
Analyze AI system and data pipeline performance metrics, identify bottlenecks and inefficiencies, and implement optimizations to improve response times, data flow, accuracy, and resource utilization. Monitor for model drift and data quality issues, coordinating retraining or pipeline adjustments when necessary.
20%
Documentation and Knowledge Management
Create and maintain comprehensive operational documentation, including runbooks, standard operating procedures, and knowledge base articles. Document system configurations, data pipeline dependencies, and recovery procedures to ensure operational continuity.
10%
Continuous Improvement and Automation
Identify opportunities for process improvement and automation in AI operations. Develop and implement scripts and workflows to automate routine tasks, reducing manual effort and minimizing human error. Contribute to the evolution of MLOps practices based on operational experience and emerging best practices.
10%
Hybrid 3 days onsite
IT Support • Boston, MA, United States