The Sr. Director of IT Operations & Service Excellence is the strategic and operational leader responsible for the uptime and resiliency of systems across BJ's digital and enterprise technology landscape (applications, infrastructure, and security), providing world-class experiences to our members and team members. The role sets the north star for what good looks like, defining and publishing service-level objectives and indicators (SLOs / SLIs) and operational key results, while building the organizational muscle to deliver them consistently. Reporting to the VP of Infrastructure & Operations, this leader balances real-time incident response with a multi-year service-reliability vision, enabling teams to see the forest through the trees and make data-driven trade-offs.
Key Responsibilities
Strategic Leadership
- Define and execute the multi-year IT Service Excellence maturity roadmap aligned to business objectives, cloud migration plans, and uptime and resiliency requirements.
- Craft a multi-year resiliency and cost-optimization roadmap aligned to company growth goals.
- Implement IT operations best practices.
- Collaborate with and influence product development teams to ensure reliability and scalability are considered at the design phase.
- Partner with Enterprise Architecture to define standards for building reliable applications that are highly available and resilient.
- Define Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for all critical services.
- Foster a high-trust, blameless culture that rewards learning, experimentation, and excellence.
- Own the IT Operations & Service Excellence budget; optimize OpEx through automation, self-service, and vendor management.
IT Operations & Incident Management (24x7 Command Center, NOC & Service Desk)
- Oversee real-time monitoring, incident triage, and major-incident management, ensuring MTTR and communications SLAs are met.
- Maintain a high-performing L1 Service Desk; drive call deflection via knowledge, AI chatbots, and self-service password reset.
- Publish operational metrics (MTTA, MTTR, FCR, abandon rate) with actionable insights.
- Lead the major incident management function, including defining escalation paths, coordinating cross-functional teams, and ensuring timely communication to stakeholders.
- Oversee the entire incident lifecycle, from identification and triage to resolution and post-incident analysis, ensuring efficient and effective processes are in place.
- Manage on-call rotations and ensure 24x7 coverage with major incident managers.
- Ensure a robust playbook is developed and followed during the major incident management (MIM) process, with clearly assigned roles, communication protocols, and a well-defined triage process.
- Matrix-manage people, processes, and resources, including third parties, and resolve conflicts to move toward resolution.
Change & Release Governance
- Chair the Change Advisory Board (CAB); uphold 99%+ change success while accelerating deployment velocity.
- Implement risk-based change classification; ensure thorough end-to-end testing, automated pre-deployment checks, rollback processes, and post-implementation reviews are in place.
Service Reliability Engineering (SRE) & Observability
- Develop and implement SRE policies, standards, and best practices for enterprise-wide systems.
- Lead SRE squads covering AWS, colocation data centers, network / edge, and SaaS platforms.
- Set error budgets, reliability targets, and chaos-engineering practices; ensure recovery time and recovery point objectives (RTO / RPO) meet or exceed DR objectives and business expectations.
- Work with Service Managers overseeing SRE functions for Digital, Membership, Enterprise, and Club & Fuel systems to deliver integrated SRE.
- Drive end-to-end service design (service maps, dependency graphs, support models) to complement observability tooling.
- Lead the roadmap for logging, metrics, tracing, and AIOps platforms, delivering actionable insights and predictive alerting.
Engineering Excellence and Practices
- Understand the potential impact of system requirements and design choices across multiple cloud and on-premises technologies.
- Continuously enhance the reliability, stability, and performance of our key platforms by promoting engineering excellence, implementing best practices, and overseeing the integration of fully automated telemetry within modern DevOps frameworks.
- Advance problem detection and ensure service restoration processes are well defined.
- Use cutting-edge Site Reliability Engineering methods, coupled with automated alerting and self-healing mechanisms, to improve both cloud-based and on-premises systems, fortifying our digital infrastructure's robustness and efficiency.
Process Ownership & Continuous Improvement
- Codify SOPs and RACI matrices across Ops, SRE, Service Desk, and engineering partners to drive clarity of ownership.
- Lead Lean / Kaizen initiatives that reduce toil and amplify engineering productivity.
- Track and report OKRs; course-correct based on data.
- Drive root-cause analysis (RCA) and problem management; close systemic gaps and prevent recurrence of major incidents.
Compliance, Security & Risk
- Partner with Cybersecurity and Compliance teams to meet PCI DSS, SOX, and data-privacy obligations.
- Ensure operational controls withstand internal and external audits.
People Development
- Lead by example, drawing on robust technical expertise, leadership qualities, and a proven track record in Site Reliability Engineering.
- Foster a culture of psychological safety, empowerment, and continuous learning.
- Coach and develop managers; build, mentor, and retain an organization spanning Service Desk, Command Center, SRE, Change Governance, Problem Management, and Analytics.
Required Qualifications
- Bachelor's degree in Computer Science, Engineering, or a related discipline (Master's preferred).
- 15+ years of progressive IT Operations leadership, with 5+ years at a Director / Head level supporting large-scale retail and distributed environments.
- Proven track record of leading teams through complex system outages and scalability challenges.
- 5+ years of proven oversight of 24x7 operations (NOC, Service Desk) and SRE or DevOps functions.
- Proficiency in system design and architecture, particularly in a cloud environment.
- Demonstrated success operating hybrid cloud (AWS) and on-prem data center environments.
- Expertise with ITIL v4 / Service Management frameworks; ITIL certification strongly desired.
- Experience implementing observability, AIOps, and automation platforms (e.g., ServiceNow, OpsRamp, SolarWinds, New Relic, PagerDuty).
- Outstanding communication skills and executive presence; able to brief the C-suite on risk and performance.
Preferred Qualifications
- Retail industry experience managing store, fuel, and distribution center technologies.
- Certifications in ServiceNow.
- Lean Six Sigma or Continuous Improvement accreditation.
Leadership Competencies
- Strategic Thinking / Forest-Through-the-Trees: Articulates long-term vision while executing tactically under pressure.
- Influence & Communication: Excellent verbal and written communication skills, with experience presenting to C-level executives and stakeholders. Translates technical concepts into business outcomes for executives and frontline associates.
- Servant Leadership: Builds inclusive teams and empowers others to experiment and learn.
- Accountability: Holds self and teams to high standards; measures what matters.
- Change Catalyst: Leads through ambiguity, driving adoption of new ways of working.
Work Environment & Travel
Hybrid work model (Westborough, MA HQ) with periodic visits to colocation data centers, distribution centers, and club locations. After-hours or weekend availability required for major incidents or change windows. Occasional travel (