Job Description
Title : Platform Operations Engineer
Location : New York City
Position : Full Time Employment
Core Responsibilities :
- Manage all Digital Infrastructure backend systems and services currently residing on AWS, including user access, network connectivity, Linux / Windows systems, databases, and application management.
- Deploy updates and patches to servers as well as connected client systems in off-hours maintenance windows.
- Identify, troubleshoot and resolve both server and client issues by analyzing logs from all digital infrastructure components.
- Set up and continue to improve monitoring / alerting metrics for all supported platforms.
- Proactively review key operating metrics and system status to ensure all systems are running under recommended operational conditions.
- Participate in designing and implementing mechanisms for redundancy, failover, and disaster recovery.
- Develop tools and scripts to automate routine tasks.
- Collaborate with NOC, DevOps, and Engineering teams to harden, streamline, and document operating processes.
- Work closely with Head of Digital Infrastructure to improve operability, supportability, usability, and visibility of the digital infrastructure.
- Assist in continuous improvement of operational processes for better utilization of underlying cloud resources.
Requirements :
- At least 5 years of direct working experience in operating production digital infrastructure, with strong scripting and system administration skills for both Linux and Windows operating systems.
- At least 3 years of AWS administration experience, including but not limited to OpsWorks, VPC, EC2 / ECS, S3, RDS, IAM, ES, and EMR services.
- Working knowledge of the Advanced Message Queuing Protocol (AMQP) and the Extensible Messaging and Presence Protocol (XMPP).
- Working knowledge of modern system operating tools for monitoring and centralized logging.
- Experience with automation and configuration management using Chef and Ansible.
- Ability to use a variety of open source technologies and integrate them with cloud services.
- Experience in managing PostgreSQL, MySQL, MS SQL, and NoSQL clusters.
- Working knowledge of securing data and ensuring operating redundancy in a cloud environment.
- Ability to evaluate system and application logs, error messages, and stack traces to quickly identify and solve production problems.
- Understanding of best practices and data center operations in an always-up, always-available setup.
- Ability to create and maintain up-to-date infrastructure documentation covering systems, networks, databases, and their interactions.
- Ability to adhere to established operations procedures and policies.
- Ability to create clear step-by-step knowledge base documents for the NOC to follow when resolving known issues.
- Participation in 24x7 on-call rotations.
- Bachelor’s degree in a relevant field.