Summary
Responsibilities:
– You will spend 50% of your time on production support, including handling user tickets, incidents, and problem management.
– You will identify and create automation to eliminate manual day-to-day support activities, and scope and create automation for the deployment, management, and visibility of our services.
– Automate to drive efficiency by designing an autonomous system
– Manage service reliability by managing risk
– Define service level indicators (SLIs), objectives (SLOs), and agreements (SLAs).
– Implement best practices for building successful monitoring and alerting systems
– You will use your expertise to tune our systems and push them beyond their normal limits.
– You will work closely with engineering / development teams to design, build, and maintain systems and help them decide on products to use, schema design and query tuning.
– You will troubleshoot issues across the entire stack: hardware, software, application, and network.
– You will mentor other SREs on standard methodology for everything from monitoring to troubleshooting complex code and database issues.
– Represent the SRE organization in design reviews and operational readiness exercises for new and existing services.
– Participate in the on-call rotation and in periodic conference calls with specialists in other time zones.
Required Technical Skills:
– Bachelor’s Degree / background in Computer Science
– Experience in software development, with automation-related experience valued in particular. Scripting languages such as Bash, Python, and Ruby, or compiled languages such as C, C#, Java, Scala, and Go, are most relevant, but others are acceptable. One higher-level language is desired.
– Hands-on experience using enterprise tools such as AppDynamics, Grafana, Splunk, and Dynatrace
– Three-tier support experience with databases such as IBM DB2, Sybase, MongoDB, Greenplum, and KDB
– Professional ownership of issues
– Deep understanding of operating-system-level concepts such as processes, memory allocation, and the network stack; an understanding of how applications are affected by them; and the ability to debug the resulting issues.
– Practical experience running large-scale online systems is an advantage.
– Awareness of, and ability to reason about, modern software and systems architectures, including load balancing, queueing, caching, distributed-systems failure modes, microservices, and cloud.
Desired Skills
– Knowledge of the messaging layer: MQ / CPS / XML
– Knowledge of SFTP / Comet
– ServiceNow
– Prior experience in a developer or support role at a large-scale financial firm
Production Support • New York, NY