Site Reliability Engineer - Phoenix, AZ - W2 Only!
TEK Connexion
Phoenix, AZ, United States
Full-time
Key responsibilities:
- Monitor infrastructure, servers, middleware, databases, and batch jobs.
- Respond promptly to service requests from business-partner-facing support teams, Operations, Risk / Control partners, etc.
- Troubleshoot environment, data control and operational issues.
- Create and maintain documentation to ensure knowledge accessibility.
- Automate and streamline processes using scripts and scheduling tools.
- Liaise with other application support teams and internal / external business and technical partners.
- Provide ad hoc and on-demand reports.
- Perform timely escalation of critical issues and proactively identify patterns of recurring issues to improve production.
- Lead problem resolution, conduct root cause analysis, and establish processes that help prevent incidents.
- Participate in the Incident and Problem Management processes as a resolver accountable for root cause analysis, resolution, and reporting.
- Ensure that all production changes are processed according to Change Management policies and procedures.
- Ensure that appropriate levels of Quality Assurance have been met for all new and existing products.
- Support Sustained Resiliency, Disaster Recovery, and High Availability events.
- Help the Level 2 operations team set up monitoring and close gaps in the current monitoring setup.
- Play a key part in setting up reporting and in applying the Monitor -> Report -> Improve principle.
- Coordinate incident management to ensure appropriate coverage.
- Facilitate calls, coordination, and communications during critical outages.
- Handle call documentation, queue management, and ticket analysis, and interface with impacted lines of business for incident impact analysis via the Production Assurance process.
- Maintain an end-to-end view of issues for objectivity.
- Influence senior technology leads across organizations to ensure timely resolution of incidents.
- Problem Management:
- Participate in RCA (root cause analysis) activities on client-impacting incidents, and ensure they are executed and that action items are assigned and completed.
- Provide expertise and support during critical incidents, interfacing with all impacted groups to better manage the message.
- Lead and coordinate work on chronic issues.
- Guide all involved staff and vendors toward a coordinated, results-driven approach.
- Hygiene and Capacity Maintenance:
- Be responsible for the data quality of PLM.
- Work diligently to keep all servers up to company standards for uptime, patch level, etc.
- Work on capacity planning for applications: estimate and analyze growth rates of vital infrastructure components and add capacity proactively as required.
- Understand the application code, workflow, and business usage of the application.
- Understand the DB component of the application.
- Understand how seasonality affects critical applications.
- Document known errors and play an important role in knowledge transfer to the Level 1 team.
- Reduce escalations to Level 3 through incremental learning about applications.
Must-have technical skills / experience (ask for alternative / tool / version):
- SRE - Network Engineering & Architecture
- Technical project management
- Deep understanding of networking protocols, security, switching and routing, wireless, VoIP, cloud networking, network management, and monitoring.
- Understanding of SRE concepts and proven experience working on automation or application development using any programming language.
- Solid technical skills, including knowledge of client-server technology, networking basics, and database technology, and an end-to-end understanding of 3-tier application architecture (frontend - application server - database).
19 days ago