Site Reliability Engineer
Hybrid Onsite - Irving, TX (3 days/week)
Contract through the end of 2025; will be extended
Responsibilities
- Design and execute performance tests to evaluate responsiveness, scalability, and stability of applications.
- Conduct resiliency testing to validate fault tolerance and recovery strategies.
- Implement and monitor observability tools to track system health and detect issues in real time.
- Perform capacity planning and recommend scaling strategies for peak loads.
- Collaborate with developers and operations teams to optimize Java/Spring Boot microservices, database queries, and infrastructure configurations.
- Configure Kubernetes performance parameters (resource limits, requests, autoscaling policies).
- Implement resiliency patterns such as circuit breakers, bulkheads, retries, rate limiters, and fallback mechanisms.
- Document methodologies and provide training on performance and resiliency best practices.
- Continuously evaluate and improve testing and monitoring processes.
Required Technical Skills
- Programming: Strong experience with Java and Spring Boot for microservices.
- Containerization: Hands-on with Docker; experience deploying and tuning containerized applications.
- Scripting: Proficiency in Python and Bash for automation and test scripting.
- Cloud: Solid experience with Azure (mandatory); familiarity with cloud-native architectures.
- Observability / APM Tools: Splunk, ELK stack, AppDynamics (setup, monitoring, troubleshooting).
- Architecture & Resiliency: Knowledge of design patterns, fault tolerance strategies, and distributed systems.
- Microservices Support: Strong background in supporting and optimizing microservices applications.
- Computer Science Fundamentals: Algorithms, data structures, and architectural design best practices.
Preferred Skills
- Experience with Kubernetes (cluster configuration, autoscaling, resource tuning).
- Understanding of networking concepts (DNS, load balancing, firewalls, VPNs).
- Exposure to CI/CD pipelines and DevOps practices.