Site Reliability Engineer Cloud Platform (Chicago)

Beacon Hill
Chicago, IL, US
22 hours ago
Job type
  • Part-time
Job description

Are you looking to join a dynamic SRE team at a leading organization transitioning to a product-focused model?

As the third SRE on a growing team (expanding to six by year-end), you'll lead the implementation of robust monitoring, reliability, and performance solutions for our frontend portal on Google Cloud Platform (GCP).

Collaborate closely with engineering and business stakeholders to roll off external support, enhance observability, and drive proactive incident management. This is a high-impact role in a motivated team environment, with an ASAP start date to meet critical project timelines.

Location : Buffalo Grove, IL or Hartford, CT (Hybrid : 2-3 days onsite per week)

Employment Type : Contract-to-Hire (90-day contract period)

Compensation : Competitive hourly rate during the contract period, with a high base salary upon conversion

Key Responsibilities :

  • Design and implement comprehensive SRE monitoring solutions for web portals on GCP, including JVM metrics collection and performance tuning for Java applications.
  • Set up logging and tracing standards using Cloud Logging, Cloud Trace, and distributed tracing with W3C Trace Context headers and OpenTelemetry.
  • Configure APIGEE for API monitoring, rate limiting, security, and performance tracking.
  • Develop drill-down dashboards correlating metrics, logs, and traces using GCP tools, Prometheus, and Grafana.
  • Integrate Google Managed Prometheus (GMP) for enhanced metrics and create RED (Rate, Errors, Duration) dashboards for production environments.
  • Implement UI zero-code instrumentation for frontend monitoring, including user session tracking and end-to-end traceability from UI to backend.
  • Build and maintain automation scripts (Python, Bash) for GKE namespaces, CI / CD pipelines, and alerting workflows.
  • Establish performance baselines using 2-4 weeks of historical data and configure alerting policies with escalation procedures.
  • Ensure structured logging (JSON format with trace_id, service.name, etc.) and service health dashboards with error analysis.
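
As an illustration of the structured-logging responsibility above, here is a minimal Python sketch that reads a W3C Trace Context `traceparent` header and emits one JSON log line carrying `trace_id` and `service.name`. Only those two field names come from this posting. The other field names (`severity`, `message`, `span_id`), the helper names, and the default service name `portal-frontend` are assumptions made for the example, not the team's actual standard.

```python
import json
import logging
import re

# W3C Trace Context `traceparent` layout: version-traceid-parentid-flags
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header):
    """Return (trace_id, span_id) from a traceparent header, or None if invalid."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, _flags = match.groups()
    # The spec forbids version "ff" and all-zero trace or parent ids.
    if version == "ff" or set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return trace_id, span_id


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def __init__(self, service_name):
        super().__init__()
        self.service_name = service_name

    def format(self, record):
        payload = {
            "severity": record.levelname,
            "message": record.getMessage(),
            "service.name": self.service_name,
            # Set via the record's extra attributes; None when no trace context exists.
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
        }
        return json.dumps(payload)


def log_line(header, message, service_name="portal-frontend"):
    """Format one INFO record that carries the trace context from `header`.

    `log_line` and the default service name are hypothetical, for illustration.
    """
    attrs = {"msg": message, "levelname": "INFO", "levelno": logging.INFO}
    parsed = parse_traceparent(header)
    if parsed:
        attrs["trace_id"], attrs["span_id"] = parsed
    record = logging.makeLogRecord(attrs)
    return JsonFormatter(service_name).format(record)


if __name__ == "__main__":
    print(log_line("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01", "login ok"))
```

Because every line shares the same `trace_id` as the request's spans, a log backend can join logs to traces on that field, which is what makes drill-down from a dashboard to a single request possible.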

What We're Looking For :

  • 5+ years in SRE / DevOps roles, with hands-on experience in JVM monitoring, APIGEE, GCP observability, Grafana stack, GKE, OpenTelemetry, and UI instrumentation.
  • Core Technical Skills : Proficiency in Python, Linux, Prometheus, Grafana, Kubernetes (GKE), Docker, Loki, and Tempo. Strong Kubernetes expertise, including namespace management, RBAC, and tool deployment via code (YAML, Helm).
  • Observability & Querying : Expertise in PromQL for query writing, metric aggregation, SLO calculations, and alerting; familiarity with InfluxQL.
  • GCP-Specific Skills (Critical) : Google Cloud Monitoring (dashboards, alerting policies), Cloud Logging (centralized logging, log-based metrics), and OpenTelemetry instrumentation / collectors.
  • Logging & Tracing : Experience with Splunk, distributed tracing, log aggregation, correlation IDs, and structured logging standards.
  • API & Infrastructure : APIGEE for API management; CI / CD pipelines and AI-assisted tools (e.g., GitHub Copilot).
  • UI / Frontend Monitoring (Critical) : UI span management, W3C Trace Context for frontend, component-level monitoring, and cross-platform tracing.
  • Top Focus Areas : Grafana dashboard creation and visualization; GCP Metrics Explorer for monitoring / alerting; Loki for log management and troubleshooting; Tempo for distributed tracing and bottleneck identification; Automation & Alerts integration with ServiceNow for incident workflows. (Strong candidates need 2-3 of these.)
  • Excellent problem-solving skills, with a passion for reliability engineering in fast-paced, collaborative settings. W2 eligible; no visa sponsorship.

Why Join Our Client?

Be part of a team integrating SRE practices into product domains, working on cutting-edge projects like OpenTelemetry and APIGEE in GCP. Enjoy hybrid flexibility, rapid team growth, and the chance to influence reliability at scale. Interviews are virtual (1-2 rounds with manager and SRE peers), and we're moving quickly, aiming to fill by end of October!

Beacon Hill is an Equal Opportunity Employer that values the strength diversity brings to the workplace. Individuals with Disabilities and Protected Veterans are encouraged to apply.

Completion of this form is voluntary and will not affect your opportunity for employment, or the terms or conditions of your employment. This form will be used for reporting purposes only and will be kept separate from all other records.

California residents : Qualified applicants with arrest or conviction records will be considered for employment in accordance with the Los Angeles County Fair Chance Ordinance for Employers and the California Fair Chance Act.

How to Apply : Submit your resume highlighting relevant SRE / DevOps experience. One professional reference is required. We review submissions on a rolling basis.
