Dive in and do the best work of your career at DigitalOcean. Journey alongside a strong community of top talent who are relentless in their drive to build the simplest scalable cloud. If you have a growth mindset, naturally like to think big and bold, and are energized by the fast-paced environment of a true industry disruptor, you’ll find your place here. We value winning together—while learning, having fun, and making a profound difference for the dreamers and builders in the world.
We are looking for a highly skilled Staff Software Engineer to join our Customer Observability/Insights team. In this role, you’ll architect, build, and maintain large-scale distributed systems that power DigitalOcean’s customer-facing observability ecosystem. You’ll collaborate across engineering, product, and design teams to deliver reliable, scalable, and developer-friendly solutions that help our teams monitor, measure, and optimize cloud infrastructure at scale.
What You’ll Be Doing
- Architect, design, develop, and maintain scalable backend services and systems.
- Drive technical initiatives and large cross-team projects from concept to production.
- Collaborate with product managers, UX designers, and engineers across distributed teams to deliver end-to-end solutions.
- Develop deep expertise in observability tools and technologies such as Prometheus, Grafana, time-series databases, and distributed tracing.
- Build and maintain high-performance APIs and microservices using Go (Golang) and gRPC, integrating with systems like Kafka, Redis, and NoSQL databases.
- Work with Terraform and Ansible to automate infrastructure deployment and configuration management.
- Use SQL for data analysis, service integration, and operational insights.
- Lead efforts in debugging, troubleshooting, and performance tuning of complex distributed systems.
- Champion operational excellence by improving reliability, monitoring, and alerting practices.
- Provide technical leadership, mentorship, and guidance to other engineers.
What You’ll Bring to DigitalOcean
- 15+ years of relevant industry experience building and operating large-scale cloud services or distributed systems in a fast-paced, high-growth environment.
- Strong programming experience in Go (Golang) and deep understanding of distributed systems fundamentals.
- Solid understanding of observability, monitoring, and alerting systems (e.g., Prometheus, Grafana).
- Experience working with the OpenTelemetry (OTel) Collector, including instrumentation, data pipelines, and telemetry ingestion for metrics, logs, and traces.
- Proven experience designing and implementing scalable event-driven architectures using Kafka or similar technologies.
- Experience with gRPC for service communication, and with Terraform and Ansible for infrastructure automation.
- Working knowledge of SQL, Redis, and NoSQL databases.
- Demonstrated ability to drive operational excellence and improve system reliability.
- Experience making pragmatic technical trade-offs while balancing short-term needs and long-term goals.
- Excellent communication and collaboration skills, especially with geographically distributed teams.
- Strong ownership mindset and the ability to independently deliver high-impact projects.
Nice to Have
- Experience with cloud-native environments (Kubernetes, Docker, microservices).
- Familiarity with time-series databases and distributed tracing frameworks.
- Prior experience building or maintaining observability platforms.
Compensation Range: