Overview
Site Reliability Engineer (Space Communications) at Northwood. Join us to help build and maintain observability infrastructure and ensure that our global space communications network operates reliably as we scale ground stations around the world.
Responsibilities
- Build and maintain the observability stack for metrics and log ingestion across environments, using tools such as Grafana, Prometheus, Loki, Vector, CloudWatch, and VictoriaMetrics
- Support and improve CI/CD pipelines using GitLab and ArgoCD, collaborating with development teams on deployment best practices
- Help build and maintain cloud infrastructure using Terraform on AWS, contributing to the scalability and reliability of space communication systems
- Work with senior engineers to establish monitoring strategies, alerting, and incident response procedures
- Deploy and manage Kubernetes applications using Helm charts, focusing on reliability and developer experience
- Collaborate with engineering teams to implement performance monitoring and troubleshooting across microservices
- Support identity and access management integration with Okta and HashiCorp Vault
- Assist in managing NixOS-based infrastructure for reproducible system configurations
- Participate in incident response efforts and contribute to post-incident reviews and improvements
Basic Qualifications
- 2-4 years of hands-on experience with infrastructure tools and monitoring systems in production environments
- Experience with containerization (Docker, Kubernetes) and basic container orchestration
- Familiarity with CI/CD tools (GitLab, Jenkins, or similar) and infrastructure-as-code concepts
- Experience with cloud platforms (AWS preferred) and basic infrastructure automation
- Programming skills in Python or a similar language, and experience with configuration management
- Startup mentality, with the ability to work in fast-paced, high-growth environments and take on diverse responsibilities
- Experience with logging and metrics collection for production systems
- Understanding of system reliability principles and interest in learning SRE practices
Preferred Qualifications
- Some exposure to observability tools such as Vector, Loki, Grafana, Prometheus, or similar monitoring systems
- Experience with Terraform or other infrastructure-as-code tools
- Familiarity with NixOS or other declarative system configuration approaches
- Basic knowledge of HashiCorp Vault, Okta, or similar identity/secrets management tools
- Interest in distributed systems and troubleshooting complex technical issues
- Previous startup experience or a demonstrated ability to learn quickly and adapt
- Linux system administration experience
- AWS certification or demonstrated cloud platform knowledge
Additional Information
To conform to U.S. Government space technology export regulations, including the International Traffic in Arms Regulations (ITAR), you must be a U.S. citizen, lawful permanent resident of the U.S., protected individual as defined by 8 U.S.C. 1324b(a)(3), or eligible to obtain the required authorizations from the U.S. Department of State.
Northwood is an Equal Opportunity Employer; employment with Northwood is based on merit, competence, and qualifications and will not be influenced in any manner by race, color, religion, gender, national origin/ethnicity, veteran status, disability status, age, sexual orientation, gender identity, marital status, mental or physical disability, or any other legally protected status.