Site Reliability Engineer, GNC (Falcon)
SpaceX was founded under the belief that a future where humanity is out exploring the stars is fundamentally more exciting than one where we are not. Today SpaceX is actively developing the technologies to make this possible, with the ultimate goal of enabling human life on Mars.
SpaceX is looking for a Site Reliability Engineer to operate and scale custom-built, mission-critical products for Guidance, Navigation, and Control (GNC). The GNC team performs trajectory design and vehicle simulation and participates in recurring mission-critical launch operations. This position will work with the GNC team to maintain and improve a set of GNC-focused tools. Examples of these products include Monte Carlo simulations on a high-performance computing (HPC) cluster, automated data analysis systems, continuous integration systems for rocket and simulation software, GNC analysis infrastructure, and vehicle configuration verification tools. The ideal candidate will be flexible, possess broad skills across product operations and software development, and flourish in a fast-paced and challenging environment.
Responsibilities:
- Deploy, upgrade, operate, maintain, and scale a suite of mission-critical GNC products and services
- Provision and maintain virtual and physical servers
- Work with SpaceX HPC team to monitor and maintain a 4000+ thread HPC cluster
- Closely collaborate with GNC software engineers to create highly operable and maintainable products
- Add monitoring for web apps and respond to outages
- Manage the underlying computational infrastructure of GNC in collaboration with IT
- Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation, and refinement
- Make recommendations for future hardware purchases
- Practice sustainable incident response and postmortems
- Provide end-user support to GNC engineers by becoming an expert on the analysis applications, helping users troubleshoot issues and pointing them to relevant features
- Configure automated deployment pipelines for web apps
- Develop or improve GNC web apps and tools for better usability, maintainability, and robustness
- Demo and document new software changes such as operating system upgrades, shared filesystem changes, or major tool rollouts
- Identify performance bottlenecks and apply performance improvement techniques
Basic Qualifications:
- Bachelor's degree in computer science, information systems/IT, engineering, math, or a scientific discipline and 2+ years of software development experience, OR 4+ years of professional experience building software with site reliability or DevOps in lieu of a degree
- Experience with Linux operating systems
- Experience with Python and Python-based development frameworks
Preferred Skills and Experience:
- 2+ years of systems administration, site reliability engineering, or DevOps experience
- 2+ years of experience with Python and Python-based development frameworks
- 2+ years of Linux experience
- Expertise with Docker, Vagrant, and Kubernetes or similar technologies
- Extensive experience with configuration management tools such as Ansible, Puppet, or Terraform
- Experience with build systems (Make, Bazel/Pants/Buck, Gradle) and package management tools (pip, npm)
- Strong understanding of virtualization and hypervisor technologies
- Understanding of databases and data modeling
- Experience automatically managing dozens or hundreds of servers
- Strong knowledge of TCP/IP networking
- Experience scaling web applications and optimizing applications for performance
- Professional experience with standard front-end technologies such as modern HTML, CSS, and JavaScript (we use AngularJS, Polymer, Backbone.js, React, and more), REST, and JSON
- Solid understanding of UI/UX design to build intuitive applications
- Experience with high-performance computing systems or large-scale data analysis systems
- Must be comfortable working with mission-critical and sensitive systems, with a sense of urgency appropriate to the responsibilities
Additional Requirements:
- Willing to work extended hours and weekends when needed to meet critical deadlines
Compensation and Benefits:
Pay Range:
Site Reliability Engineer/Level I: $120,000.00 - $145,000.00 per year
Site Reliability Engineer/Level II: $140,000.00 - $170,000.00 per year
Your actual level and base salary will be determined on a case-by-case basis and may vary based on the following considerations: job-related knowledge and skills, education, and experience.
Base salary is just one part of your total rewards package at SpaceX. You may also be eligible for long-term incentives, in the form of company stock, stock options, or long-term cash awards, as well as potential discretionary bonuses and the ability to purchase additional stock at a discount through an Employee Stock Purchase Plan. You will also receive access to comprehensive medical, vision, and dental coverage, access to a 401(k) retirement plan, short- and long-term disability insurance, life insurance, paid parental leave, and various other discounts and perks. You may also accrue 3 weeks of paid vacation and will be eligible for 10 or more paid holidays per year. Employees accrue paid sick leave pursuant to Company policy, which satisfies or exceeds the accrual, carryover, and use requirements of the law.