Lead Site Reliability Engineer

Bridge Defense, Washington, DC, United States

5 days ago

Job type

Full-time

Job description

About Bridge Defense

Bridge Defense is redefining how modern defense technology is delivered. Based in Washington, D.C., we are built for the dynamic mission environment facing the Department of Defense, the Intelligence Community, and federal law enforcement agencies. We provide full-spectrum national security solutions that combine secure infrastructure, cleared talent, and mission-ready software to meet evolving defense challenges. Our services include secure software development in classified environments and the design and implementation of advanced IT and cybersecurity capabilities ranging from secure cloud architectures and enterprise infrastructure to data center operations, scientific analysis, and cutting‑edge cyber defense.

We are led by technologists and veterans with firsthand mission experience, which enables us to understand both the operational realities and the innovation needed to succeed. Our approach is agile and outcome‑based, delivering results in weeks rather than months whenever possible.

At Bridge Defense, we value people, integrity, and excellence. We foster an environment where innovation thrives in support of traditional mission requirements. Our team members receive competitive compensation, robust benefits, professional development and certification opportunities, and clear paths for growth while working on the nation’s most critical projects.

Core Values:

Innovation & Responsiveness: We push beyond legacy models with efficient, tech‑led solutions built to scale and evolve.
Trusted Performance: Security, compliance, and deep experience in delivering to demanding environments guide all we do.
Mission‑Focused Expertise: From veteran leadership to cleared engineers, our people understand both the technology and the mission.

About the Role

As the Lead Site Reliability Engineer for our ComputeBridge Engagement, you’ll be responsible for the reliability, scalability, and performance of one of the largest hardware and AI infrastructure efforts in the U.S. defense sector. You will lead the deployment, management, and automation of a high‑performance computing mesh across multiple secure environments, ensuring operational excellence and mission continuity for a 9‑figure government program.

This is a hands‑on engineering leadership role that bridges physical infrastructure and modern DevOps automation, ideal for someone who thrives at the intersection of hardware systems, distributed computing, and AI/ML workflows.

What You’ll Do

Lead infrastructure design, deployment, and operations for ComputeBridge hardware clusters across secure and distributed environments

Install and configure physical systems, including high‑density GPU servers, networking gear, and storage arrays

Build and deploy secure Linux images and containerized workloads using OpenShift and other orchestration platforms

Develop and manage automation pipelines for provisioning, configuration management, and monitoring using modern DevOps toolchains (Ansible, Terraform, etc.)

Operate and maintain distributed networking meshes across multiple classified and unclassified domains

Implement and manage out‑of‑band management tools (IPMI, iDRAC, BMC, etc.) for remote troubleshooting and control

Integrate and optimize NVIDIA GPU infrastructure for AI/ML training and inference workloads

Collaborate with mission engineers, software teams, and government operators to ensure system readiness and performance

Provide on‑site technical leadership for deployments, troubleshooting, and continuous improvement

Mentor junior engineers and establish operational best practices across the ComputeBridge program as the contract grows

What You’ll Bring

3+ years of experience in site reliability, systems engineering, or hardware operations roles

Deep expertise with physical infrastructure: server racking, cabling, diagnostics, and troubleshooting

Strong experience with Linux systems administration, imaging, and automated deployment

Hands‑on experience managing large‑scale clusters or distributed systems in OpenShift or Kubernetes environments

Familiarity with DevOps automation (Ansible, Terraform, CI/CD pipelines)

Experience configuring and managing networking and mesh architectures

Direct experience with NVIDIA GPUs, CUDA, and related AI/ML frameworks

Proficiency with out‑of‑band management and IPMI/iDRAC tooling

Certifications: Linux+ and Security+ (required or in progress)

Excellent communication, documentation, and problem‑solving skills

Clearance: Active TS/SCI required, or the ability to obtain one

Bonus Points For

Experience operating in secure DoD or intelligence environments

Familiarity with Palantir platforms or other government data systems

Prior experience supporting AI/ML infrastructure in production or tactical settings

Experience with performance tuning and monitoring of HPC or GPU‑accelerated clusters

General Factors:

Depending on project requirements, you may be required to work a compressed schedule; expect overtime when schedules demand it.

Must be willing to travel if needed.

No Relocation.

Why Bridge Defense

Shape how advanced computing supports national security missions at scale

Lead engineering for a major government program with direct mission impact

Competitive compensation, benefits, and growth opportunities in a mission‑driven environment

Bridge Defense is committed to building a collaborative and mission‑focused team. Bridge Defense reserves the right to modify job duties or requirements at any time. Employment with Bridge Defense is at‑will. Candidates must be eligible to work in the United States and complete any required background checks or security clearance processes as a condition of employment.

