Talent.com
Reliability Engineer
Prestige Development Group • Ashburn, VA, United States
No longer accepting applications
8 days ago
Job type
  • Full-time
Job description

Job Title: Reliability Engineer

Location: On-Site

Job Type: Full-Time

About Prestige Development Group (PDG)

Prestige Development Group (PDG) specializes in providing innovative human capital management solutions tailored to meet the needs of both private and public sector organizations. We are a certified SBA HUBZone and Economically Disadvantaged Woman-Owned Small Business dedicated to fostering diversity, inclusion, and operational excellence.

Position Summary

We are seeking a skilled Reliability Engineer to support our client's mission by enhancing Production Monitoring and ensuring optimal service delivery for their applications. This role involves proactive issue identification, incident resolution, and system health optimization within a 24x7x365 operational environment. The ideal candidate will lead monitoring solutions, manage ITIL engineers, automate processes, and collaborate across IT and business teams to improve service reliability. Expertise in AWS environments, root cause analysis, and technical troubleshooting is essential, along with strong communication and leadership skills to drive continuous improvement.

Key Responsibilities

  • Proactive and early notification of potential and actual issues impacting service delivery.
  • Frequent and succinct communication to PSPD leadership during and after incidents.
  • Identification of trends and corrective measures.
  • Provide needed metrics to the PSPD leadership team.
  • The enhanced Production Monitoring Services Branch will provide resources to staff the operation 24x7x365. The resources should provide additional technical support and diagnosis.

Customer Facing:

  • Build monitoring and production support solutions that give customers visibility into our services.
  • Manage ITIL engineers.
  • Triage and resolve production incidents related to the cloud platform and participate in root cause analysis and postmortem discussions.
  • Function as a solution manager in support of the Manager, Production Support by leading the implementation of short-term and long-term solutions, automating manual processes, and building alerts to monitor the operation of services.
  • Assess initial severity, gather impacts, create tickets, engage support teams, and escalate issues properly as they arrive.

Optimizes Work Processes:

  • Participate in the creation and maintenance of technical and knowledge base documentation.
  • Troubleshoot production issues and collaborate in developing simple technical solutions.
  • Use diagnostic tools to maintain, troubleshoot and restore standard service or data to systems.
  • Lead implementation of production support activities in an Amazon Web Services environment.
  • Lead technical and design discussions with IT to help enterprises speed their adoption of new technologies and practices.
  • Perform system health monitoring and optimize performance.
  • Define and establish processes and tooling for monitoring and for routine system health checks to keep the application optimized and stable.

Collaborates:

  • Work as a technical leader alongside business, development, and infrastructure teams.
  • Work effectively with IT and business teams, as well as external customers, to lead the resolution of production incidents and provide communication during outages.
  • Collaborate with other members of IT and business in streamlining production support processes.
  • Work closely with other teams to recommend improvements to current production support processes that reflect the business needs, security, and SLAs of our production services.
  • Work closely with the Infrastructure team and other support staff to identify and resolve incidents and to create and implement long-term remediation techniques and fixes.
  • Support and coach other members of the Production Support team.

Communicates Effectively:

  • Communicate clearly and effectively across IT, business process owners, and customers at all levels of the organization.
  • Communicate progress and any challenges to management.
  • Communicate overall status and health of the application to business and application support teams.
  • Active CBP/BI or Top Secret clearance is highly desired. Must be open to working 2nd or 3rd shift in a 24/7/365 environment.

Qualifications

Required:

  • Experience in Production Monitoring & Support within a 24x7x365 operational environment.
  • Strong expertise in incident management, root cause analysis, and problem resolution for cloud-based applications.
  • Hands-on experience with Amazon Web Services (AWS) and cloud-based monitoring tools.
  • Proficiency in ITIL processes and managing ITIL engineers for efficient service delivery.
  • Ability to build and implement monitoring solutions, automate manual processes, and create alerts to ensure system stability.
  • Experience with system health monitoring, performance optimization, and troubleshooting production issues.
  • Strong leadership skills to collaborate with IT, business, and infrastructure teams to improve production support processes.
  • Effective communication skills to provide incident reports and status updates to leadership and stakeholders.
  • Ability to develop and maintain technical documentation and knowledge base resources for production support.
  • Experience in triaging and resolving production incidents, assessing severity, and properly escalating issues.

Equal Employment Opportunity (EEO) Statement

Prestige Development Group (PDG) is an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees. PDG prohibits discrimination and harassment of any kind, including based on race, color, religion, sex, pregnancy, sexual orientation, gender identity, national origin, age, disability, genetic information, or any other protected characteristic as outlined by federal, state, or local laws.

Americans with Disabilities Act (ADA) Statement

PDG is committed to providing reasonable accommodations for individuals with disabilities in our job application and hiring process.

Background Check Policy

Employment is contingent upon the successful completion of a background check. PDG complies with all applicable laws regarding background checks.

How to Apply

Interested candidates are encouraged to submit their resume. Applications will be reviewed on a rolling basis until the position is filled.
