Talent.com
Senior Site Reliability Engineer

Sustainable Talent • Santa Clara, CA, United States
2 days ago
Job type
  • Full-time
Job description

Join the Sustainable Talent team as a Senior Site Reliability Engineer supporting NVIDIA's Infrastructure, Planning, and Process organization. This is a W-2 full-time contract based onsite in Santa Clara, CA. We offer competitive pay of $75-$90/hr, based on factors such as experience, education, and location, and provide full benefits, PTO, and an amazing company culture!

As an SRE, you will troubleshoot and manage our client's on-premises infrastructure to support various software engineering teams company-wide. Keen attention to detail, problem-solving abilities, and a solid knowledge base are essential.

What you'll be doing:

  • Working on systems deployed in NVIDIA's internal cloud, making them available and reliable for our end users.
  • Monitoring system performance and troubleshooting issues related to CPU, memory, disk, and network utilization.
  • Providing high-quality user support.
  • Monitoring KPIs and making sure that the team's SLAs are met.
  • Managing and maintaining production Kubernetes clusters.
  • Driving automation of monitoring to gain more insight into application and system health.
  • Crafting and implementing critical metrics using various analytics methods and dashboards.
  • Using AI techniques to extract useful signals about machines and jobs from the data generated.

What we need to see:

  • Proven SRE experience in L1 support with on-call responsibilities, ideally 5+ years.
  • Proficiency in troubleshooting Linux OS issues, such as SSH and performance problems.
  • Experience troubleshooting networking issues such as DNS and DHCP, and familiarity with networking principles and protocols, including TCP/IP and VLANs.
  • Hands-on experience with monitoring and alerting tools such as Prometheus, Grafana, Elastic, or similar.
  • Strong understanding and practical experience with REST API calls.
  • Proficiency in basic scripting; familiarity with Python or a similar programming language is a plus.
  • Knowledge of Ansible roles and playbooks, Jenkins CI/CD processes, and deployment experience with Kubernetes.
  • Experience with the Kickstart process for automated Linux installations.
  • Experience managing and troubleshooting Linux systems, as well as managing systems in data centers, using tools like BMC (Redfish), KVM, and IPMI.
  • Background in databases such as SQL (MySQL) and time-series databases like Prometheus.
  • Experience with data analytics and visualization tools like Kibana, Grafana, and Splunk.
  • Proficient with source code management and binary repository systems like GitLab, GitHub, Artifactory, and Perforce.
  • Advanced knowledge of standard methodologies related to security.
  • Bachelor's degree in Computer Science, Information Technology, or related field, or equivalent experience.

Ways to stand out from the crowd:

  • Working knowledge of OpenStack.
  • Previous experience managing NVIDIA hardware such as GPUs and Tegras.
  • Prior experience with large-scale operations teams.
  • Experience managing Windows server infrastructure.
  • Outstanding interpersonal skills and ability to communicate effectively with all levels of management.
  • Ability to analyze complex problems, design simple systems that function efficiently with minimal support, and thrive in a multi-tasking environment with evolving priorities.

Sustainable Talent is an M/F+, disabled, and veteran equal employment opportunity and affirmative action employer.
