Senior Site Reliability Engineer

2K • Novato, California, United States

30+ days ago

Job type

Full-time

Job description

#LI-Onsite

On-Call Requirement: Yes (Periodic Rotation)

Who We Are

2K is headquartered in Novato, California and is a wholly owned label of Take-Two Interactive Software, Inc. (NASDAQ: TTWO). Founded in 2005, 2K Games is a global video game company, publishing titles developed by some of the most influential game development studios in the world. Our studios responsible for developing 2K’s portfolio of world-class games across multiple platforms include Visual Concepts, Firaxis, Hangar 13, CatDaddy, Cloud Chamber, 31st Union, HB Studios, and 2K SportsLab. Our portfolio of titles is expanding due to our global strategic plan, building and acquiring exciting studios whose content continues to inspire all of us! 2K publishes titles in today’s most popular gaming genres, including sports, shooters, action, role-playing, strategy, casual, and family entertainment.

Our team of engineers, marketers, artists, writers, data scientists, producers, thinkers and doers are the professional publishing stewards of 2K’s portfolio, which currently includes several AAA, sports and entertainment brands: the global powerhouse NBA® 2K; the renowned BioShock®, Borderlands®, Mafia, Sid Meier’s Civilization® and XCOM® brands; the popular WWE® 2K and WWE® SuperCard franchises; TopSpin 2K25; and the critically and commercially acclaimed PGA TOUR® 2K.

At 2K, we pride ourselves on creating an inclusive work environment, which means encouraging our teams to Come as You Are and do their best work! We encourage ALL applicants to explore our global positions, even if they don’t meet every requirement for the role. If you're interested in the job and think you have what it takes to work at 2K, we encourage you to apply!

What We Need

We are seeking a Senior Site Reliability Engineer (SRE) with deep expertise in Unix/Linux systems architecture, distributed infrastructure, and automation tooling to help scale and sustain mission-critical platforms that serve millions of active users worldwide. You’ll play a leading role in building resilient, high-performance services for live gaming environments, balancing system stability, scalability, and operational velocity.

As part of our SRE team, you’ll work across a complex technology stack spanning AWS, GCP, and hybrid on-prem environments. You’ll be responsible for building auto-scaling, self-healing Unix-based systems, optimizing OS internals, and integrating authentication across enterprise identity systems. You’ll lead the design of high-availability architecture, implement disaster recovery, apply advanced performance tuning across kernel, network, and filesystem layers, and define and enforce observability standards using Datadog, Grafana, and open-source telemetry tools. Your efforts will power real-time insights, automated alerting, and rapid incident detection and resolution. As a senior member of the on-call rotation, you’ll handle critical outages, lead post-mortems, and design long-term preventative solutions.

Automation is foundational to this role. You’ll build and maintain infrastructure-as-code (IaC) with tools like Terraform, Puppet, and Ansible, orchestrating deployments, configurations, and updates across heterogeneous environments. You’ll extend platform APIs and backend tooling using Python and shell scripts, driving continuous improvement in platform delivery.

Collaboration is key: you’ll partner with backend and gameplay engineers to embed reliability into every layer of the tech stack. You’ll contribute to shared reliability standards, CI/CD integration pipelines, provisioning templates, and internal documentation. As a mentor, you’ll share your expertise in debugging, system architecture, and tooling best practices, empowering engineers across disciplines to build complex, resilient systems.

What You’ll Do

Systems Design, Scaling & Resilience

Design and operate distributed Unix-based systems (Red Hat, Ubuntu, Debian, CentOS).
Implement auto-scaling and self-healing infrastructure to ensure uptime and durability.
Tune system internals including kernel parameters, networking, and filesystems for high performance.
Maintain timely OS patching and compliance posture across environments.
Integrate systems with enterprise identity services such as Active Directory, LDAP, and Kerberos.

Automation & Infrastructure as Code

Build and maintain infrastructure automation using Terraform, Puppet, and Ansible.

Automate deployment pipelines, service configurations, and patch management.

Develop scripts and services in Python and Bash/shell to enhance infrastructure delivery workflows.

Extend APIs and platform automation to drive efficiency and repeatability.

Observability, Monitoring & Incident Response

Develop observability stacks using Datadog, Prometheus, Grafana, and open-source telemetry tools.

Create dashboards and SLO/SLI-based alerts for real-time monitoring of production systems.

Participate in a global 24/7 on-call rotation, leading response for high-severity incidents.

Conduct post-incident analysis (RCA) and drive remediations that improve long-term reliability.

Multi-Cloud & Hybrid Platform Engineering

Manage workloads across AWS, GCP, and on-prem infrastructure.

Design and implement multi-region failover, load balancing, and disaster recovery strategies.

Work with both VM-based platforms (including vSphere/VMware) and containerized/Kubernetes platforms.

Support backup, restore, and DR tooling with strict availability targets.

Collaboration, Standards & Enablement

Partner with development teams to embed reliability in deployment pipelines.

Help define system architecture standards and maintain robust platform documentation.

Mentor engineers in Unix performance, observability, and debugging practices.

Champion a culture of automation, resilience, and continuous improvement.

What Will Make You A Great Fit

7+ years in SRE, Infrastructure, or Systems Engineering roles managing production services.

Deep expertise with Unix/Linux systems including Red Hat, Debian, Ubuntu, and CentOS.

Experience in kernel tuning, performance profiling, and debugging complex system issues.

6+ years working in AWS and/or GCP with large-scale, distributed applications.

Advanced skills in Python, Shell scripting, and optionally Go or Ruby.

Strong grasp of IaC tools like Terraform, Ansible, and Puppet.

Experience running hybrid infrastructure (cloud/on-prem) with VMware, containers, and Kubernetes.

Hands-on experience with monitoring, telemetry, and observability stacks.

Additional qualities

Experience supporting live game services or other high-throughput, low-latency platforms.

Contributions to open-source tooling in observability, automation, or infrastructure domains.

Familiarity with telemetry pipelines and related technologies such as ETL, Flink, Kafka, or Kinesis.

Experience with Kubernetes-native tooling and service meshes (e.g., Istio, Linkerd).

Operational knowledge of MySQL/Postgres in cloud-native and bare-metal deployments.

You thrive in collaborative environments that value technical skill and operational excellence. Your passion for high-quality infrastructure empowers development teams and enhances productivity.

As an equal opportunity employer, we are committed to ensuring that qualified individuals with disabilities are provided reasonable accommodation to participate in the job application or interview process, to perform their essential job functions, and to receive other benefits and privileges of employment. Please contact us if you need reasonable accommodation.

Please note that 2K Games and its studios never use instant messaging apps or personal email accounts to contact prospective employees or conduct interviews and, when emailing, use only 2K.com accounts.

The pay range for this position in California at the start of employment is expected to be between $98,400 and $145,620 per year. However, base pay offered is based on market location, and may vary further depending on individualized factors for job candidates, such as job-related knowledge, skills, experience, and other objective business considerations. Subject to those same considerations, the total compensation package for this position may also include other elements, including a bonus and/or equity awards and eligibility to participate in our 401(k) plan and Employee Stock Purchase Program. Regular, full-time employees are also eligible for a range of benefits at the Company, including: medical, dental, vision, and basic life insurance coverage; 14 paid holidays per calendar year; paid vacation time per calendar year (ranging from 15 to 25 days) or eligibility to participate in the Company’s discretionary time off program; up to 10 paid sick days per calendar year; paid parental and compassionate leave; wellbeing programs for mental health and other wellness support; family planning support through Maven; commuter benefits; and reimbursements for fitness-related expenses.
