Talent.com
Member of Technical Staff, DevOps / Infrastructure Engineering
No longer accepting applications
FirstPrinciples • Oakland, CA, United States
5 days ago
Job type
  • Full-time
Job description

FirstPrinciples is a non-profit organization building an autonomous AI Physicist designed to advance humanity's understanding of the fundamental laws of nature. Our goal for the AI Physicist is to achieve a breakthrough that unifies quantum field theory & general relativity and to explain the deepest unresolved phenomena in our universe by 2035. To do this, we're pioneering a new approach to scientific discovery by creating an intelligent system that can explore theoretical frameworks, reason across disciplines, and generate novel insights. We're a non-profit that operates like a tech start-up by moving quickly and continuously iterating to accelerate scientific progress. By combining AI, symbolic reasoning, and autonomous research capabilities, we're developing a platform that goes beyond analyzing existing knowledge to actively contribute to physics research.

We're seeking a Member of Technical Staff, DevOps / Infrastructure Engineering to architect, automate, and scale the infrastructure that underpins our large-scale model training and research workflows. This role spans both cloud environments (AWS) and HPC infrastructure (Buzz & Lambda HPC GPU clusters with high-speed interconnects), requiring you to design and codify the systems, pipelines, and automation that enable our researchers and engineers to move fast with confidence. Our ideal candidate brings strong fundamentals in Unix / Linux, deep experience in CI / CD and infrastructure-as-code, and a systems mindset to define standards, build automation, and grow our infrastructure practice from the ground up. You'll be instrumental in building the reliable, scalable foundation that powers our autonomous AI Physicist while partnering closely with training engineers and researchers to accelerate breakthrough scientific discoveries.

Key Responsibilities :

  • Design and run large-scale pre-training experiments for both dense and MoE architectures, from experiment planning through multi-week production runs.
  • Architect hybrid infrastructure solutions that span cloud and on-premises HPC environments seamlessly.
  • Automate configuration management and drift detection using tools like Ansible, Salt, or Chef.
  • Build systems that reduce operational toil and establish guardrails that let researchers focus on experiments, not operations.

CI / CD & Developer Experience :

  • Build and own comprehensive CI / CD pipelines for training workflows, evaluation jobs, internal tools, and services with rollback capabilities, observability, and safety built in.
  • Develop tooling for developer workflows including reproducible builds, ephemeral environments, secrets management, and cluster resource allocation.
  • Create self-service infrastructure patterns that empower researchers and engineers.
  • Design infrastructure that accelerates experimentation while maintaining reliability and reproducibility.

HPC & GPU Cluster Management :

  • Manage and extend HPC environments including GPU clusters, InfiniBand networks, job schedulers (Slurm / Kubernetes hybrid), and container orchestration.
  • Operate containerized and scheduled workloads efficiently across Docker, Kubernetes, and Slurm environments.
  • Optimize cluster scheduling and resource allocation for high-performance GPU workloads.
  • Debug GPU driver quirks, Slurm job issues, and InfiniBand networking hiccups as they arise.

Monitoring, Observability & Reliability :

  • Implement comprehensive monitoring, logging, and alerting across all infrastructure layers using Prometheus, Grafana, ELK / EFK, and OpenTelemetry.
  • Establish SLOs / SLIs for infrastructure reliability and create observability dashboards for long-horizon training runs.
  • Build observability stacks that provide visibility into both system health and job-level performance.
  • Proactively detect and resolve infrastructure issues before they impact research workflows.

Security & Compliance :

  • Implement and manage secrets management and identity security solutions (Vault, KMS, IAM).
  • Champion security best practices, IAM policies, and compliance standards across hybrid infrastructure.
  • Design infrastructure with least privilege principles and strong security hygiene from the start.
  • Maintain zero-trust security posture and comprehensive auditing capabilities.

Collaboration :

  • Partner closely with training engineers and researchers to translate research needs into robust infrastructure solutions.
  • Document best practices, create runbooks, and evangelize DevOps culture across the organization.
  • Mentor teammates on infrastructure patterns, automation techniques, and operational excellence.
  • Enable efficient pre-training runs and safe deployment of new infrastructure patterns through collaboration.

Qualifications :

  • Educational Background : Bachelor's or Master's degree in Computer Science, Engineering, or related field.
  • Experience : 6-10+ years in DevOps, Infrastructure, or SRE roles with proven hands-on systems engineering experience (not just certification-based).
  • Deep Unix / Linux administration expertise including kernel tuning, networking, storage, and process control.
  • Advanced Infrastructure-as-Code experience with Terraform, Pulumi, or CloudFormation.
  • Expertise building CI / CD systems and reproducible build pipelines (GitHub Actions, GitLab CI, Jenkins, etc.).
  • Hands-on experience with AWS (EC2, S3, IAM, VPC, etc.) and cloud infrastructure management.
  • Cluster orchestration and job scheduling experience with Kubernetes and Slurm.
  • Strong monitoring and observability stack experience (Prometheus, Grafana, ELK / EFK, OpenTelemetry).
  • Demonstrated success scaling infrastructure for high-performance or GPU workloads.
  • Track record of managing GPU-accelerated clusters or HPC infrastructure.
  • Experience automating workflows to reduce toil and scaling deployments safely.
  • Skills : Strong programming skills in at least one of Python, Go, or Rust, plus Bash fluency.
  • Collaboration & Communication : Ability to work cross-functionally. Strong communicator who can simplify complex topics for diverse audiences.
  • Mindset : Entrepreneurial & mission-driven, comfortable in a fast-growing, startup-style environment, and motivated by the ambition of tackling one of the greatest scientific challenges in history.
  • Demonstrated passion for physics and for making scientific knowledge accessible and impactful.

Bonus Skills :

  • Prior work with HPC vendors or AI compute providers (Buzz HPC, NVIDIA DGX, Lambda, CoreWeave).
  • Experience designing self-service infrastructure or internal developer platforms.
  • Deep familiarity with GPU cluster management, scheduling, and high-throughput networking (InfiniBand).
  • Security and compliance expertise including zero-trust architectures, secrets management, and auditing frameworks.
  • Cost management and optimization experience for large-scale compute infrastructure.
  • Build system fluency and comfort with modern build tools (CMake, Bazel, Meson, Buck, Ninja).
  • Experience supporting AI / ML research environments and training pipeline infrastructure.
  • "Automation first" mindset - you reduce toil by codifying repeatable operations.
  • Deep understanding of DevOps philosophy, not just the tools - you live and breathe the culture.
  • HPC comfort - you can debug Slurm jobs, GPU driver issues, or InfiniBand problems without hesitation.
  • Cloud + HPC pragmatism - you know when to leverage AWS primitives versus optimizing HPC schedulers.
  • Track record of mentoring and elevating teams, building collaboratively rather than in isolation.
  • Passion for building state-of-the-art platforms with reproducibility and robust CI / CD at their core.

Interested candidates are invited to submit their resume, a cover letter detailing their qualifications and vision for the role, and references. Please include "Member of Technical Staff, DevOps / Infrastructure Engineering" in the cover letter.

Join us at FirstPrinciples and be a part of a transformative journey where science drives progress and unlocks the potential of humanity.
