Talent.com
Distributed Systems Engineer
No longer accepting applications
krea.ai • San Francisco, CA, United States
8 hours ago
Job type
  • Full-time
Job description

About Krea

At Krea, we are building next‑generation AI creative tools. We are dedicated to making AI intuitive and controllable for creatives. Our mission is to build tools that empower human creativity, not replace it. We believe AI is a new medium that allows us to express ourselves through various formats—text, images, video, sound, and even 3D. We’re building better, smarter, and more controllable tools to harness this medium.

This job

Robust, reliable, and scalable distributed systems form the backbone of Krea. These systems support the infrastructure that powers our AI research, real‑time user experiences, and large‑scale model deployments.

Responsibilities

  • Design, build, and maintain large‑scale distributed infrastructure to reliably support AI research and real‑time model serving.
  • Own and scale our multi‑thousand‑node Kubernetes GPU clusters, ensuring efficient and fault‑tolerant operations.
  • Collaborate closely with ML engineers and researchers to architect systems that enable rapid experimentation and deployment.
  • Improve network architecture, optimize load balancing, and streamline operational practices across multi‑zone cloud deployments.

Example Projects

  • Own and manage a large‑scale Kubernetes cluster designed to run extensive ML training and inference workloads.
  • Architect fault‑tolerant systems ensuring uninterrupted model training and real‑time inference despite individual node failures.
  • Develop and implement optimized load‑balancing strategies to efficiently distribute workloads across zones.
  • Create comprehensive monitoring, alerting systems, and operational playbooks for high‑availability clusters.
  • Migrate existing deployments to Infrastructure as Code (Terraform) for reproducibility and scalability.
  • Set up IP‑based rate limiting to prevent GPU abuse.

Strong Candidates May Have Experience With

  • Kubernetes at scale (thousands of nodes)
  • Cloud infrastructure management (AWS / GCP / Azure)
  • High‑performance and fault‑tolerant networking
  • Low‑level Linux interfaces and administration
  • Debugging complex distributed systems in production
  • Python, Golang, Ruby, Rust, and similar systems languages
  • Bonus: Infrastructure as Code (e.g., Terraform)

About Us

  • We’re building AI creative tooling.
  • We’ve raised over $83M from the best investors in Silicon Valley.
  • We’re a team of 12 serving millions of active users, and we’re scaling aggressively.