Talent.com
Infrastructure Engineer (Hybrid Cloud & Platform)
No longer accepting applications
Aldea Inc • San Francisco, California, United States, 94102
1 day ago
Job type
  • Full-time
Job description

Location: US Remote / Bay Area

Job Type: Full-time

Level: Mid-Level / Senior

About Aldea

Aldea is a multi-modal foundational AI company reimagining the scaling laws of intelligence. We believe today's architectures create unnecessary bottlenecks for the evolution of software. Our mission is to build the next generation of foundational models that power a more expressive, contextual, and intelligent human–machine interface.

The Mission

We are seeking an Infrastructure Engineer to bridge the gap between complex hybrid infrastructure and developer velocity. You will architect a unified platform spanning AWS and Bare Metal Kubernetes.

At this level, you bring technical direction and expertise to the table. You will participate in planning and discussion for architecting resilient infrastructure, drive cross-team initiatives, and mentor other engineers while remaining deeply hands-on. Your ultimate goal is to build a "Golden Path" for engineering: automated releases, deep observability, and a platform experience that feels invisible to the end user.

Key Responsibilities

1. Hybrid Infrastructure & Bare Metal (AWS + K8s)

  • Unified IaC Strategy: Architect and maintain the Terraform codebase for both AWS services (EKS, RDS, VPC) and Bare Metal clusters. You will treat physical infrastructure as mutable software, using tools like Cluster API, Metal3, or Tinkerbell to manage hardware lifecycles.
  • Bare Metal Mastery: Manage multiple production clusters on bare metal with clear separation of environments. You will solve complex challenges including networking (BGP, ECMP), load balancing (MetalLB/Kube-VIP), and storage orchestration (CSI/Rook-Ceph) for stateful workloads.

2. Observability & AI Monitoring

  • Full-Stack Visibility: Contribute to building our stack (Prometheus, Grafana, ELK/Loki) to monitor both EKS and bare metal.
  • AI/GPU Telemetry: Build specialized dashboards for AI workloads. You will track GPU metrics, CPU saturation, and memory pressure to ensure efficient resource utilization.

4. CI/CD & Release Architecture

  • CI/CD at Scale: Architect resilient, multi-region pipelines using GitHub Actions. Automate CI/CD for apps using ArgoCD. You will build and manage a fleet of self-hosted runners to control costs and accelerate feedback loops.
  • Secure Release Engineering: Implement end-to-end workflows: Docker image build → Helm chart release → deployment (GH Actions + ArgoCD). Apply semantic versioning, manage artifacts in centralized registries, and integrate vulnerability scanning.

5. Leadership & Collaboration

  • Technical Direction: Lead design reviews and drive platform roadmaps that balance reliability, cost, and developer productivity.
  • Cross-Functional Partnership: Partner with product, security, and application teams to translate business needs into robust platform capabilities.

Requirements

  • Experience: Infrastructure, DevOps, or SRE roles, with primary ownership of production systems in AWS and Bare Metal Kubernetes.
  • Technical Arsenal: Expert fluency in Terraform, Linux/Bash or Python scripting, GitHub Actions, and ArgoCD.
  • Bare Metal & K8s: Proven experience operating Kubernetes in production, including hybrid setups (EKS + On-Prem). You understand networking (CNI, BGP), storage (CSI), and cluster lifecycle management.
  • Observability Depth: You have moved beyond "out-of-the-box" dashboards. You understand high-cardinality metrics, log retention strategies, and how to debug distributed systems.
  • Platform Mindset: You don't just build servers; you build products for developers.

Bonus

  • Experience with OpenTelemetry (OTEL) for unified tracing.
  • Understanding of eBPF.
  • Experience configuring NVIDIA DCGM for GPU monitoring and handling AI training/inference workloads.

Aldea is proud to be an equal-opportunity employer. We are committed to building a diverse and inclusive culture that celebrates authenticity to win as one. We do not discriminate on the basis of race, religion, color, national origin, gender, gender identity, sexual orientation, age, marital status, disability, protected veteran status, citizenship or immigration status, or any other legally protected characteristics.

Aldea uses E-Verify to confirm employment eligibility in compliance with federal law. For more information please visit: https://www.e-verify.gov.

Please note: We do not accept unsolicited resumes from recruiters or employment agencies and will not be responsible for any fees related to unsolicited resumes.
