Senior Software Engineer - Together Cloud Infrastructure

Together AI
San Francisco, CA, United States
30+ days ago
Job type
  • Full-time
Job description

Together AI is building the AI Acceleration Cloud, an end-to-end platform for the full generative AI lifecycle, combining the fastest LLM inference engine with state-of-the-art AI cloud infrastructure.

As a Senior AI Infrastructure Engineer, you will play a key role in building the next-generation AI cloud platform: a highly available, global, blazing-fast cloud infrastructure that virtualizes cutting-edge ML hardware (GB200s/GB300s, BlueField DPUs) and equips state-of-the-art ML practitioners with self-serve AI cloud services, such as on-demand and managed Kubernetes and Slurm clusters. This platform serves both our internal SaaS products (inference, fine-tuning) and our external cloud customers, spanning dozens of data centers across the world.

Some of what you'll work on:

  • Design, build, and maintain performant, secure, and highly available backend services/operators that run in our data centers and automate hardware management, such as InfiniBand partitioning, in-DC parallel storage provisioning, and VM provisioning.
  • Design and build out the IaaS software layer for a new GB200 data center with thousands of GPUs.
  • Work on a global multi-exabyte high-performance object store, serving massive datasets for pretraining.
  • Build advanced observability stacks for our customers with automated node lifecycle management for fault-tolerant distributed pretraining.

To be successful, you'll need to be deeply technical and possess excellent communication, collaboration, and diplomacy skills. You have strong fundamental software development skills, along with strong systems knowledge and troubleshooting abilities.

Requirements

  • 5+ years of professional software development experience and proficiency in at least one backend programming language (Golang desired)
  • 5+ years of experience writing high-performance, well-tested, production-quality code
  • Demonstrated experience building and operating high-performance and/or globally distributed microservice architectures across one or more cloud providers (AWS, Azure, GCP)
  • Excellent communication skills: able to write clear design docs and work effectively with both technical and non-technical team members
  • Deep experience with Kubernetes internals a big plus, such as implementing non-trivial Kubernetes operators, device/storage/network plugins, custom schedulers, or patches to these or to Kubernetes itself
  • Deep experience with VMs/hypervisors a big plus, such as QEMU/KVM, cloud-hypervisor, VFIO, virtio, PCIe passthrough, KubeVirt, SR-IOV
  • Deep experience with DC networking tech and solutions a big plus, such as VLAN, VXLAN, VPN, VPC, OVS/OVN
  • Experience with Cluster API or similar a big plus
  • Experience working on high-performance compute, networking, and/or storage a big plus
  • Experience virtualizing GPUs and/or InfiniBand a big plus
  • Strong systems knowledge across compute, networking, and storage, including concurrency, memory management, performant I/O, and scale
  • Experience with infrastructure automation tools (Terraform, Ansible), monitoring/observability stacks (Prometheus, Grafana), and CI/CD pipelines (GitHub Actions, ArgoCD)
  • Experience building IaaS or PaaS systems at scale a plus
  • Experience with DPUs/SmartNICs a plus
  • GPU programming, NCCL, CUDA knowledge a plus
Responsibilities

  • Perform architecture and research work for decentralized AI workloads
  • Work on the core, open-source Together AI platform
  • Create services, tools, and developer documentation
  • Create testing frameworks for robustness and fault-tolerance
About Together AI

Together AI is a research-driven artificial intelligence company. We believe open and transparent AI systems will drive innovation and create the best outcomes for society, and together we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models. We have contributed to leading open-source research, models, and datasets to advance the frontier of AI, and our team has been behind technological advancements such as FlashAttention, Hyena, FlexGen, and RedPajama. We invite you to join a passionate group of researchers on our journey to build the next-generation AI infrastructure.

Compensation

We offer competitive compensation, startup equity, health insurance, and other benefits, as well as flexibility in terms of remote work. The US base salary range for this full-time position is: $160,000 - $230,000 + equity + benefits. Our salary ranges are determined by location, level, and role. Individual compensation will be determined by experience, skills, and job-related knowledge.

Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more.

    Are you able to work 4 days per week in our SF office?
