Talent.com
Pod Networking Software Engineer
Etched.ai, Inc. • San Jose, CA, United States
1 day ago
Job type
  • Full-time
Job description

Overview

We are seeking highly motivated and skilled Pod Networking Software Engineers to join our System Software team. This team plays a critical role in developing, qualifying, and optimizing high-performance networking solutions for large-scale inference workloads. As a Pod Networking Software Engineer, you will focus on developing and qualifying software that drives communication among Sohu inference nodes in multi-rack inference clusters. You will collaborate closely with kernel, platform, and telemetry teams to push the boundaries of peer-to-peer RDMA efficiency.

Key Responsibilities

High-Performance Peer-to-Peer Networking: Design, develop, and implement RDMA-based network peering that supports high-bandwidth, low-latency communication across PCIe nodes within and across racks. This includes work across the operating system, kernel drivers, embedded software, and system software.

Test Development: Develop tests that qualify host processors (x86), NICs, ToR switches, and device network interfaces for high performance.

Burn-in Integration: Furnish burn-in teams with tests that represent real-world use cases and workloads for device-to-device networking, as well as extreme-load stress tests.

Performance/Health Telemetry Design: Define the key metrics that system software must collect to maintain high availability and performance under extreme communications workloads.

Representative Projects

Analyze performance deviations, optimize network stack configurations, and propose kernel tuning parameters for low-latency, high-bandwidth inference workloads.

Design and execute automated qualification tests for RDMA NICs and interconnects across various server configurations.

Identify and root-cause firmware, driver, and hardware issues that impact RDMA performance and reliability.

Collaborate with ODMs and silicon vendors to validate new RDMA features and enhancements.

Implement and validate peer RDMA support for GPU-to-GPU and accelerator-to-accelerator communication.

Modify kernel drivers and user-space libraries to optimize direct memory access between inference pods.

Profile and benchmark inter-node RDMA latency and bandwidth to improve inference job scaling.

Optimize NIC and switch configurations to balance throughput, congestion control, and reliability.

Must-Have Skills and Experience

Proficiency in C/C++.

Proficiency in at least one scripting language (e.g., Python, Bash, Go).

Strong experience with device-to-device networking technologies (RDMA, GPUDirect, etc.), including RoCE.

Experience with zero-copy networking, RDMA verbs and memory registration.

Familiarity with queue pairs, completion queues, and transport types.

Strong understanding of operating systems (Linux preferred) and server hardware architectures.

Ability to analyze complex technical problems and provide effective solutions.

Excellent communication and collaboration skills.

Ability to work independently and as part of a team.

Experience with version control systems (e.g., Git).

Experience with reading and interpreting hardware logs.

Nice-to-Have Skills and Experience

Experience with networking technologies like NVLink, InfiniBand, and ML pod interconnects.

Experience with widely deployed top-of-rack switches (Cisco, Juniper, Arista, etc.).

Knowledge of server virtualization.

Experience with tracing tools like perf, eBPF, ftrace, etc.

Experience with performance testing and benchmarking tools (gprof, VTune, Wireshark, etc.).

Familiarity with hardware diagnostic tools and techniques.

Experience with containerization technologies (e.g., Docker, Kubernetes).

Experience with CI/CD pipelines.

Experience with Rust.

Ideal Background

Candidates who have worked on GPU or TPU pods, specifically in the networking domain.

Candidates who understand the uptime challenges of very large ML deployments.

Candidates who have actively debugged complex network topologies, specifically dealing with cases of node dropouts / failures, route-arounds, and pod resiliency at large.

Candidates who understand the performance implications of pod networking software.

Benefits

Full medical, dental, and vision packages, with generous premium coverage

Housing subsidy of $2,000/month for those living within walking distance of the office

Daily lunch and dinner in our office

Relocation support for those moving to West San Jose

Compensation Range

$150,000 - $275,000

How we're different

Etched believes in the Bitter Lesson. We think most of the progress in the AI field has come from using more FLOPs to train and run models, and the best way to get more FLOPs is to build model-specific hardware. Larger and larger training runs encourage companies to consolidate around fewer model architectures, which creates a market for single-model ASICs.

We are a fully in-person team in West San Jose, and greatly value engineering skills. We do not have boundaries between engineering and research, and we expect all of our technical staff to contribute to both as needed.

