Job Description
About Us
Clockwork.io – A Software-Driven Revolution in AI Networking
Clockwork Systems was founded by Stanford researchers and veteran systems engineers who share a vision for redefining the foundations of distributed computing. As AI workloads grow increasingly complex, traditional infrastructure struggles to meet the demands of performance, reliability, and precise coordination. Clockwork is pioneering a software-driven approach to AI networking, delivering deterministic time, ultra-low latency, and seamless scalability for modern distributed systems.
To learn more, visit www.clockwork.io.
About the Role
We are looking for an experienced software engineer to help build, optimize, and maintain large-scale distributed training infrastructure based on the PyTorch ecosystem. This role focuses on production-grade training workflows involving multi-GPU and multi-node orchestration, high-performance communication layers, and advanced parallelism strategies.
You'll work alongside infrastructure and machine learning teams to ensure training jobs are efficient, scalable, and resilient.
What You'll Do
- Develop and support distributed PyTorch training jobs using torch.distributed/c10d
- Integrate and maintain frameworks like Megatron-LM, DeepSpeed, and related LLM training stacks
- Diagnose and resolve distributed training issues (e.g., NCCL hangs, OOM, checkpoint corruption)
- Optimize performance across communication, I/O, and memory bottlenecks
- Implement fault tolerance, checkpointing, and recovery mechanisms for long-running jobs
- Write tooling and scripts to streamline training workflows and experiment management
- Collaborate with ML engineers to ensure compatibility with orchestration and container environments (e.g., Slurm, Kubernetes)
What We're Looking For
- Deep experience with PyTorch and torch.distributed (c10d)
- Hands-on experience with at least one of: Megatron-LM, DeepSpeed, or FairScale
- Proficiency in Python and Linux shell scripting
- Experience with multi-node GPU clusters using Slurm, Kubernetes, or similar
- Strong understanding of NCCL, collective communication, and GPU topology
- Familiarity with debugging tools and techniques for distributed systems
Preferred Skills
- Experience scaling LLM training across 8+ GPUs and multiple nodes
- Knowledge of tensor, pipeline, and data parallelism
- Familiarity with containerized training environments (Docker, Singularity)
- Exposure to HPC environments or cloud GPU infrastructure
- Experience with training workload orchestration tools or custom job launchers
- Comfort with large-scale checkpointing, resume/restart logic, and model I/O
Bonus Skills
- Profiling tools: PyTorch Profiler, Nsight, nvprof, or equivalent
- Experience with performance tuning in distributed training environments
- Contributions to ML infrastructure open-source projects
- Familiarity with storage, networking, or RDMA/GPUDirect technologies
- Understanding of observability in ML pipelines (metrics, logs, dashboards)
Enjoy
- Challenging projects
- A friendly and inclusive workplace culture
- Competitive compensation
- A great benefits package
- Catered lunch

Clockwork Systems is an equal opportunity employer. We are committed to building world-class teams by welcoming bright, passionate individuals from all backgrounds. All qualified applicants will receive consideration for employment without regard to race, color, ancestry, religion, age, sex, sexual orientation, gender identity or expression, national origin, disability, or protected veteran status. We believe diversity drives innovation, and we grow stronger together.