A company is looking for a Research Engineer - Distributed Training.
Key Responsibilities:
Design, implement, and maintain the distributed LLM training pipeline
Orchestrate multi-node, multi-GPU runs across Kubernetes and internal clusters
Optimize performance, memory usage, and cost across large-scale training workloads
Required Qualifications:
Strong background in PyTorch and distributed training frameworks (DeepSpeed, FSDP, Accelerate)
Hands-on experience with large-scale multi-GPU or multi-node training
Familiarity with the Transformers and Datasets libraries and with mixed-precision training techniques
Understanding of GPUs, containers, and schedulers (Kubernetes, Slurm)
A mindset oriented toward reliability, performance, and clean engineering
Research Engineer • Winston-Salem, North Carolina, United States