Overview
We are building a distributed LLM inference network that combines idle GPU capacity from around the world into a single cohesive plane of compute for running large language models like DeepSeek and Llama 4. At any given moment, we have over 5,000 GPUs and hundreds of terabytes of VRAM connected to the network. We are a small, well-funded team working on difficult, high-impact problems at the intersection of AI and distributed systems. We primarily work in-person from our office in downtown San Francisco.
Responsibilities
- Design and implement optimization techniques to increase model throughput and reduce latency across our suite of models
- Deploy and maintain large language models at scale in production environments
- Deploy new models as they are released by frontier labs
- Implement techniques like quantization, speculative decoding, and KV cache reuse (see the sketch after this list)
- Contribute regularly to open source projects such as SGLang and vLLM
- Dive deep into the underlying codebases of PyTorch, TensorRT, TensorRT-LLM, vLLM, SGLang, CUDA, and other libraries to debug ML performance issues
- Collaborate with the engineering team to bring new features and capabilities to our inference platform
- Develop robust and scalable infrastructure for AI model serving
- Create and maintain technical documentation for inference systems
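For a sense of the kind of work involved, here is a minimal sketch of one of the techniques named above: post-training dynamic quantization in PyTorch. The two-layer MLP, its dimensions, and the input shape are hypothetical stand-ins for a transformer feed-forward block, not our production stack.

```python
# Illustrative sketch only: post-training dynamic quantization in PyTorch.
# The model and layer sizes below are hypothetical placeholders.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(4096, 11008),  # hypothetical hidden sizes
    nn.GELU(),
    nn.Linear(11008, 4096),
)
model.eval()

# Convert Linear weights to int8 at load time; activations are quantized
# dynamically per batch, trading a little accuracy for memory and speed.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

with torch.no_grad():
    out = quantized(torch.randn(1, 4096))
print(out.shape)  # torch.Size([1, 4096])
```

In production LLM serving, the same idea typically shows up as weight-only INT8/FP8 or AWQ/GPTQ-style schemes inside engines like vLLM and TensorRT-LLM rather than this eager-mode API.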
Requirements
- 3+ years of experience writing high-performance, production-quality code
- Strong proficiency with Python and deep learning frameworks, particularly PyTorch
- Demonstrated experience with LLM inference optimization techniques
- Hands-on experience with SGLang and vLLM, with contributions to these projects strongly preferred
- Familiarity with Docker and Kubernetes for containerized deployments
- Experience with CUDA programming and GPU optimization
- Strong understanding of distributed systems and scalability challenges
- Proven track record of optimizing AI models for production environments
Nice to Have
- Familiarity with TensorRT and TensorRT-LLM
- Knowledge of vision models and multimodal AI systems
- Experience implementing techniques like quantization and speculative decoding
- Contributions to open source machine learning projects
- Experience with large-scale distributed computing
Compensation
We offer competitive compensation and equity in a high-growth startup. The base salary range for this role is $180,000 - $250,000, plus equity and benefits including:
- Full healthcare coverage
- Quarterly offsites
- Flexible PTO
Skills: PyTorch, GPU optimization, deep learning frameworks, SGLang, vLLM, CUDA programming, machine learning, Python, LLM