Machine Learning - Infrastructure
Causal Labs • San Francisco, CA, United States
1 day ago
Job type
  • Full-time
Job description

About us

Our mission is to build causal intelligence, starting with physics models to predict and control the weather.

We're building a small team driven by a deep passion and urgency to solve this civilizationally important problem.

Our founding team has led & shipped models across self‑driving cars, humanoid robotics, protein folding, and video generation at world‑class institutions including Google DeepMind, Cruise, Waymo, Meta, Nabla Bio, and Apple.

Responsibilities

  • Design, deploy, and maintain large distributed ML training and inference clusters
  • Develop efficient, scalable end‑to‑end pipelines to manage petabyte‑scale datasets and model training throughout the entire ML lifecycle
  • Research and test various training approaches including parallelization techniques and numerical precision trade‑offs across different model scales
  • Analyze, profile and debug low‑level GPU operations to optimize performance
  • Stay up‑to‑date on research to bring new ideas to work

What we’re looking for

We value a relentless approach to problem‑solving, rapid execution, and the ability to quickly learn in unfamiliar domains.

  • Strong grasp of state‑of‑the‑art techniques for optimizing training and inference workloads
  • Demonstrated proficiency with distributed training frameworks (e.g., FSDP, DeepSpeed) to train large foundation models
  • Knowledge of cloud platforms (GCP, AWS, or Azure) and their ML / AI service offerings
  • Familiarity with containerization and orchestration frameworks (e.g., Kubernetes, Docker)
  • Background working on distributed task management systems and scalable model serving & deployment architectures
  • Understanding of monitoring, logging, observability, and version control best practices for ML systems

You don’t have to meet every single requirement above.

Benefits

  • Work on deeply challenging, unsolved problems
  • Competitive cash and equity compensation
  • Medical, dental, and vision insurance
  • Catered lunch & dinner
  • Unlimited paid time off
  • Visa sponsorship & relocation support