Machine Learning - Infrastructure

Causal Labs • San Francisco, CA, United States

Posted 1 day ago

Employment type

Full-time

Job description

About us

Our mission is to build causal intelligence, starting with physics models to predict and control the weather.

We're building a small team driven by a deep passion and urgency to solve this civilizationally important problem.

Our founding team has led & shipped models across self‑driving cars, humanoid robotics, protein folding, and video generation at world‑class institutions including Google DeepMind, Cruise, Waymo, Meta, Nabla Bio, and Apple.

Responsibilities

Design, deploy, and maintain large distributed ML training and inference clusters
Develop efficient, scalable end‑to‑end pipelines to manage petabyte‑scale datasets and model training throughout the entire ML lifecycle
Research and test various training approaches including parallelization techniques and numerical precision trade‑offs across different model scales
Analyze, profile, and debug low‑level GPU operations to optimize performance
Stay up‑to‑date on research and bring new ideas into our work

What we’re looking for

We value a relentless approach to problem‑solving, rapid execution, and the ability to quickly learn in unfamiliar domains.

Strong grasp of state‑of‑the‑art techniques for optimizing training and inference workloads

Demonstrated proficiency with distributed training frameworks (e.g., FSDP, DeepSpeed) to train large foundation models

Knowledge of cloud platforms (GCP, AWS, or Azure) and their ML / AI service offerings

Familiarity with containerization and orchestration frameworks (e.g., Kubernetes, Docker)

Background working on distributed task management systems and scalable model serving & deployment architectures

Understanding of monitoring, logging, observability, and version control best practices for ML systems

You don’t have to meet every single requirement above.

Benefits

Work on deeply challenging, unsolved problems

Competitive cash and equity compensation

Medical, dental, and vision insurance

Catered lunch & dinner

Unlimited paid time off

Visa sponsorship & relocation support
