Research Scientists / Engineers (all levels)
Focus on Vision Data Infrastructure
- Fundamental AI Research Institute
- San Francisco Bay Area, USA
- $250,000 - $600,000 salary + annual bonus
Come join one of the few research institutions globally with the resources to compete with top AI companies: tens of thousands of GPUs to explore state-of-the-art research in LLMs, multimodal, and agentic AI.
We are currently seeking AI talent with expertise in building scalable pipelines for vision data to support both image/video generative training and multimodal alignment. You'll design high-performance pipelines for large-scale image and video datasets, enabling efficient pretraining, alignment, and simulation-based data generation.
Responsibilities:
Vision Data Sourcing & Curation
- Collect and organize image and video data from open datasets and the web.
- Handle data cleaning, filtering, deduplication, and metadata generation.
- Ensure ethical and compliant data collection at scale.
Processing & Augmentation
- Build high-throughput pipelines for vision data preprocessing (frame extraction, resolution normalization, format conversion, latent caching).
- Implement GPU-accelerated augmentation and distributed data loading (WebDataset, TFRecords, Parquet).
Synthetic & Simulation-Based Data Generation
- Use simulation tools (e.g., Unreal Engine 5, Isaac Sim, Unity) to generate high-quality synthetic vision data.
- Create specialized datasets for VLM training, visual reasoning, and agent interaction.
Requirements:
- Strong experience with data engineering, computer vision, or machine learning infrastructure.
- Expertise in building and scaling ETL / data pipelines for large unstructured datasets.
- Proficiency with Python, PyTorch, and distributed data frameworks (e.g., Ray, Spark, Dask).
- Experience with WebDataset, TFRecords, Parquet, or similar high-throughput data formats.
- Familiarity with GPU-accelerated preprocessing, NVIDIA DALI, or equivalent systems.
- Understanding of image/video codecs, data compression, and cloud storage optimization.
Preferred Experience:
- Prior work with simulation-based or synthetic data generation using Unreal Engine, Isaac Sim, or Unity.
- Experience curating datasets for multimodal or vision-language model training.
- Knowledge of data ethics, privacy, and compliance frameworks for large-scale AI datasets.
- Experience contributing to open datasets or data-centric AI research.
Why apply:
- Opportunity to join a fast-growing core team that is already pushing AI breakthroughs
- Highly competitive salary package
- Work alongside ambitious and bright superstars from tech and academia
- Medical, Dental and Vision Insurance
- Relocation package available
Interested in applying? Please click on the 'Easy Apply' button or alternatively email me your resume at stefani.lukic@storm3.com