A company is looking for a Machine Learning Engineer (Pretraining Data) to lead the construction and scaling of large-scale training corpora for open-source transformer models.
Key Responsibilities
Collecting, filtering, and synthesizing pretraining-scale datasets
Designing dataset mixtures and running controlled ablations
Developing end-to-end pipelines for collecting, processing, and evaluating datasets
Qualifications
Experience building or scaling large pretraining datasets
Experience running dataset ablations and mixture experiments
Strong Python engineering skills
Experience with distributed data processing systems
Deep understanding of how dataset composition affects model behavior
Machine Learning Engineer • Ann Arbor, Michigan, United States