Apply to the open roles at Delphi Ventures' portfolio companies.

Full Time

MLE (Pretraining Data)

Nous Research

Posted on Feb 25, 2026

We’re looking for an MLE (Pretraining Data) to lead the construction and scaling of large-scale training corpora for frontier, open-source transformer models. You’ll focus on dataset design, filtering, synthetic data generation, mixture experiments, and empirical evaluation to improve model quality at scale.

Responsibilities:

  • Collecting, filtering, and synthesizing pretraining-scale datasets
  • Designing dataset mixtures and running controlled ablations
  • Performing dataset comparisons and empirical evaluations across training runs
  • Developing end-to-end pipelines for collecting, processing, and evaluating datasets
  • Scaling and maintaining large training corpora across diverse sources
  • Collaborating with training and infrastructure teams to align data strategy with model scaling


Qualifications:

  • Experience building or scaling large pretraining datasets
  • Experience running dataset ablations and mixture experiments
  • Strong Python engineering skills
  • Experience with distributed data processing systems
  • Deep understanding of how dataset composition affects model behavior


Preferred:

  • Experience with distributed data processing frameworks such as DataTrove, Dask, Spark, or similar systems for large-scale dataset construction and transformation
  • Familiarity with synthetic data orchestration systems (e.g., NeMo DataDesigner) and large-scale generation, filtering, and evaluation workflows
  • Experience working with or building large-scale curated datasets similar to FineData, specifically FineWeb-Edu and FinePDFs
  • Familiarity with open model training initiatives such as SmolLM, BLOOM (BigScience), and Nemotron, including exposure to pretraining mixtures, scaling, and evaluation