About the Role
We are looking for an Evaluation Scientist who can work across both hands-on experimentation and automation infrastructure. This role begins with running manual evaluations (e.g., executing and monitoring individual experiments) and progresses toward building scripts, tools, and infrastructure that streamline and automate these processes, with the long-term goal of reducing manual work as much as possible.
The ideal candidate will also bring expertise in coding agents and quality evaluation, enabling them to design robust experiments and improve workflows. While high-level guidance will be provided, candidates should be able to independently define and implement the lower-level details of experiment setup after ramping up. For example, given a high-level requirement for a new type of evaluation, the candidate should be able to propose and execute an implementation plan with detailed steps, metrics, and automation in place.
Key Responsibilities
- Run and manage manual evaluation experiments across AI/ML systems.
- Develop and maintain automation infrastructure (scripts, pipelines, tools) to reduce manual evaluation work.
- Design and execute new types of evaluations, translating broad research questions into structured experiment setups.
- Work with coding agents and applied ML workflows to define and measure quality.
- Define metrics, benchmarks, and evaluation criteria to assess performance and identify gaps.
- Collaborate with research leads to align evaluation design with project goals while owning implementation details.
- Ensure reproducibility, consistency, and scalability of evaluation processes.
Qualifications
- Strong coding skills in Python (or equivalent) for scripting, automation, and experiment design.
- Experience with running and analyzing experiments, including quality evaluation methodologies.
- Knowledge of coding agents, ML models, or applied automation frameworks.
- Ability to work independently: take high-level requirements and define detailed steps for execution.
- 2–4 years of hands-on experience in evaluation, scripting, or applied data science/ML (academic or industry).
- Strong analytical skills with experience in data handling, reporting, and experiment analysis.
Preferred Skills
- Familiarity with evaluation frameworks and automation tools in AI/ML research.
- Experience in building scalable infrastructure for experiments or evaluations.
- Knowledge of experimental design, statistical testing, or quality benchmarking.