How Mill reproduces published benchmark numbers, and a template for adding new ones.
Mill aims to reproduce the numbers reported in model and benchmark papers. Each benchmark page in this section records the exact evaluation configuration (the hyperparameters that move scores), the command to reproduce it, and Mill’s result side by side with the published figure.
Mill writes accuracy as a fraction in `aggregate.csv` (e.g. `0.5378`); the tables in these pages show percentages to match how papers report them. The ± column is the bootstrap standard error, in percentage points.
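As a rough illustration of that conversion, the sketch below turns a fractional accuracy into the percentage shown in the tables and expresses the gap to a published figure in units of the standard error. The function name `compare_to_reported` is hypothetical, the standard error and reported value are placeholders rather than real results, and the sketch assumes the standard error is stored as a fraction like the accuracy (check your own `aggregate.csv` before relying on that).

```python
def compare_to_reported(mill_accuracy: float, mill_stderr: float, reported_percent: float) -> dict:
    """Convert fractional Mill outputs to percentages and compare with a published figure.

    Assumption: both mill_accuracy and mill_stderr are fractions (0.5378, not 53.78).
    reported_percent is the published number as papers usually print it (e.g. 54.00).
    """
    mill_percent = mill_accuracy * 100.0
    se_points = mill_stderr * 100.0          # standard error in percentage points
    delta = mill_percent - reported_percent  # signed difference, in percentage points
    # Number of standard errors separating the two figures.
    n_sigma = abs(delta) / se_points if se_points > 0 else float("inf")
    return {
        "mill_percent": mill_percent,
        "se_points": se_points,
        "delta": delta,
        "n_sigma": n_sigma,
    }


# 0.5378 is the example fraction from the text above; the other two values are placeholders.
result = compare_to_reported(mill_accuracy=0.5378, mill_stderr=0.0040, reported_percent=54.00)
print(f"Mill: {result['mill_percent']:.2f}% ± {result['se_points']:.2f}")
print(f"Δ: {result['delta']:+.2f} ({result['n_sigma']:.2f}σ)")
```

The formatted strings map onto the `<xx.xx%>`, `<se>`, `<±delta>`, and `<N>σ` placeholders used in a benchmark page.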
Create a new `.mdx` file under `docs/reproducibility/`, add it to the Reproducibility group in `docs.json`, and copy the block below. Replace the placeholders and fill the results table from the rollup row of your `aggregate.csv` (the row whose task equals the benchmark name, e.g. `mmlu,mmlu`).
````mdx
---
title: "<Benchmark name>"
description: "<Benchmark> reproduced with Mill versus the published result."
---

<One line on the task and how it is scored.>

## Evaluation configuration

| Hyperparameter | Value |
|---|---|
| Benchmark / task | `<task>` |
| n-shots | `<n>` |
| Output type | `<GENERATIVE | LOGPROBS | PERPLEXITY>` |
| Metric | `<metric>` |
| Precision (dtype) | `bfloat16` |
| Backend | `<hf | vllm | litellm>` |

## Reproduce

```bash
mill --output_dir ./results eval "<model>[dtype=bfloat16]" <task>
mill --output_dir ./results collect --metric <metric>
```

## Results

`<model>`, `<n>`-shot `<metric>`:

<CardGroup cols={3}>
  <Card title="Mill" icon="gauge">
    **<xx.xx%>** ± <se>
    <br />measured
  </Card>
  <Card title="Reported" icon="book">
    **<yy.yy%>**
    <br />published baseline
  </Card>
  <Card title="Difference" icon="scale-balanced">
    **<±delta>**
    <br />within <N>σ
  </Card>
</CardGroup>

<Check>
Mill reproduces the [<source>](<url>) score for `<model>` (**<yy.yy%>**) within one standard error.
</Check>

### Per-model results

| Model | Mill (`<n>`-shot `<metric>`) | Reported | Source | Δ |
|---|---|---|---|---|
| `<model>` | **<xx.xx%>** ± <se> | <yy.yy%> | [<source>](<url>) | <±delta> |
````
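To fill the results table, the rollup row has to be picked out of `aggregate.csv`. The sketch below shows one way to do that with Python's standard `csv` module. The column names (`benchmark`, `task`, `accuracy`, `stderr`), the subtask name `mmlu_anatomy`, and every sample value other than `0.5378` are assumptions made for illustration, not Mill's documented schema. Adjust them to match the header of your own file.

```python
import csv
import io


def find_rollup_row(csv_file, benchmark: str) -> dict:
    """Return the row whose task equals the benchmark name (the rollup row).

    Assumption: the file has 'benchmark' and 'task' columns, so the rollup
    row for MMLU starts with 'mmlu,mmlu'. These column names are hypothetical.
    """
    for row in csv.DictReader(csv_file):
        if row["benchmark"] == benchmark and row["task"] == benchmark:
            return row
    raise LookupError(f"no rollup row found for benchmark {benchmark!r}")


# In-memory stand-in for aggregate.csv; the header and values are placeholders.
sample = io.StringIO(
    "benchmark,task,accuracy,stderr\n"
    "mmlu,mmlu_anatomy,0.6100,0.0120\n"
    "mmlu,mmlu,0.5378,0.0040\n"
)

row = find_rollup_row(sample, "mmlu")
print(f"{float(row['accuracy']) * 100:.2f}% ± {float(row['stderr']) * 100:.2f}")
```

With a real file, pass an open handle instead of the in-memory sample, for example `open("results/aggregate.csv", newline="")`, where the path is likewise a guess based on the `--output_dir ./results` flag in the template.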