> ## Documentation Index
> Fetch the complete documentation index at: https://pymill.com/llms.txt
> Use this file to discover all available pages before exploring further.

# Overview

> How Mill reproduces published benchmark numbers, and a template for adding new ones.

Mill aims to reproduce the numbers reported in model and benchmark papers. Each benchmark page in this section records the exact **evaluation configuration** (the hyperparameters that move scores), the command to reproduce it, and Mill's result side by side with the published figure.

Mill writes accuracy as a fraction in `aggregate.csv` (e.g. `0.5378`); the tables in these pages show percentages to match how papers report them. The `±` column is the bootstrap standard error, in percentage points.

## Benchmarks

- 5-shot multiple-choice knowledge benchmark (text).
- 10-option chain-of-thought successor to MMLU (text).
- Zero-shot image classification (CLIP and VLMs).
- 1000-class zero-shot image classification (run pending).
- College-level multimodal multiple-choice (image + text).
- Single-word audio question answering (audio + text).

## Adding a new benchmark

Create a new `.mdx` file under `docs/reproducibility/`, add it to the Reproducibility group in `docs.json`, and copy the block below. Replace the placeholders and fill the results table from the rollup row of your `aggregate.csv` (the row whose `task` equals the benchmark name, e.g. `mmlu,mmlu`).

````mdx
---
title: ""
description: " reproduced with Mill versus the published result."
---

## Evaluation configuration

| Hyperparameter | Value |
|---|---|
| Benchmark / task | `` |
| n-shots | `` |
| Output type | `` |
| Metric | `` |
| Precision (dtype) | `bfloat16` |
| Backend | `` |

## Reproduce

```bash
mill --output_dir ./results eval "[dtype=bfloat16]"
mill --output_dir ./results collect --metric
```

## Results

``, ``-shot ``: **** ±
measured ****
published baseline **<±delta>**
within σ

Mill reproduces the []() score for `` (****) within one standard error.

### Per-model results

| Model | Mill (``-shot ``) | Reported | Source | Δ |
|---|---|---|---|---|
| `` | **** ± | | []() | <±delta> |
````
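
The conventions described on this page (fractions in `aggregate.csv`, percentages in the tables, a bootstrap `±`, and the "within one standard error" check) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `benchmark`, `accuracy`, and `stderr` column names, the helper names, and every number other than `0.5378` are hypothetical, and the resampling shown is a generic nonparametric bootstrap, not necessarily the procedure Mill uses.

```python
import csv
import io
import random
import statistics

# Hypothetical aggregate.csv contents. Only the `task` column and the
# fraction-style accuracy (0.5378) come from the text above; the other
# column names and all remaining numbers are made up for illustration.
SAMPLE_CSV = """benchmark,task,accuracy,stderr
mmlu,mmlu,0.5378,0.0040
mmlu,abstract_algebra,0.3100,0.0465
"""


def rollup_row(csv_text, benchmark):
    """Return the row whose `task` equals the benchmark name."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["task"] == benchmark:
            return row
    raise KeyError(f"no rollup row for {benchmark!r}")


def bootstrap_standard_error(correct, n_resamples=1000, seed=0):
    """Generic bootstrap: resample per-example 0/1 scores with replacement
    and take the standard deviation of the resampled mean accuracies."""
    rng = random.Random(seed)
    n = len(correct)
    means = [sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples)]
    return statistics.stdev(means)


def compare_to_published(row, published_percent):
    """Convert fractions to percentage points and compare with a paper's figure."""
    measured = 100.0 * float(row["accuracy"])  # fraction -> percent
    stderr = 100.0 * float(row["stderr"])      # fraction -> percentage points
    delta = measured - published_percent
    return {
        "measured": round(measured, 2),
        "stderr": round(stderr, 2),
        "delta": round(delta, 2),
        "within_one_se": abs(delta) <= stderr,
    }


if __name__ == "__main__":
    row = rollup_row(SAMPLE_CSV, "mmlu")
    # 54.0 is a placeholder "published" value, not a real reported score.
    print(compare_to_published(row, published_percent=54.0))
```

The rollup lookup filters on `task` alone, as the instructions above describe. If your `aggregate.csv` uses different column names for the score and its standard error, adjust the two `float(row[...])` lookups accordingly.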