Mill aims to reproduce the numbers reported in model and benchmark papers. Each benchmark page in this section records the exact evaluation configuration (the hyperparameters that move scores), the command to reproduce it, and Mill’s result side by side with the published figure.
Mill writes accuracy as a fraction in `aggregate.csv` (e.g. 0.5378); the tables in these pages show percentages to match how papers report them. The ± column is the bootstrap standard error, in percentage points.
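
As a rough illustration of that convention, the sketch below estimates a bootstrap standard error from per-example 0/1 scores and formats a fractional accuracy the way the tables present it. It is not Mill's implementation: the function names `bootstrap_se` and `as_percent`, the resample count, and the seed are all assumptions made for this example.

```python
import random
import statistics


def bootstrap_se(scores, n_resamples=1000, seed=0):
    """Estimate the standard error of mean accuracy by resampling with replacement.

    `scores` is a list of per-example results (1 = correct, 0 = wrong).
    The resample count and seed are illustrative defaults, not Mill's settings.
    """
    rng = random.Random(seed)
    n = len(scores)
    # Each resample draws n examples with replacement and records its mean accuracy.
    means = [sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples)]
    # The spread of the resampled means approximates the standard error.
    return statistics.stdev(means)


def as_percent(accuracy, se):
    """Format a fractional accuracy and standard error as 'xx.xx% ± y.yy'."""
    return f"{accuracy * 100:.2f}% ± {se * 100:.2f}"


if __name__ == "__main__":
    # Hypothetical standard error of 0.0041, paired with the example fraction 0.5378.
    print(as_percent(0.5378, 0.0041))
```

Because the standard error is also multiplied by 100, the ± value reads directly in percentage points.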

Benchmarks

MMLU

5-shot multiple-choice knowledge benchmark (text).

MMLU-Pro

10-option chain-of-thought successor to MMLU (text).

CIFAR-10

Zero-shot image classification (CLIP and VLMs).

ImageNet-1k

1000-class zero-shot image classification (run pending).

MMMU-Pro

College-level multimodal multiple-choice (image + text).

Clotho-AQA

Single-word audio question answering (audio + text).

Adding a new benchmark

Create a new `.mdx` file under `docs/reproducibility/`, add it to the Reproducibility group in `docs.json`, and copy the block below. Replace the placeholders, then fill the results table from the rollup row of your `aggregate.csv`: the row whose task equals the benchmark name (e.g. the `mmlu,mmlu` row).
````mdx
---
title: "<Benchmark name>"
description: "<Benchmark> reproduced with Mill versus the published result."
---

<One line on the task and how it is scored.>

## Evaluation configuration

| Hyperparameter | Value |
|---|---|
| Benchmark / task | `<task>` |
| n-shots | `<n>` |
| Output type | `<GENERATIVE | LOGPROBS | PERPLEXITY>` |
| Metric | `<metric>` |
| Precision (dtype) | `bfloat16` |
| Backend | `<hf | vllm | litellm>` |

## Reproduce

```bash
mill --output_dir ./results eval "<model>[dtype=bfloat16]" <task>
mill --output_dir ./results collect --metric <metric>
```

## Results

`<model>`, `<n>`-shot `<metric>`:

<CardGroup cols={3}>
  <Card title="Mill" icon="gauge">
    **<xx.xx%>** ± <se>
    <br />measured
  </Card>
  <Card title="Reported" icon="book">
    **<yy.yy%>**
    <br />published baseline
  </Card>
  <Card title="Difference" icon="scale-balanced">
    **<±delta>**
    <br />within <N>σ
  </Card>
</CardGroup>

<Check>
Mill reproduces the [<source>](<url>) score for `<model>` (**<yy.yy%>**) within one standard error.
</Check>

### Per-model results

| Model | Mill (`<n>`-shot `<metric>`) | Reported | Source | Δ |
|---|---|---|---|---|
| `<model>` | **<xx.xx%>** ± <se> | <yy.yy%> | [<source>](<url>) | <±delta> |
````
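
The `<±delta>` and `<N>σ` placeholders in the template can be derived from the Mill score, its standard error, and the published figure. The sketch below shows one plausible way to compute them, assuming the difference is taken as Mill minus reported and that `<N>` is the absolute difference divided by the standard error. The helper names `compare` and `format_delta` are hypothetical, and the numbers in the usage lines are made up for illustration.

```python
def compare(mill_pct, se_pp, reported_pct):
    """Return (delta, n_sigma) for a Mill score against a published score.

    All arguments are in percent / percentage points. The sign convention
    (Mill minus reported) is an assumption of this sketch.
    """
    if se_pp <= 0:
        raise ValueError("standard error must be positive")
    delta = mill_pct - reported_pct
    # How many standard errors separate the two scores.
    n_sigma = abs(delta) / se_pp
    return delta, n_sigma


def format_delta(delta):
    """Format a difference with an explicit sign, e.g. '+0.25' or '-0.12'."""
    return f"{delta:+.2f}"


if __name__ == "__main__":
    # Hypothetical values: Mill 53.78% ± 0.41 against a reported 53.90%.
    delta, n_sigma = compare(53.78, 0.41, 53.90)
    print(format_delta(delta), f"({n_sigma:.2f} standard errors)")
```

A value of `n_sigma` at or below 1 corresponds to the "within one standard error" wording in the `<Check>` block; larger values call for a different statement.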