# Overview

> How Mill reproduces published benchmark numbers, and a template for adding new ones.

Mill aims to reproduce the numbers reported in model and benchmark papers. Each benchmark page in this section records the exact **evaluation configuration** (the hyperparameters that move scores), the command to reproduce it, and Mill's result side by side with the published figure.

<Note>
  Mill writes accuracy as a fraction in `aggregate.csv` (e.g. `0.5378`); the tables in these pages show percentages (e.g. `53.78%`) to match how papers report them. The value after `±` is the bootstrap standard error, in percentage points.
</Note>
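
As an illustration of the two conventions in the note, the sketch below converts a fractional accuracy to a percentage and estimates a bootstrap standard error in percentage points from per-example 0/1 scores. It is not Mill's implementation: the helper name `bootstrap_se_pct`, the number of resamples, the seed, and the toy data are all assumptions chosen for the example.

```python
import random
import statistics


def bootstrap_se_pct(correct, n_resamples=1000, seed=0):
    """Return (accuracy in %, bootstrap standard error in percentage points).

    `correct` is a list of 0/1 per-example scores. Hypothetical helper,
    not part of Mill.
    """
    rng = random.Random(seed)
    n = len(correct)
    accuracy = sum(correct) / n  # a fraction, like the values in aggregate.csv
    # Resample the examples with replacement and recompute accuracy each time.
    resampled = [
        sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples)
    ]
    se = statistics.stdev(resampled)  # spread of the resampled accuracies
    return 100 * accuracy, 100 * se


scores = [1] * 538 + [0] * 462  # toy data: 538 of 1000 examples correct
acc_pct, se_pct = bootstrap_se_pct(scores)
print(f"{acc_pct:.2f}% ± {se_pct:.2f}")
```

For a plain accuracy metric the bootstrap estimate should land close to the binomial value `sqrt(p * (1 - p) / n)`, which makes a quick sanity check on any reported `±`.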

## Benchmarks

<CardGroup cols={2}>
  <Card title="MMLU" icon="list-check" href="/docs/reproducibility/mmlu">
    5-shot multiple-choice knowledge benchmark (text).
  </Card>

  <Card title="MMLU-Pro" icon="layer-group" href="/docs/reproducibility/mmlu-pro">
    10-option chain-of-thought successor to MMLU (text).
  </Card>

  <Card title="CIFAR-10" icon="image" href="/docs/reproducibility/cifar10">
    Zero-shot image classification (CLIP and VLMs).
  </Card>

  <Card title="ImageNet-1k" icon="images" href="/docs/reproducibility/imagenet">
    1000-class zero-shot image classification (run pending).
  </Card>

  <Card title="MMMU-Pro" icon="layer-group" href="/docs/reproducibility/mmmu_pro">
    College-level multimodal multiple-choice (image + text).
  </Card>

  <Card title="Clotho-AQA" icon="waveform-lines" href="/docs/reproducibility/clotho_aqa">
    Single-word audio question answering (audio + text).
  </Card>
</CardGroup>

## Adding a new benchmark

Create a new `.mdx` file under `docs/reproducibility/`, add it to the Reproducibility group in `docs.json`, and paste in the template below. Replace the placeholders, then fill the results table from the rollup row of your `aggregate.csv` (the row whose `task` equals the benchmark name, e.g. `mmlu,mmlu`), converting the fractions to percentages.

````mdx
---
title: "<Benchmark name>"
description: "<Benchmark> reproduced with Mill versus the published result."
---

<One line on the task and how it is scored.>

## Evaluation configuration

| Hyperparameter | Value |
|---|---|
| Benchmark / task | `<task>` |
| n-shots | `<n>` |
| Output type | `<GENERATIVE \| LOGPROBS \| PERPLEXITY>` |
| Metric | `<metric>` |
| Precision (dtype) | `bfloat16` |
| Backend | `<hf \| vllm \| litellm>` |

## Reproduce

```bash
mill --output_dir ./results eval "<model>[dtype=bfloat16]" <task>
mill --output_dir ./results collect --metric <metric>
```

## Results

`<model>`, `<n>`-shot `<metric>`:

<CardGroup cols={3}>
  <Card title="Mill" icon="gauge">
    **<xx.xx%>** ± <se>
    <br />measured
  </Card>
  <Card title="Reported" icon="book">
    **<yy.yy%>**
    <br />published baseline
  </Card>
  <Card title="Difference" icon="scale-balanced">
    **<±delta>**
    <br />within <N>σ
  </Card>
</CardGroup>

<Check>
Mill reproduces the [<source>](<url>) score for `<model>` (**<yy.yy%>**) within <N> standard errors.
</Check>

### Per-model results

| Model | Mill (`<n>`-shot `<metric>`) | Reported | Source | Δ |
|---|---|---|---|---|
| `<model>` | **<xx.xx%>** ± <se> | <yy.yy%> | [<source>](<url>) | <±delta> |
````
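
The arithmetic behind the `<xx.xx%>`, `<se>`, `<±delta>`, and `<N>` placeholders can be sketched as follows. The column names (`benchmark`, `task`, `accuracy`, `stderr`), the assumption that the standard error is also stored as a fraction, and every number except the `0.5378` example are made up for illustration; check them against the header of your own `aggregate.csv`, and take the reported figure from the paper you cite.

```python
import csv
import io
import math

# Made-up stand-in for aggregate.csv; the column names are assumptions.
AGGREGATE_CSV = """benchmark,task,accuracy,stderr
mmlu,abstract_algebra,0.3100,0.0465
mmlu,mmlu,0.5378,0.0042
"""


def rollup_row(csv_text, benchmark):
    """Return the row whose task equals the benchmark name."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["benchmark"] == benchmark and row["task"] == benchmark:
            return row
    raise KeyError(f"no rollup row for {benchmark!r}")


def template_values(row, reported_pct):
    """Convert fractions to percentages and compare with the published figure."""
    mill_pct = 100 * float(row["accuracy"])
    se_pct = 100 * float(row["stderr"])  # percentage points
    delta = mill_pct - reported_pct
    # Smallest whole number of standard errors that covers the difference.
    n_sigma = max(1, math.ceil(abs(delta) / se_pct))
    return mill_pct, se_pct, delta, n_sigma


row = rollup_row(AGGREGATE_CSV, "mmlu")
# 54.10 is a placeholder, not a published result.
mill_pct, se_pct, delta, n_sigma = template_values(row, reported_pct=54.10)
print(f"{mill_pct:.2f}% ± {se_pct:.2f} | {delta:+.2f} | within {n_sigma}σ")
```

Rounding `n_sigma` up to a whole number keeps the "within Nσ" claim conservative: the difference is never described as closer than it is.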
