# MMLU-Pro

> MMLU-Pro reproduced with Mill — generative chain-of-thought, 10-option multiple choice.

MMLU-Pro is a harder successor to MMLU: \~12K questions across 14 categories with up to 10 answer options each. Mill runs it as **generative chain-of-thought** — the model reasons step by step and ends with `Answer: $LETTER`, and the answer letter is regex-extracted from the response (mirroring lighteval's `mmlu_pro` task).
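The extraction step described above can be sketched in a few lines of Python. This is a minimal illustration, not Mill's or lighteval's actual implementation: the regex, the choice to take the last match, and the helper names `extract_letter` and `accuracy` are all assumptions made for the example.

```python
import re
from typing import Optional, Sequence

# Assumed pattern: "Answer:" followed by a single option letter A-J.
# The exact regex used by Mill is not shown on this page.
ANSWER_RE = re.compile(r"Answer:\s*([A-J])\b")


def extract_letter(response: str) -> Optional[str]:
    """Return the last 'Answer: X' letter in a response, or None if absent."""
    matches = ANSWER_RE.findall(response)
    return matches[-1] if matches else None


def accuracy(responses: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of responses whose extracted letter equals the gold letter.

    A response with no extractable letter counts as incorrect (assumption).
    """
    hits = sum(extract_letter(r) == g for r, g in zip(responses, gold))
    return hits / len(gold)


if __name__ == "__main__":
    demo = [
        "The force is mass times acceleration, so 2 * 3 = 6 N.\nAnswer: C",
        "I think it is probably the third option.",  # no Answer line
    ]
    print([extract_letter(r) for r in demo])
    print(accuracy(demo, ["C", "C"]))
```

Under these assumptions, a response that never emits the `Answer: $LETTER` line yields `None` and is scored as wrong, which is why models that ignore the output format lose accuracy regardless of their reasoning.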

## Evaluation configuration

| Hyperparameter    | Value                                           |
| ----------------- | ----------------------------------------------- |
| Benchmark / task  | `mmlu_pro` (`TIGER-Lab/MMLU-Pro`, `test` split) |
| n-shots           | `0` (zero-shot CoT)                             |
| Output type       | `GENERATIVE` (reason, then `Answer: $LETTER`)   |
| Metric            | `mmlu_pro_acc` (extracts the A–J letter)        |
| Max new tokens    | `1024`                                          |
| Precision (dtype) | `bfloat16` (Mill default)                       |
| Backend           | HuggingFace Transformers (`hf`)                 |

<Note>
  MMLU-Pro is designed for chain-of-thought answering and works best with instruction-tuned models; base models often fail to emit the `Answer: $LETTER` line and score near chance.
</Note>

## Reproduce

```bash
mill --output_dir ./results eval "<model>[dtype=bfloat16]" mmlu_pro

mill --output_dir ./results collect --metric mmlu_pro_acc
```

## Results

`Qwen/Qwen3-0.6B-Base`, 0-shot CoT `mmlu_pro_acc`:

<CardGroup cols={2}>
  <Card title="Mill" icon="gauge">
    **21.77%** ± 0.38 \
    measured
  </Card>

  <Card title="Chance" icon="dice">
    **\~10%** \
    10-option random
  </Card>
</CardGroup>

<Note>
  The [MMLU-Pro paper](https://arxiv.org/abs/2406.01574) does not report a baseline for a model this small, so Mill's number is recorded as an **initial baseline** rather than a reproduction. The score should also be read with the base-model caveat in mind: base models often fail to emit the `Answer: $LETTER` line. Instruction-tuned models score substantially higher; add them to the table below as they are evaluated.
</Note>
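To put the ± figure and the chance level in context, the sketch below computes a binomial standard error for the measured accuracy. It rests on assumptions that this page does not state: that the ± value is a standard error in percentage points, that the question count is roughly 12,000 (from the ~12K figure above), and that chance is exactly 10%. The function name `binomial_stderr` is made up for the example.

```python
import math


def binomial_stderr(acc: float, n: int) -> float:
    """Standard error of an accuracy estimated from n independent questions."""
    return math.sqrt(acc * (1.0 - acc) / n)


acc = 0.2177    # measured mmlu_pro_acc for Qwen/Qwen3-0.6B-Base
n = 12_000      # approximate question count (assumption based on "~12K")
chance = 0.10   # 10-option random guessing (assumption: every question has 10 options)

se = binomial_stderr(acc, n)
z = (acc - chance) / se  # gap over chance, in units of standard error

print(f"stderr: {se * 100:.2f} percentage points")
print(f"gap over chance: {z:.1f} standard errors")
```

Under these assumptions the standard error comes out at about 0.38 percentage points, which is consistent with the figure reported above, and the measured score sits well outside sampling noise around the 10% chance level.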

### Per-model results

| Model                  | Mill (CoT `mmlu_pro_acc`) | Reported | Source                                                                  | Δ |
| ---------------------- | ------------------------- | -------- | ----------------------------------------------------------------------- | - |
| `Qwen/Qwen3-0.6B-Base` | **21.77%** ± 0.38         | —        | [MMLU-Pro paper](https://arxiv.org/abs/2406.01574) (no sub-1B baseline) | — |
