Answer: $LETTER, and the answer letter is regex-extracted from the response (mirroring lighteval’s mmlu_pro task).
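The extraction step can be sketched as follows. This is a minimal illustration, not Mill's or lighteval's actual code: the pattern `ANSWER_RE`, the helpers `extract_letter` and `accuracy`, and the choice to take the last match are all assumptions made for the example.

```python
import re
from typing import Optional, Sequence

# Hypothetical pattern (an assumption, not the harness's real regex):
# matches "Answer: C" or "Answer: (C)" where the letter is one of A-J
# and is not the first letter of a longer word such as "Because".
ANSWER_RE = re.compile(r"Answer:\s*\(?([A-J])\b")


def extract_letter(response: str) -> Optional[str]:
    """Return the last answer letter found in the response, or None."""
    matches = ANSWER_RE.findall(response)
    # Take the last match so a self-correction late in the reasoning wins.
    return matches[-1] if matches else None


def accuracy(responses: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of responses whose extracted letter equals the gold letter.

    A response with no extractable letter counts as incorrect.
    """
    correct = sum(extract_letter(r) == g for r, g in zip(responses, gold))
    return correct / len(gold)
```

Under this sketch a response with no `Answer:` line is scored as wrong, which is one way a base model that never emits the line ends up near chance or below it.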
Evaluation configuration
| Hyperparameter | Value |
|---|---|
| Benchmark / task | mmlu_pro (TIGER-Lab/MMLU-Pro, test split) |
| n-shots | 0 (zero-shot CoT) |
| Output type | GENERATIVE (reason, then Answer: $LETTER) |
| Metric | mmlu_pro_acc (extracts the A–J letter) |
| Max new tokens | 1024 |
| Precision (dtype) | bfloat16 (Mill default) |
| Backend | HuggingFace Transformers (hf) |
MMLU-Pro is designed for chain-of-thought answering and works best with instruction-tuned models; base models often fail to emit the Answer: $LETTER line and score near chance.

Reproduce
Results
Qwen/Qwen3-0.6B-Base, 0-shot CoT mmlu_pro_acc:

| Source | mmlu_pro_acc | Basis |
|---|---|---|
| Mill | 21.77% ± 0.38 | measured |
| Chance | ~10% | 10-option random |
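As a sanity check on these two numbers, the sketch below computes the 10-option chance level and a binomial standard error for the measured accuracy. It assumes the ± figure is a standard error of the mean and that the test split has 12,032 questions; neither is stated on this page, and `binomial_stderr` is a hypothetical helper.

```python
import math


def binomial_stderr(acc: float, n: int) -> float:
    """Standard error of an accuracy estimated from n independent questions."""
    return math.sqrt(acc * (1.0 - acc) / n)


# Assumption: size of the MMLU-Pro test split (not stated on this page).
N_QUESTIONS = 12032

acc = 0.2177       # measured mmlu_pro_acc from the results above
chance = 1 / 10    # uniform guessing over 10 options

se = binomial_stderr(acc, N_QUESTIONS)
gap_in_se = (acc - chance) / se  # how many standard errors above chance

print(f"stderr: {100 * se:.2f} percentage points")
print(f"chance: {100 * chance:.0f}%")
print(f"gap over chance: {gap_in_se:.1f} standard errors")
```

Under these assumptions the computed standard error rounds to 0.38 percentage points, which matches the reported interval, and the measured accuracy sits well above the chance level.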
The MMLU-Pro paper does not report a baseline for a model this small, and base models often fail to emit the Answer: $LETTER line, so Mill’s number is recorded as an initial baseline rather than a reproduction. Instruction-tuned models score substantially higher; add them to the table below as they are evaluated.

Per-model results
| Model | Mill (CoT mmlu_pro_acc) | Reported | Source | Δ |
|---|---|---|---|---|
| Qwen/Qwen3-0.6B-Base | 21.77% ± 0.38 | — | MMLU-Pro paper (no sub-1B baseline) | — |