MMLU-Pro - Mill

MMLU-Pro is a harder successor to MMLU: ~12K questions across 14 categories, each with up to 10 answer options. Mill runs it as a generative chain-of-thought task: the model reasons step by step and ends with `Answer: $LETTER`, and the answer letter is extracted from the response with a regex (mirroring lighteval’s `mmlu_pro` task).
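As a rough illustration of the extraction step, the sketch below pulls the final `Answer: $LETTER` line out of a response and scores it. The pattern, the choice to keep the last match, and the helper names `extract_letter` and `accuracy` are assumptions made for this example; they are not taken from Mill's or lighteval's source.

```python
import re
from typing import Optional, Sequence

# Hypothetical pattern (not Mill's actual regex): "Answer:" followed by a
# single letter A-J, optionally parenthesised, and not the start of a word.
ANSWER_RE = re.compile(r"Answer:\s*\(?([A-J])\)?(?![A-Za-z])")


def extract_letter(response: str) -> Optional[str]:
    """Return the last extracted answer letter, or None if none is found."""
    matches = ANSWER_RE.findall(response)
    return matches[-1] if matches else None


def accuracy(responses: Sequence[str], golds: Sequence[str]) -> float:
    """Fraction of responses whose extracted letter equals the gold letter.

    A response with no parseable answer line is scored as incorrect
    (an assumption of this sketch).
    """
    correct = sum(extract_letter(r) == g for r, g in zip(responses, golds))
    return correct / len(golds)
```

Keeping the last match rather than the first lets a model revise its answer mid-response; whether Mill does the same is not stated here.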

Evaluation configuration

Hyperparameter	Value
Benchmark / task	`mmlu_pro` (`TIGER-Lab/MMLU-Pro`, `test` split)
n-shots	`0` (zero-shot CoT)
Output type	`GENERATIVE` (reason, then `Answer: $LETTER`)
Metric	`mmlu_pro_acc` (extracts the A–J letter)
Max new tokens	`1024`
Precision (dtype)	`bfloat16` (Mill default)
Backend	HuggingFace Transformers (`hf`)
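To make the zero-shot CoT setup in the table concrete, here is a minimal sketch of how a question with up to 10 options could be turned into a prompt. The instruction wording and the `format_prompt` helper are hypothetical; the actual template used by Mill (or lighteval) may differ.

```python
from string import ascii_uppercase
from typing import List


def format_prompt(question: str, options: List[str]) -> str:
    """Build a zero-shot chain-of-thought prompt (illustrative template only)."""
    if not 2 <= len(options) <= 10:
        raise ValueError("MMLU-Pro questions have at most 10 options (A-J)")
    letters = ascii_uppercase[: len(options)]  # e.g. "ABCD" for 4 options
    option_lines = [f"{letter}) {text}" for letter, text in zip(letters, options)]
    return (
        "Answer the following multiple choice question. Think step by step, "
        "then finish with a line of the form 'Answer: $LETTER', "
        f"where LETTER is one of {letters}.\n\n"
        f"{question}\n\n" + "\n".join(option_lines)
    )
```

No few-shot examples are prepended, which matches the `0` n-shots setting above.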

MMLU-Pro is designed for chain-of-thought answering and works best with instruction-tuned models; base models often fail to emit the `Answer: $LETTER` line and can score close to chance.

Reproduce

mill --output_dir ./results eval "<model>[dtype=bfloat16]" mmlu_pro

mill --output_dir ./results collect --metric mmlu_pro_acc

Results

Qwen/Qwen3-0.6B-Base, 0-shot CoT mmlu_pro_acc:

Source	Accuracy	Note
Mill	21.77% ± 0.38	measured
Chance	~10%	10-option random

The MMLU-Pro paper does not report a baseline for a model this small, and base models often fail to emit the `Answer: $LETTER` line, so Mill’s number is recorded as an initial baseline rather than a reproduction of a published result. Instruction-tuned models score substantially higher; add them to the table below as they are evaluated.
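The ± 0.38 reported above is consistent with a binomial standard error on the accuracy. The sketch below reproduces that arithmetic under two assumptions that are not stated in this page: that the error bar is a plain binomial standard error, and that the test split contains 12,032 questions (the "~12K" mentioned earlier).

```python
import math


def binomial_stderr(p: float, n: int) -> float:
    """Standard error of a proportion p estimated from n independent trials."""
    return math.sqrt(p * (1.0 - p) / n)


p = 0.2177   # measured accuracy from the results above
n = 12032    # assumed size of the MMLU-Pro test split
se = binomial_stderr(p, n)

print(f"{100 * p:.2f}% ± {100 * se:.2f}")  # prints "21.77% ± 0.38"
```

If Mill instead derives the error bar by bootstrapping or per-category averaging, the value could differ slightly from this estimate.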

Per-model results

Model	Mill (CoT `mmlu_pro_acc`)	Reported	Source	Δ
`Qwen/Qwen3-0.6B-Base`	21.77% ± 0.38	—	MMLU-Pro paper (no sub-1B baseline)	—
