MMLU-Pro is a harder successor to MMLU: ~12K questions across 14 categories with up to 10 answer options each. Mill runs it as a generative chain-of-thought task: the model reasons step by step and ends with Answer: $LETTER, and the answer letter is then regex-extracted from the response (mirroring lighteval’s mmlu_pro task).
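The extraction step described above can be sketched in a few lines of Python. This is a minimal illustration, not Mill's or lighteval's actual implementation: the regex, the choice to take the last match, and the helper names `extract_answer_letter` and `accuracy` are all assumptions made for this example.

```python
import re
from typing import Optional, Sequence

# Assumed pattern: "Answer:" followed by a single capital letter A to J.
# The trailing \b rejects words such as "Answer: Because".
ANSWER_RE = re.compile(r"Answer:\s*([A-J])\b")


def extract_answer_letter(response: str) -> Optional[str]:
    """Return the last 'Answer: X' letter in the response, or None if absent."""
    matches = ANSWER_RE.findall(response)
    # Take the last match so that a revised answer at the end of the
    # reasoning overrides any earlier mention.
    return matches[-1] if matches else None


def accuracy(responses: Sequence[str], gold_letters: Sequence[str]) -> float:
    """Fraction of responses whose extracted letter equals the gold letter.

    A response with no extractable letter counts as incorrect.
    """
    correct = sum(
        extract_answer_letter(resp) == gold
        for resp, gold in zip(responses, gold_letters)
    )
    return correct / len(gold_letters)


if __name__ == "__main__":
    print(extract_answer_letter("Let's think step by step...\nAnswer: C"))  # C
    print(extract_answer_letter("The result is probably option B."))        # None
```

Under this sketch, a response that never emits the Answer: $LETTER line is scored as wrong rather than as a random guess, so a model that rarely follows the format can land below the 10-option chance level.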

Evaluation configuration

| Hyperparameter | Value |
| --- | --- |
| Benchmark / task | mmlu_pro (TIGER-Lab/MMLU-Pro, test split) |
| n-shots | 0 (zero-shot CoT) |
| Output type | GENERATIVE (reason, then Answer: $LETTER) |
| Metric | mmlu_pro_acc (extracts the A–J letter) |
| Max new tokens | 1024 |
| Precision (dtype) | bfloat16 (Mill default) |
| Backend | HuggingFace Transformers (hf) |

MMLU-Pro is designed for chain-of-thought answering and works best with instruction-tuned models; base models often fail to emit the Answer: $LETTER line and score near chance.

Reproduce

mill --output_dir ./results eval "<model>[dtype=bfloat16]" mmlu_pro

mill --output_dir ./results collect --metric mmlu_pro_acc

Results

Qwen/Qwen3-0.6B-Base, 0-shot CoT mmlu_pro_acc:

Mill: 21.77% ± 0.38 (measured)
Chance: ~10% (10-option random)

The MMLU-Pro paper does not report a baseline for a model this small, and base models often fail to emit the Answer: $LETTER line, so Mill’s number is recorded as an initial baseline rather than a reproduction. Instruction-tuned models score substantially higher; add them to the table below as they are evaluated.
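The source does not say how the ± 0.38 is computed. One plausible reading, stated here as an assumption, is a binomial standard error of the accuracy over the test set. The sketch below checks that reading using the rounded question count of 12,000 (the text only says ~12K); `binomial_stderr` is a hypothetical helper, not part of Mill.

```python
import math


def binomial_stderr(accuracy: float, n_questions: int) -> float:
    """Standard error of a proportion: sqrt(p * (1 - p) / n)."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_questions)


if __name__ == "__main__":
    acc = 0.2177          # measured mmlu_pro_acc from the table
    n = 12_000            # assumption: "~12K questions", rounded
    se = binomial_stderr(acc, n)
    print(f"{acc:.2%} ± {se * 100:.2f}")  # 21.77% ± 0.38
```

With these inputs the formula gives roughly 0.38 percentage points, which matches the reported interval, so the measured score sits many standard errors above the ~10% chance level.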

Per-model results

| Model | Mill (CoT mmlu_pro_acc) | Reported | Source | Δ |
| --- | --- | --- | --- | --- |
| Qwen/Qwen3-0.6B-Base | 21.77% ± 0.38 | | MMLU-Pro paper (no sub-1B baseline) | |