MMLU - Mill

Mill runs MMLU as a 5-shot multiple-choice task. Each answer option is scored by its log-probability (the `LOGPROBS` output type), and `acc` is the fraction of questions where the gold option ranks highest. The two settings that matter most for matching published numbers are the 5-shot prompt and log-probability (rather than generative) scoring.
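
The scoring rule described above can be sketched in a few lines of Python. This is a minimal illustration, not Mill's implementation: `pick_option` and `accuracy` are hypothetical helper names, the log-probabilities are made-up toy values, and breaking ties in favor of the first option is an assumption.

```python
from typing import Sequence


def pick_option(option_logprobs: Sequence[float]) -> int:
    """Return the index of the option with the highest log-probability.

    Assumption: ties go to the earliest option (behavior of max() here).
    """
    return max(range(len(option_logprobs)), key=lambda i: option_logprobs[i])


def accuracy(
    per_question_logprobs: Sequence[Sequence[float]],
    gold_indices: Sequence[int],
) -> float:
    """Fraction of questions where the gold option ranks highest."""
    correct = sum(
        pick_option(logprobs) == gold
        for logprobs, gold in zip(per_question_logprobs, gold_indices)
    )
    return correct / len(gold_indices)


# Toy data (hypothetical): 3 questions, 4 options each (A-D).
toy_logprobs = [
    [-1.2, -0.3, -2.5, -3.0],  # highest: B (index 1)
    [-0.9, -1.1, -0.4, -2.2],  # highest: C (index 2)
    [-0.5, -1.7, -1.9, -2.4],  # highest: A (index 0)
]
toy_gold = [1, 2, 3]  # the third question is answered incorrectly

toy_acc = accuracy(toy_logprobs, toy_gold)  # 2 of 3 correct
```

Because only the ranking of the options matters, no text is generated and no answer parsing is needed, which is one reason log-probability scoring and generative scoring can give different numbers for the same model.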

Evaluation configuration

Hyperparameter	Value
Benchmark / task	`mmlu` (57 subtasks, averaged)
n-shots	`5`
Output type	`LOGPROBS` (rank options by log-probability)
Metric	`acc`
Precision (dtype)	`bfloat16` (Mill default)
Backend	HuggingFace Transformers (`hf`)

Reproduce

mill --output_dir ./results eval "Qwen/Qwen3-0.6B-Base[dtype=bfloat16]" mmlu

mill --output_dir ./results collect --metric acc

Results

Qwen/Qwen3-0.6B-Base, 5-shot acc:

Source	5-shot `acc`	Note
Mill	53.78% ± 1.53	measured
Qwen3 report	52.81%	published baseline
Difference	+0.97	within 1σ

Mill reproduces the Qwen3 Technical Report MMLU score for `Qwen/Qwen3-0.6B-Base` (52.81%) within one standard error: the measured 53.78% ± 1.53 is 0.97 percentage points above the published value.
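
The "within one standard error" claim is simple arithmetic on the numbers above, sketched here. The variable names are illustrative, and the published score is treated as exact because no uncertainty is quoted for it here, which is an assumption of this check.

```python
# Values taken from the results above (percent and percentage points).
mill_acc = 53.78       # Mill, 5-shot acc
mill_stderr = 1.53     # standard error reported by Mill
reported_acc = 52.81   # Qwen3 Technical Report

delta = mill_acc - reported_acc           # difference in percentage points
z = delta / mill_stderr                   # difference in units of standard error
within_one_se = abs(delta) <= mill_stderr

print(f"delta = {delta:+.2f} pp, {z:.2f} standard errors")
# prints: delta = +0.97 pp, 0.63 standard errors
```

A difference of about 0.63 standard errors is well inside the 1σ band, so the two numbers are statistically indistinguishable at this sample size.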

Per-model results

Model	Mill (5-shot `acc`)	Reported	Source	Δ
`Qwen/Qwen3-0.6B-Base`	53.78% ± 1.53	52.81%	Qwen3 Technical Report	+0.97

Other models measured with Mill (5-shot `acc`). Published MMLU numbers for these models vary by evaluation protocol across sources, so the values below are recorded as Mill measurements only, pending a like-for-like comparison:

Model	Mill (5-shot `acc`)
`Qwen/Qwen3-1.7B-Base`	65.10% ± 1.79
`Qwen/Qwen3-1.7B`	61.14% ± 1.70
`Qwen/Qwen3-VL-2B-Instruct`	63.75% ± 1.77

Per-subtask breakdown (57 tasks)

Subtask	`acc` (%)	± (std. error)
`mmlu_abstract_algebra`	35.00	3.38
`mmlu_anatomy`	44.44	3.03
`mmlu_astronomy`	50.66	2.87
`mmlu_business_ethics`	56.00	3.52
`mmlu_clinical_knowledge`	54.34	2.17
`mmlu_college_biology`	56.25	2.93
`mmlu_college_chemistry`	46.00	3.53
`mmlu_college_computer_science`	45.00	3.53
`mmlu_college_mathematics`	41.00	3.49
`mmlu_college_medicine`	52.02	2.69
`mmlu_college_physics`	34.31	3.33
`mmlu_computer_security`	72.00	3.18
`mmlu_conceptual_physics`	50.21	2.31
`mmlu_econometrics`	41.23	3.27
`mmlu_electrical_engineering`	58.62	2.90
`mmlu_elementary_mathematics`	45.77	1.81
`mmlu_formal_logic`	41.27	3.11
`mmlu_global_facts`	39.00	3.46
`mmlu_high_school_biology`	70.32	1.84
`mmlu_high_school_chemistry`	50.25	2.48
`mmlu_high_school_computer_science`	61.00	3.46
`mmlu_high_school_european_history`	66.06	2.61
`mmlu_high_school_geography`	64.65	2.41
`mmlu_high_school_government_and_politics`	65.28	2.43
`mmlu_high_school_macroeconomics`	54.62	1.78
`mmlu_high_school_mathematics`	35.93	2.07
`mmlu_high_school_microeconomics`	58.40	2.26
`mmlu_high_school_physics`	37.09	2.78
`mmlu_high_school_psychology`	74.68	1.32
`mmlu_high_school_statistics`	51.39	3.41
`mmlu_high_school_us_history`	55.39	3.49
`mmlu_high_school_world_history`	64.98	3.11
`mmlu_human_aging`	53.81	3.35
`mmlu_human_sexuality`	65.65	4.16
`mmlu_international_law`	62.81	4.41
`mmlu_jurisprudence`	61.11	4.71
`mmlu_logical_fallacies`	60.74	3.84
`mmlu_machine_learning`	42.86	4.70
`mmlu_management`	67.96	4.62
`mmlu_marketing`	79.49	2.65
`mmlu_medical_genetics`	55.00	5.00
`mmlu_miscellaneous`	61.05	1.74
`mmlu_moral_disputes`	58.09	2.66
`mmlu_moral_scenarios`	23.69	1.42
`mmlu_nutrition`	59.48	2.81
`mmlu_philosophy`	55.63	2.82
`mmlu_prehistory`	53.40	2.78
`mmlu_professional_accounting`	37.23	2.88
`mmlu_professional_law`	34.88	1.22
`mmlu_professional_medicine`	53.68	3.03
`mmlu_professional_psychology`	50.33	2.02
`mmlu_public_relations`	60.91	4.67
`mmlu_security_studies`	61.63	3.11
`mmlu_sociology`	69.15	3.27
`mmlu_us_foreign_policy`	62.00	4.88
`mmlu_virology`	46.39	3.88
`mmlu_world_religions`	55.56	3.81
