MMLU is run as a 5-shot multiple-choice task. Each answer option is scored by its log-probability (LOGPROBS output type) and acc is the fraction of questions where the gold option ranks highest. The two hyperparameters that matter most for matching published numbers are the 5-shot prompt and log-prob (rather than generative) scoring.
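
The scoring rule described above can be sketched in a few lines of Python. This is a minimal illustration and not Mill's implementation: the function names `predict` and `accuracy` are hypothetical, and the log-probabilities are made-up toy values that would come from the model in a real run.

```python
from typing import Sequence


def predict(option_logprobs: Sequence[float]) -> int:
    """Return the index of the answer option with the highest log-probability."""
    return max(range(len(option_logprobs)), key=lambda i: option_logprobs[i])


def accuracy(all_logprobs: Sequence[Sequence[float]], gold: Sequence[int]) -> float:
    """Fraction of questions where the gold option ranks highest."""
    if len(all_logprobs) != len(gold):
        raise ValueError("need exactly one gold index per question")
    correct = sum(predict(lp) == g for lp, g in zip(all_logprobs, gold))
    return correct / len(gold)


# Toy data (invented for illustration): three questions, four options each (A-D).
toy_logprobs = [
    [-1.2, -0.3, -2.5, -3.0],  # option B scores highest
    [-0.9, -1.1, -0.4, -2.2],  # option C scores highest
    [-0.2, -1.5, -1.7, -0.8],  # option A scores highest
]
toy_gold = [1, 2, 3]  # gold answers: B, C, D

toy_acc = accuracy(toy_logprobs, toy_gold)  # 2 of 3 questions are correct
print(f"acc = {toy_acc:.4f}")
```

Because the model never generates free text under this scheme, no answer parsing is needed, which is one reason log-prob and generative scoring can give different numbers for the same model.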

Evaluation configuration

| Hyperparameter | Value |
| --- | --- |
| Benchmark / task | mmlu (57 subtasks, averaged) |
| n-shots | 5 |
| Output type | LOGPROBS (rank options by log-probability) |
| Metric | acc |
| Precision (dtype) | bfloat16 (Mill default) |
| Backend | HuggingFace Transformers (hf) |
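
The configuration says the 57 subtask scores are averaged but does not say whether the average is unweighted across subtasks (macro) or weighted by the number of questions in each subtask (micro). The sketch below shows both under stated assumptions: the helper names, subtask names, and question counts are hypothetical placeholders, not Mill output. The two conventions can differ noticeably when subtask sizes are uneven, so it is worth checking which one a published number uses.

```python
from typing import Dict


def macro_average(acc: Dict[str, float]) -> float:
    """Unweighted mean: every subtask counts equally."""
    return sum(acc.values()) / len(acc)


def micro_average(acc: Dict[str, float], n_questions: Dict[str, int]) -> float:
    """Mean weighted by subtask size: every question counts equally."""
    total = sum(n_questions[name] for name in acc)
    return sum(acc[name] * n_questions[name] for name in acc) / total


# Placeholder values for illustration only.
toy_acc = {"task_a": 0.50, "task_b": 0.80}
toy_sizes = {"task_a": 100, "task_b": 300}

toy_macro = macro_average(toy_acc)             # (0.50 + 0.80) / 2
toy_micro = micro_average(toy_acc, toy_sizes)  # (50 + 240) / 400
print(f"macro = {toy_macro:.3f}, micro = {toy_micro:.3f}")
```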

Reproduce

mill --output_dir ./results eval "Qwen/Qwen3-0.6B-Base[dtype=bfloat16]" mmlu

mill --output_dir ./results collect --metric acc

Results

Qwen/Qwen3-0.6B-Base, 5-shot acc:

Mill (measured): 53.78% ± 1.53
Qwen3 report (published baseline): 52.81%
Difference: +0.97 (within 1σ)

Mill reproduces the Qwen3 Technical Report MMLU score for Qwen/Qwen3-0.6B-Base (52.81%) within one standard error.
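
The "within one standard error" claim is simple arithmetic on the numbers above, sketched here in Python. The helper `within_n_sigma` is a hypothetical name, and the check assumes the published figure is exact, with all uncertainty attributed to the Mill measurement.

```python
def within_n_sigma(measured: float, reported: float, stderr: float,
                   n_sigma: float = 1.0) -> bool:
    """True if |measured - reported| is at most n_sigma standard errors."""
    return abs(measured - reported) <= n_sigma * stderr


# Values from the results above, in percentage points.
mill_acc = 53.78
mill_stderr = 1.53
reported_acc = 52.81

delta = mill_acc - reported_acc
matches = within_n_sigma(mill_acc, reported_acc, mill_stderr)
print(f"delta = {delta:+.2f}, within 1 sigma: {matches}")
```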

Per-model results

| Model | Mill (5-shot acc) | Reported | Source | Δ |
| --- | --- | --- | --- | --- |
| Qwen/Qwen3-0.6B-Base | 53.78% ± 1.53 | 52.81% | Qwen3 Technical Report | +0.97 |

Other models measured with Mill (5-shot acc). Published MMLU numbers for these vary by eval protocol across sources, so they are recorded as Mill measurements pending a matched, like-for-like comparison:

| Model | Mill (5-shot acc) |
| --- | --- |
| Qwen/Qwen3-1.7B-Base | 65.10% ± 1.79 |
| Qwen/Qwen3-1.7B | 61.14% ± 1.70 |
| Qwen/Qwen3-VL-2B-Instruct | 63.75% ± 1.77 |

| Subtask | Acc (%) | ± |
| --- | --- | --- |
| mmlu_abstract_algebra | 35.00 | 3.38 |
| mmlu_anatomy | 44.44 | 3.03 |
| mmlu_astronomy | 50.66 | 2.87 |
| mmlu_business_ethics | 56.00 | 3.52 |
| mmlu_clinical_knowledge | 54.34 | 2.17 |
| mmlu_college_biology | 56.25 | 2.93 |
| mmlu_college_chemistry | 46.00 | 3.53 |
| mmlu_college_computer_science | 45.00 | 3.53 |
| mmlu_college_mathematics | 41.00 | 3.49 |
| mmlu_college_medicine | 52.02 | 2.69 |
| mmlu_college_physics | 34.31 | 3.33 |
| mmlu_computer_security | 72.00 | 3.18 |
| mmlu_conceptual_physics | 50.21 | 2.31 |
| mmlu_econometrics | 41.23 | 3.27 |
| mmlu_electrical_engineering | 58.62 | 2.90 |
| mmlu_elementary_mathematics | 45.77 | 1.81 |
| mmlu_formal_logic | 41.27 | 3.11 |
| mmlu_global_facts | 39.00 | 3.46 |
| mmlu_high_school_biology | 70.32 | 1.84 |
| mmlu_high_school_chemistry | 50.25 | 2.48 |
| mmlu_high_school_computer_science | 61.00 | 3.46 |
| mmlu_high_school_european_history | 66.06 | 2.61 |
| mmlu_high_school_geography | 64.65 | 2.41 |
| mmlu_high_school_government_and_politics | 65.28 | 2.43 |
| mmlu_high_school_macroeconomics | 54.62 | 1.78 |
| mmlu_high_school_mathematics | 35.93 | 2.07 |
| mmlu_high_school_microeconomics | 58.40 | 2.26 |
| mmlu_high_school_physics | 37.09 | 2.78 |
| mmlu_high_school_psychology | 74.68 | 1.32 |
| mmlu_high_school_statistics | 51.39 | 3.41 |
| mmlu_high_school_us_history | 55.39 | 3.49 |
| mmlu_high_school_world_history | 64.98 | 3.11 |
| mmlu_human_aging | 53.81 | 3.35 |
| mmlu_human_sexuality | 65.65 | 4.16 |
| mmlu_international_law | 62.81 | 4.41 |
| mmlu_jurisprudence | 61.11 | 4.71 |
| mmlu_logical_fallacies | 60.74 | 3.84 |
| mmlu_machine_learning | 42.86 | 4.70 |
| mmlu_management | 67.96 | 4.62 |
| mmlu_marketing | 79.49 | 2.65 |
| mmlu_medical_genetics | 55.00 | 5.00 |
| mmlu_miscellaneous | 61.05 | 1.74 |
| mmlu_moral_disputes | 58.09 | 2.66 |
| mmlu_moral_scenarios | 23.69 | 1.42 |
| mmlu_nutrition | 59.48 | 2.81 |
| mmlu_philosophy | 55.63 | 2.82 |
| mmlu_prehistory | 53.40 | 2.78 |
| mmlu_professional_accounting | 37.23 | 2.88 |
| mmlu_professional_law | 34.88 | 1.22 |
| mmlu_professional_medicine | 53.68 | 3.03 |
| mmlu_professional_psychology | 50.33 | 2.02 |
| mmlu_public_relations | 60.91 | 4.67 |
| mmlu_security_studies | 61.63 | 3.11 |
| mmlu_sociology | 69.15 | 3.27 |
| mmlu_us_foreign_policy | 62.00 | 4.88 |
| mmlu_virology | 46.39 | 3.88 |
| mmlu_world_religions | 55.56 | 3.81 |