standard (10 options) config in two renderings of the same data:
- `mmmu_pro`: generative chain-of-thought for vision-language models. The model sees the image(s) and lettered options, reasons step by step, and ends with `Answer: $LETTER`, which is extracted and graded by a faithful port of the official MMMU-Pro grader.
- `mmmu_pro_clip`: a CLIP zero-shot rendering for image–text encoders. Each option is scored as `question + option` against the (first) image by image–text similarity. CLIP sees one image and truncates to its 77-token context, so this is a deliberately weak read; it exists for CLIP-family coverage, not as a strong solver.
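As a rough illustration of the `Answer: $LETTER` extraction step, the sketch below pulls the final answer letter out of a chain-of-thought response. It is not the official MMMU-Pro grader or Mill's port of it; the function name `extract_letter`, the regular expression, and the "last match wins" rule are all assumptions made for this example.

```python
import re
from typing import Optional

# Hypothetical pattern: "Answer:" followed by a single letter A-J
# (10 options), optionally wrapped in parentheses. The lookahead rejects
# words such as "Cat" that merely start with an option letter.
_ANSWER_RE = re.compile(r"Answer:\s*\(?([A-J])(?![A-Za-z])")


def extract_letter(response: str) -> Optional[str]:
    """Return the last 'Answer: X' letter in the response, or None."""
    matches = _ANSWER_RE.findall(response)
    return matches[-1] if matches else None


def is_correct(response: str, gold: str) -> bool:
    """Grade one sample: exact match between extracted and gold letter."""
    return extract_letter(response) == gold
```

Taking the last match means a model that revises itself mid-response is graded on its final stated answer; a response with no parseable answer is simply counted as wrong.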
Evaluation configuration
| Hyperparameter | Value |
|---|---|
| Benchmark | mmmu_pro (auto-picks mmmu_pro for VLMs, mmmu_pro_clip for CLIP) |
| Dataset | MMMU/MMMU_Pro, standard (10 options) (test, ~1,730 samples) |
| n-shots | 0 (zero-shot CoT) |
| Task type | MULTIPLE_CHOICE generative (VLM) / ZERO_SHOT_CLASSIFICATION (CLIP) |
| Metric | mmmu_pro_acc / acc |
| Max new tokens | 1024 (VLM CoT) |
| Backend | HuggingFace VLM (hf) / open_clip (clip) |
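The ZERO_SHOT_CLASSIFICATION task type in the table reduces to picking the option whose text embedding is most similar to the image embedding. The sketch below shows only that selection step on precomputed vectors, assuming cosine similarity as the score. It does not call open_clip, and the names `cosine` and `pick_option` are hypothetical, not Mill's API.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pick_option(image_emb: Sequence[float],
                option_embs: List[Sequence[float]]) -> str:
    """Return the letter of the option most similar to the image.

    Each entry of option_embs is assumed to be the text embedding of
    'question + option' for one lettered option, in order A, B, C, ...
    """
    scores = [cosine(image_emb, emb) for emb in option_embs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return chr(ord("A") + best)
```

In a real run the embeddings would come from the image and text towers of the CLIP model; here they are plain lists of floats so the selection logic can be read in isolation.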
Reproduce
Results
Qwen3-VL-2B-Instruct
31.91% ± 1.12
generative CoT
CLIP — ViT-B-32 (laion2b)
17.86% ± 0.92
zero-shot (near-chance)
There is no widely published MMMU-Pro standard number for these exact checkpoints, so Mill’s measurements are recorded as the initial baselines. For context, the similarly sized Qwen2.5-VL-3B reports ≈31.6% on MMMU-Pro, consistent with Mill’s Qwen3-VL-2B result. The CLIP rendering sits just above the 10-option chance rate (10%), as expected for a single-image, 77-token similarity read.
Per-model results
| Model | Rendering | Mill (mmmu_pro_acc / acc) | Source |
|---|---|---|---|
| Qwen/Qwen3-VL-2B-Instruct | Generative CoT | 31.91% ± 1.12 | Mill measurement (initial baseline) |
| ViT-B-32/laion2b_s34b_b79k | CLIP zero-shot | 17.86% ± 0.92 | Mill measurement (CLIP-family coverage) |
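The ± values in the table are consistent with a binomial standard error, sqrt(p(1 − p)/n), if the "~1,730 samples" figure is taken as exactly n = 1730. This section does not state which estimator Mill actually uses, so the sketch below is only a sanity check under that assumption; the helper name `binomial_stderr_pct` is hypothetical.

```python
import math


def binomial_stderr_pct(acc_pct: float, n: int) -> float:
    """Standard error of an accuracy, in percentage points.

    Treats each sample as an independent Bernoulli trial with success
    probability p = acc_pct / 100.
    """
    p = acc_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)


# Assumption: the approximate sample count is used as the exact n.
N_SAMPLES = 1730

for name, acc in [
    ("Qwen/Qwen3-VL-2B-Instruct", 31.91),
    ("ViT-B-32/laion2b_s34b_b79k", 17.86),
]:
    stderr = binomial_stderr_pct(acc, N_SAMPLES)
    print(f"{name}: {acc:.2f}% ± {stderr:.2f}")
```

Under this assumption the computed values round to 1.12 and 0.92, matching the reported figures.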