MMMU-Pro is a robust, college-level multimodal benchmark: questions carry up to 7 images and 10 answer options (A–J). Mill runs the standard (10 options) config in two renderings of the same data:
  • mmmu_pro: generative chain-of-thought for vision-language models. The model sees the image(s) and lettered options, reasons step by step, and ends with Answer: $LETTER, which is extracted and graded by a faithful port of the official MMMU-Pro grader.
  • mmmu_pro_clip: a CLIP zero-shot rendering for image–text encoders. Each option is scored as question + option against the first image by image–text similarity. CLIP sees one image and truncates text to its 77-token context, so this is a deliberately weak read; it exists for CLIP-family coverage, not as a strong solver.
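As a rough illustration of the "ends with Answer: $LETTER" convention, the sketch below pulls the final letter out of a chain-of-thought response. It is not Mill's code or the official MMMU-Pro grader; the names `ANSWER_RE`, `extract_letter`, and `is_correct` are hypothetical, and the exact matching rules (last match wins, optional parentheses, case-sensitive letters) are assumptions.

```python
import re
from typing import Optional

# Hypothetical pattern: "Answer:" followed by a single option letter A-J,
# optionally wrapped in parentheses. The trailing \b rejects words such as
# "Because" that merely start with a valid letter.
ANSWER_RE = re.compile(r"Answer:\s*\(?([A-J])\b")


def extract_letter(response: str) -> Optional[str]:
    """Return the letter from the last 'Answer: X' in the response, or None."""
    matches = ANSWER_RE.findall(response)
    return matches[-1] if matches else None


def is_correct(response: str, gold: str) -> bool:
    """Grade a response against the gold letter; unparseable responses are wrong."""
    return extract_letter(response) == gold.upper()
```

Taking the last match is one plausible way to handle a model that revises its answer mid-reasoning; a real grader may apply additional fallbacks when no such line is present.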

Evaluation configuration

| Hyperparameter | Value |
| --- | --- |
| Benchmark | mmmu_pro (auto-picks mmmu_pro for VLMs, mmmu_pro_clip for CLIP) |
| Dataset | MMMU/MMMU_Pro, standard (10 options) (test, ~1,730 samples) |
| n-shots | 0 (zero-shot CoT) |
| Task type | MULTIPLE_CHOICE generative (VLM) / ZERO_SHOT_CLASSIFICATION (CLIP) |
| Metric | mmmu_pro_acc / acc |
| Max new tokens | 1024 (VLM CoT) |
| Backend | HuggingFace VLM (hf) / open_clip (clip) |
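The ZERO_SHOT_CLASSIFICATION path for CLIP can be pictured as an argmax over image–text similarities. The sketch below is a minimal, encoder-free illustration: it assumes the image and the "question + option" texts have already been embedded (for example by an open_clip model), and uses toy vectors instead. The helper names `build_prompts`, `cosine`, and `pick_option`, and the space-joined prompt format, are assumptions rather than Mill's implementation.

```python
import math
from typing import List, Sequence


def build_prompts(question: str, options: Sequence[str]) -> List[str]:
    """One text per option: question followed by the option (join format assumed)."""
    return [f"{question} {option}" for option in options]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def pick_option(image_emb: Sequence[float],
                option_embs: Sequence[Sequence[float]]) -> int:
    """Index of the option whose text embedding is most similar to the image."""
    scores = [cosine(image_emb, emb) for emb in option_embs]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy embeddings standing in for real encoder outputs.
image = [1.0, 0.0]
options = [[0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
predicted = pick_option(image, options)  # index 1 has the highest similarity
```

Because the text side is limited to 77 tokens, a long question can crowd out the option text entirely, which is one reason this rendering is treated as a weak read.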

Reproduce

```shell
# Vision-language model (generative chain-of-thought)
mill --output_dir ./results eval \
  "Qwen/Qwen3-VL-2B-Instruct[dtype=bfloat16]" mmmu_pro

# CLIP zero-shot rendering
mill --output_dir ./results eval \
  "clip[path=ViT-B-32,pretrained=laion2b_s34b_b79k]" mmmu_pro

mill --output_dir ./results collect --metric mmmu_pro_acc
```

Results

  • Qwen3-VL-2B-Instruct: 31.91% ± 1.12 (generative CoT)
  • CLIP ViT-B-32 (laion2b): 17.86% ± 0.92 (zero-shot, near-chance)
There is no widely published MMMU-Pro standard number for these exact checkpoints, so Mill’s measurements are recorded as the initial baselines. For context, the similarly sized Qwen2.5-VL-3B reports ≈31.6% on MMMU-Pro, consistent with Mill’s Qwen3-VL-2B result. The CLIP rendering sits about 8 points above the 10-option chance rate (10%), as expected for a single-image, 77-token similarity read.
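The ± values reported here are numerically consistent with a binomial standard error on roughly 1,730 samples. The source does not state how the interval is computed, so treat the following as an assumption-laden sanity check rather than a description of Mill's method; `binomial_se` is a hypothetical helper and `N` uses the approximate sample count.

```python
import math


def binomial_se(acc: float, n: int) -> float:
    """Standard error of an accuracy estimated from n independent samples."""
    return math.sqrt(acc * (1.0 - acc) / n)


N = 1730          # approximate test-set size ("~1,730 samples")
CHANCE = 1 / 10   # uniform guessing over 10 options

for name, acc in [("Qwen3-VL-2B-Instruct", 0.3191), ("CLIP ViT-B-32", 0.1786)]:
    se = binomial_se(acc, N)
    print(f"{name}: {100 * acc:.2f}% ± {100 * se:.2f} "
          f"({100 * (acc - CHANCE):.2f} points above chance)")
# Qwen3-VL-2B-Instruct: 31.91% ± 1.12 (21.91 points above chance)
# CLIP ViT-B-32: 17.86% ± 0.92 (7.86 points above chance)
```

Under this reading, the 0.3-point gap between the Qwen3-VL-2B measurement and the quoted Qwen2.5-VL-3B figure is well within one standard error.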

Per-model results

| Model | Rendering | Mill (mmmu_pro_acc / acc) | Source |
| --- | --- | --- | --- |
| Qwen/Qwen3-VL-2B-Instruct | Generative CoT | 31.91% ± 1.12 | Mill measurement (initial baseline) |
| ViT-B-32/laion2b_s34b_b79k | CLIP zero-shot | 17.86% ± 0.92 | Mill measurement (CLIP-family coverage) |