mathvision): the model sees the image and the question, reasons step by step, and puts its final answer in \boxed{}. The boxed answer is extracted and graded by a faithful port of the official MATH-Vision grader (evaluation/{utils,evaluate}.py) — case-insensitive exact match, tuple/list handling, and LaTeX-aware symbolic equivalence via latex2sympy2, compared against both the gold letter and the gold option text. A 304-sample mathvision_testmini task (the official balanced mini split, 19 problems per subject) is provided for fast iteration.
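The extract-then-match step described above can be illustrated with a minimal sketch. This is not the ported grader: it covers only boxed-answer extraction and case-insensitive matching against the gold letter and the gold option text, and it omits tuple/list handling and the latex2sympy2 symbolic comparison. The function names (`extract_boxed`, `normalise`, `is_correct`) are hypothetical and chosen for illustration.

```python
from typing import List, Optional


def extract_boxed(text: str) -> Optional[str]:
    # Return the contents of the last \boxed{...} in text, or None.
    # Braces are matched by depth so nested LaTeX such as \frac{1}{2} survives.
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    out = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out).strip()
        out.append(ch)
    return None  # unbalanced braces: treat as no answer


def normalise(s: str) -> str:
    # Assumption: a simplified normalisation (trim, drop $...$, lowercase).
    return s.strip().strip("$").strip().lower()


def is_correct(response: str, gold: str, options: Optional[List[str]] = None) -> bool:
    # Accept the gold answer itself and, for multiple choice, the option text too.
    pred = extract_boxed(response)
    if pred is None:
        return False
    targets = {normalise(gold)}
    letter = gold.strip().upper()
    if options and len(letter) == 1 and "A" <= letter <= "Z":
        index = ord(letter) - ord("A")
        if index < len(options):
            targets.add(normalise(options[index]))
    return normalise(pred) in targets
```

With options `["3", "4", "5"]` and gold letter `B`, both `\boxed{B}` and `\boxed{4}` are accepted in this sketch, which mirrors the "gold letter and gold option text" comparison described above.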
Grader fidelity is exact. Replaying the official repo’s own committed model outputs through Mill’s ported grader reproduces the official per-sample labels with 0 mismatches across all 3,040 test samples (GPT-4V-CoT → 23.98%, Qwen-VL-Max-CoT → 12.63%, LLaVA-1.5-13B → 11.12%; each matches the published number). Without the optional latex2sympy2 extra, only 5 of the 3,040 per-sample labels change, because letters and plain numbers are graded through string normalisation.

Evaluation configuration

| Hyperparameter | Value |
|---|---|
| Benchmark | mathvision (full test, 3,040) / mathvision_testmini (304) |
| Dataset | MathLLMs/MathVision (test / testmini) |
| n-shots | 0 (zero-shot, boxed chain-of-thought) |
| Task type | GENERATIVE_QA (boxed CoT), vision-language |
| Metric | mathvision_acc (official LaTeX-aware answer matching) |
| Max new tokens | 2048 |
| Backend | HuggingFace VLM (hf) |
Reproduce
Results
| | Score | Setting |
|---|---|---|
| Mill | 22.04% ± 2.38 | Qwen2.5-VL-7B, testmini (304) |
| Reported | ≈25% | Qwen2.5-VL-7B, full test |
| Difference | ≈3 pts | within sampling + harness spread |

Mill’s mathvision_acc is graded by an exact port of the official metric, so the score reflects the model and prompt, not the grader. Mill measures 22.04% ± 2.38% for Qwen2.5-VL-7B on the official testmini (304) split. The published full-test number is ≈25% (Qwen2.5-VL Technical Report; reported values span ~24–26% across harnesses depending on prompt and image resolution). Because testmini is a 304-sample subset, its score carries a sampling standard error of about ±2.4 pp on top of that spread, so the two numbers are statistically consistent (a gap of ~1.2 SE). The full 3,040-sample mathvision task runs the same benchmark on the complete test set for those with more compute.

Per-model results

| Model | Split | Mill (mathvision_acc) | Reference | Source |
|---|---|---|---|---|
| Qwen/Qwen2.5-VL-7B-Instruct | testmini (304) | 22.04% ± 2.38 | ≈25% (full test) | Qwen2.5-VL Tech Report |
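
The ± 2.38 figure and the ~1.2 SE gap quoted in the Results section can be checked with the binomial standard error, sqrt(p(1 − p) / n). The sketch below assumes 67 correct answers out of 304, a count inferred from the reported 22.04% and not stated in the source, and it treats the ≈25% reference as a fixed value with no uncertainty of its own.

```python
import math


def binomial_se(p: float, n: int) -> float:
    # Standard error of a proportion p estimated from n independent samples.
    return math.sqrt(p * (1.0 - p) / n)


n = 304          # testmini size
correct = 67     # assumption: inferred from 22.04% of 304
reference = 0.25 # approximate published full-test accuracy

acc = correct / n
se = binomial_se(acc, n)
gap_in_se = (reference - acc) / se

print(f"accuracy = {acc:.2%} ± {se:.2%}; gap to reference = {gap_in_se:.1f} SE")
```

A gap of roughly one standard error is well inside ordinary sampling noise, which is why the testmini score and the published full-test number are described as consistent.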