MATH-Vision (MATH-V) is a multimodal mathematical reasoning benchmark of 3,040 competition problems, each paired with a visual context (figure, diagram, or plot), spanning 16 subjects and 5 difficulty levels. Each problem is either multiple-choice (5 options, A–E) or free-form (a numeric or LaTeX answer). Mill runs it as a single generative task for vision-language models, mathvision: the model sees the image and the question, reasons step by step, and puts its final answer in \boxed{}.

The boxed answer is extracted and graded by a faithful port of the official MATH-Vision grader (evaluation/{utils,evaluate}.py). The grader applies case-insensitive exact match, tuple/list handling, and LaTeX-aware symbolic equivalence via latex2sympy2, and compares the prediction against both the gold letter and the gold option text. A 304-sample mathvision_testmini task (the official balanced mini split, 19 problems per subject) is provided for fast iteration.
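To make the extract-then-match flow concrete, here is a minimal sketch of boxed-answer extraction followed by a case-insensitive comparison against both the gold letter and the gold option text. It is not the official grader or Mill's port: the function names (extract_boxed, normalize, is_correct) are hypothetical, and tuple/list handling and symbolic equivalence are deliberately left out.

```python
def extract_boxed(text):
    # Return the contents of the last \boxed{...} in text, or None.
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    out = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                # Matching close brace found: everything collected is the answer.
                return "".join(out).strip()
        out.append(ch)
    return None  # unbalanced braces, no complete box


def normalize(ans):
    # Illustrative string normalisation only: drop $ signs and whitespace, lowercase.
    ans = ans.strip().strip("$")
    return "".join(ans.split()).lower()


def is_correct(response, gold, options=()):
    # Hypothetical grading rule: the boxed prediction may match the gold answer
    # itself or, for multiple-choice items, the text of the gold option.
    pred = extract_boxed(response)
    if pred is None:
        return False
    targets = {normalize(gold)}
    if options and len(gold) == 1 and gold in "ABCDE":
        idx = ord(gold) - ord("A")
        if idx < len(options):
            targets.add(normalize(options[idx]))
    return normalize(pred) in targets


options = ["1", "2", "3", "4", "5"]
print(is_correct(r"The area doubles, so \boxed{B}", "B", options))
print(is_correct(r"The area doubles, so \boxed{2}", "B", options))
```

Both calls grade the same underlying answer: the first matches the gold letter, the second matches the gold option text. A real grader needs the additional LaTeX-aware steps described above to treat forms such as \frac{1}{2} and 0.5 as equal.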
Grader fidelity is exact. Replaying the official repo’s own committed model outputs through Mill’s ported grader reproduces its per-sample labels with 0 mismatches across all 3,040 test samples, and each resulting accuracy matches the published number: GPT-4V-CoT 23.98%, Qwen-VL-Max-CoT 12.63%, LLaVA-1.5-13B 11.12%. Without the optional latex2sympy2 extra, only 5 of 3,040 samples grade differently, because letters and plain numbers are already graded through string normalisation.
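A replay check of this kind can be sketched as follows. This is an illustration under assumptions, not Mill's actual test: the record fields (id, response, gold, official_label), the toy grader, and the two sample records are all made up for the example.

```python
def find_mismatches(records, grade):
    # Re-grade every stored response and collect the ids where the
    # ported grader disagrees with the label stored by the official run.
    mismatches = []
    for rec in records:
        ours = grade(rec["response"], rec["gold"])
        if ours != rec["official_label"]:
            mismatches.append(rec["id"])
    return mismatches


def accuracy(records, grade):
    # Fraction of records the grader marks correct.
    correct = sum(1 for rec in records if grade(rec["response"], rec["gold"]))
    return correct / len(records)


def toy_grade(response, gold):
    # Stand-in grader (case-insensitive exact match) for the illustration.
    return response.strip().lower() == gold.strip().lower()


# Made-up records standing in for committed model outputs.
records = [
    {"id": "1", "response": "B", "gold": "b", "official_label": True},
    {"id": "2", "response": "3", "gold": "4", "official_label": False},
]

print(find_mismatches(records, toy_grade))
print(accuracy(records, toy_grade))
```

An empty mismatch list over every sample is the per-sample agreement described above; matching the aggregate accuracy alone would be a weaker check, since offsetting disagreements could cancel out.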

Evaluation configuration

| Hyperparameter | Value |
| --- | --- |
| Benchmark | mathvision (full test, 3,040) / mathvision_testmini (304) |
| Dataset | MathLLMs/MathVision (test / testmini) |
| n-shots | 0 (zero-shot, boxed chain-of-thought) |
| Task type | GENERATIVE_QA (boxed CoT), vision-language |
| Metric | mathvision_acc (official LaTeX-aware answer matching) |
| Max new tokens | 2048 |
| Backend | HuggingFace VLM (hf) |

Reproduce

# Optional: exact symbolic-equivalence grading (matches the official grader)
pip install 'mill-eval[mathvision]'

# Official 304-sample mini split (fast)
mill --output_dir ./results eval \
  "Qwen/Qwen2.5-VL-7B-Instruct[dtype=bfloat16,batch_size=4]" mathvision_testmini

# Full 3,040-sample test split (the canonical benchmark)
mill --output_dir ./results eval \
  "Qwen/Qwen2.5-VL-7B-Instruct[dtype=bfloat16,batch_size=4]" mathvision

mill --output_dir ./results collect --metric mathvision_acc

Results

Mill: 22.04% ± 2.38 (Qwen2.5-VL-7B, testmini (304))
Reported: ≈25% (Qwen2.5-VL-7B, full test)
Difference: ≈3 pts (within sampling + harness spread)

Mill’s mathvision_acc is graded by an exact port of the official metric, so the score reflects the model and prompt, not the grader. Mill measures 22.04% ± 2.38 for Qwen2.5-VL-7B on the official testmini (304) split. The published full-test number is ≈25% (Qwen2.5-VL Technical Report; reported values span ~24–26% across harnesses depending on prompt and image resolution). Because testmini is a 304-sample subset, its score carries a sampling standard error of about ±2.4 pp on top of that spread, so the ≈3-point gap is about 1.2 SE and the two numbers are statistically consistent. The full 3,040-sample mathvision task runs the complete benchmark for those with more compute.
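The ±2.38 figure and the roughly 1.2 SE gap can be checked with the usual binomial standard error, sqrt(p(1 - p) / n). The sketch below assumes 67 correct answers out of 304, a count inferred from 22.04% of 304 rather than stated in the results, and treats the ≈25% reference as a fixed value.

```python
import math


def binomial_se(p, n):
    # Standard error of a proportion p estimated from n independent samples.
    return math.sqrt(p * (1.0 - p) / n)


n = 304
n_correct = 67          # assumption: inferred from 22.04% of 304
p_hat = n_correct / n
se = binomial_se(p_hat, n)

reference = 0.25        # published full-test figure, treated as fixed
gap_in_se = (reference - p_hat) / se

print(f"accuracy = {100 * p_hat:.2f}%")
print(f"standard error = {100 * se:.2f} pp")
print(f"gap to reference = {gap_in_se:.2f} SE")
```

A gap well under 2 SE is what one would expect from sampling noise alone on a 304-sample split, before accounting for prompt and image-resolution differences between harnesses.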

Per-model results

| Model | Split | Mill (mathvision_acc) | Reference | Source |
| --- | --- | --- | --- | --- |
| Qwen/Qwen2.5-VL-7B-Instruct | testmini (304) | 22.04% ± 2.38 | ≈25% (full test) | Qwen2.5-VL Tech Report |