MATH-Vision

MATH-Vision (MATH-V) is a multimodal mathematical reasoning benchmark: 3,040 competition problems, each paired with a visual context (figure, diagram, or plot), spanning 16 subjects and 5 difficulty levels. Each problem is either multiple-choice (5 options, A–E) or free-form (a numeric/LaTeX answer).

Mill runs it as a single generative task for vision-language models (mathvision): the model sees the image and the question, reasons step by step, and puts its final answer in \boxed{}. The boxed answer is extracted and graded by a faithful port of the official MATH-Vision grader (evaluation/{utils,evaluate}.py), which applies case-insensitive exact match, tuple/list handling, and LaTeX-aware symbolic equivalence via latex2sympy2, and compares the prediction against both the gold letter and the gold option text.

A 304-sample mathvision_testmini task (the official balanced mini split, 19 problems per subject) is provided for fast iteration.
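The extraction step can be illustrated with a minimal sketch. This is not the official grader's code or Mill's port; `extract_last_boxed` is a hypothetical helper that takes the last \boxed{...} in the model output and matches braces so that nested LaTeX such as \frac{3}{4} survives intact.

```python
from typing import Optional


def extract_last_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None.

    Hypothetical helper for illustration; the official grader may differ.
    """
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None  # the model never boxed an answer

    i = start + len(marker)
    depth = 1  # already inside the opening brace of \boxed{
    chars = []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                # Matching close brace found: everything collected is the answer.
                return "".join(chars).strip()
        chars.append(ch)
        i += 1
    return None  # unbalanced braces: treat as no answer


if __name__ == "__main__":
    print(extract_last_boxed(r"So the area is \boxed{\frac{3}{4}}."))
    print(extract_last_boxed(r"First \boxed{A}, on reflection \boxed{B}"))
```

Taking the last boxed expression rather than the first is a design assumption here: with step-by-step reasoning, a model may box an intermediate value before settling on its final answer.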

Grader fidelity is exact. Replaying the official repo’s own committed model outputs through Mill’s ported grader reproduces the official per-sample labels with 0 mismatches across all 3,040 test samples (GPT-4V-CoT → 23.98%, Qwen-VL-Max-CoT → 12.63%, LLaVA-1.5-13B → 11.12%, each matching the published number). Without the optional latex2sympy2 extra, only 5 of the 3,040 samples change grade, because letters and plain numbers are already graded by string normalisation.
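A rough sketch of how an optional symbolic check can sit behind a string-normalisation fast path is shown below. It is an illustration under stated assumptions, not the ported grader: `normalise`, `is_equal`, and `grade` are hypothetical names, and the normalisation rules are deliberately simpler than the official ones. The only external call assumed is `latex2sympy` from the `latex2sympy2` package.

```python
from typing import Optional

try:
    # Optional extra: without it, grading falls back to string comparison only.
    from latex2sympy2 import latex2sympy
except ImportError:
    latex2sympy = None


def normalise(ans: str) -> str:
    """Simplified normalisation (assumption): trim, drop $ and spaces, lowercase."""
    return ans.strip().strip("$").replace(" ", "").lower()


def is_equal(pred: str, gold: str) -> bool:
    """String match first; symbolic equivalence only if latex2sympy2 is available."""
    if normalise(pred) == normalise(gold):
        return True
    if latex2sympy is None:
        return False
    try:
        from sympy import simplify

        # Equivalent expressions have a difference that simplifies to zero.
        return simplify(latex2sympy(pred) - latex2sympy(gold)) == 0
    except Exception:
        # Unparseable LaTeX counts as a mismatch rather than an error.
        return False


def grade(pred: str, gold_letter: str, gold_option_text: Optional[str] = None) -> bool:
    """Accept a prediction that matches either the gold letter or the gold option text."""
    candidates = [gold_letter]
    if gold_option_text is not None:
        candidates.append(gold_option_text)
    return any(is_equal(pred, gold) for gold in candidates)


if __name__ == "__main__":
    print(grade("c", "C", "12"))   # letter match, case-insensitive
    print(grade("12", "C", "12"))  # option-text match
    print(grade("13", "C", "12"))  # no match
```

Because letters and plain numbers already match in the string path, the symbolic branch only matters for answers written in different but equivalent LaTeX forms, which is consistent with so few samples changing when the extra is absent.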

Evaluation configuration

Hyperparameter	Value
Benchmark	`mathvision` (full `test`, 3,040) / `mathvision_testmini` (304)
Dataset	`MathLLMs/MathVision` (`test` / `testmini`)
n-shots	`0` (zero-shot, boxed chain-of-thought)
Task type	`GENERATIVE_QA` (boxed CoT), vision-language
Metric	`mathvision_acc` (official LaTeX-aware answer matching)
Max new tokens	`2048`
Backend	HuggingFace VLM (`hf`)

Reproduce

# Optional: exact symbolic-equivalence grading (matches the official grader)
pip install 'mill-eval[mathvision]'

# Official 304-sample mini split (fast)
mill --output_dir ./results eval \
  "Qwen/Qwen2.5-VL-7B-Instruct[dtype=bfloat16,batch_size=4]" mathvision_testmini

# Full 3,040-sample test split (the canonical benchmark)
mill --output_dir ./results eval \
  "Qwen/Qwen2.5-VL-7B-Instruct[dtype=bfloat16,batch_size=4]" mathvision

mill --output_dir ./results collect --metric mathvision_acc

Results

Mill: 22.04% ± 2.38 (Qwen2.5-VL-7B, testmini, 304 samples)
Reported: ≈25% (Qwen2.5-VL-7B, full test)
Difference: ≈3 pts (within sampling + harness spread)

Mill’s mathvision_acc is graded by an exact port of the official metric, so the score reflects the model and prompt, not the grader. Mill measures 22.04% ± 2.38 for Qwen2.5-VL-7B on the official testmini (304) split, where ± is the sampling standard error. The published full-test number is ≈25% (Qwen2.5-VL Technical Report; reported values span ~24–26% across harnesses depending on prompt and image resolution). Because testmini is a 304-sample subset, its score carries a sampling standard error of about 2.4 pp on top of that harness spread. The gap to the reported number, (25 − 22.04) / 2.38 ≈ 1.2 SE, is therefore statistically consistent with it. The full 3,040-sample task reproduces the same benchmark for those with more compute.
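The standard-error arithmetic can be checked with a few lines of Python. This sketch assumes the ± figure is a binomial standard error, sqrt(p(1 − p)/n), and that 22.04% corresponds to 67 correct answers out of 304; both are inferences from the reported numbers, not statements about how Mill computes them.

```python
import math


def binomial_se(correct: int, n: int) -> float:
    """Standard error of an accuracy estimated from n independent samples."""
    p = correct / n
    return math.sqrt(p * (1 - p) / n)


n = 304          # testmini size
correct = 67     # assumption: 67 / 304 = 22.04%, inferred from the reported score
reported = 0.25  # approximate published full-test accuracy

acc = correct / n
se = binomial_se(correct, n)
gap_in_se = (reported - acc) / se

print(f"acc = {acc:.2%}, SE = {se:.2%}, gap = {gap_in_se:.1f} SE")
# → acc = 22.04%, SE = 2.38%, gap = 1.2 SE
```

A gap of about 1.2 standard errors is well inside the range expected from sampling noise alone, before accounting for prompt and image-resolution differences between harnesses.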

Per-model results

Model	Split	Mill (`mathvision_acc`)	Reference	Source
`Qwen/Qwen2.5-VL-7B-Instruct`	`testmini` (304)	22.04% ± 2.38	≈25% (full test)	Qwen2.5-VL Tech Report
