# MATH-Vision

> The MATH-Vision (MATH-V) multimodal mathematical reasoning benchmark, reproduced with Mill.

MATH-Vision (MATH-V) is a **multimodal mathematical reasoning** benchmark: 3,040 competition problems, each paired with a visual context (figure, diagram, or plot), spanning 16 subjects and 5 difficulty levels. Each problem is either multiple-choice (5 options, A–E) or free-form (a numeric/LaTeX answer).

Mill runs it as a single generative task for vision-language models (`mathvision`): the model sees the image and the question, reasons step by step, and puts its final answer in `\boxed{}`. The boxed answer is extracted and graded by a faithful port of the [official MATH-Vision grader](https://github.com/mathllm/MATH-V) (`evaluation/{utils,evaluate}.py`). The port applies case-insensitive exact match, tuple/list handling, and LaTeX-aware symbolic equivalence via `latex2sympy2`, and compares the prediction against both the gold letter and the gold option text. A 304-sample `mathvision_testmini` task (the official balanced mini split, 19 problems per subject) is provided for fast iteration.
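
As a rough illustration of the extraction step, the sketch below pulls the contents of the last `\boxed{...}` out of a model response while respecting nested braces. It is a minimal sketch, not Mill's or the official grader's code; the function name `extract_last_boxed` is hypothetical.

```python
from typing import Optional


def extract_last_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None if absent."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None  # the model never produced a boxed answer

    content_start = start + len(marker)
    depth = 1  # we are already inside the opening brace of \boxed{
    for pos in range(content_start, len(text)):
        ch = text[pos]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                # Matching close brace found: everything in between is the answer
                return text[content_start:pos].strip()
    return None  # braces never balanced (e.g. truncated generation)


if __name__ == "__main__":
    print(extract_last_boxed(r"So the area is \boxed{\frac{3}{4}}."))
```

Counting brace depth rather than using a simple regular expression keeps LaTeX answers such as `\frac{3}{4}` intact, and returning `None` on unbalanced braces covers generations cut off at the token limit.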

<Note>
  **Grader fidelity is exact.** Replaying the official repo's own committed model outputs through Mill's ported grader reproduces its per-sample labels with **0 mismatches across all 3,040 test samples** (GPT-4V-CoT → 23.98%, Qwen-VL-Max-CoT → 12.63%, LLaVA-1.5-13B → 11.12%, each matching the published number). Without the optional `latex2sympy2` extra, the grades of only 5 of 3,040 samples change, because letters and plain numbers are graded through string normalisation.
</Note>
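
To make the string-normalisation path concrete, here is a simplified sketch of comparing a prediction against both the gold letter and the gold option text without any symbolic library. The helper names `normalise` and `matches_gold` and the exact normalisation rules are assumptions made for illustration; the ported grader's actual rules follow the official repository and are more extensive.

```python
from typing import Sequence


def normalise(answer: str) -> str:
    """Lower-case, trim whitespace, and drop surrounding $ and a trailing period."""
    s = answer.strip().lower()
    s = s.strip("$").strip()
    return s.rstrip(".").strip()


def matches_gold(pred: str, gold: str, options: Sequence[str] = ()) -> bool:
    """String-only check against the gold answer (illustrative, not the official logic)."""
    candidates = {normalise(gold)}
    letter = gold.strip().upper()
    # Multiple-choice: gold is a letter A-E, so also accept the text of that option
    if options and len(letter) == 1 and letter in "ABCDE":
        idx = ord(letter) - ord("A")
        if idx < len(options):
            candidates.add(normalise(options[idx]))
    return normalise(pred) in candidates


if __name__ == "__main__":
    opts = ["1", "2", "3", "4", "5"]
    print(matches_gold("b", "B", opts))        # letter match, case-insensitive
    print(matches_gold("2", "B", opts))        # option-text match
    print(matches_gold(r"\frac{1}{2}", "0.5"))  # needs symbolic equivalence
```

The last call returns `False` under pure string comparison; answers like this are the kind that symbolic equivalence is meant to resolve.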

## Evaluation configuration

| Hyperparameter | Value                                                           |
| -------------- | --------------------------------------------------------------- |
| Benchmark      | `mathvision` (full `test`, 3,040) / `mathvision_testmini` (304) |
| Dataset        | `MathLLMs/MathVision` (`test` / `testmini`)                     |
| n-shots        | `0` (zero-shot, boxed chain-of-thought)                         |
| Task type      | `GENERATIVE_QA` (boxed CoT), vision-language                    |
| Metric         | `mathvision_acc` (official LaTeX-aware answer matching)         |
| Max new tokens | `2048`                                                          |
| Backend        | HuggingFace VLM (`hf`)                                          |

## Reproduce

```bash theme={null}
# Optional: exact symbolic-equivalence grading (matches the official grader)
pip install 'mill-eval[mathvision]'

# Official 304-sample mini split (fast)
mill --output_dir ./results eval \
  "Qwen/Qwen2.5-VL-7B-Instruct[dtype=bfloat16,batch_size=4]" mathvision_testmini

# Full 3,040-sample test split (the canonical benchmark)
mill --output_dir ./results eval \
  "Qwen/Qwen2.5-VL-7B-Instruct[dtype=bfloat16,batch_size=4]" mathvision

mill --output_dir ./results collect --metric mathvision_acc
```

## Results

<CardGroup cols={3}>
  <Card title="Mill" icon="gauge">
    **22.04%** ± 2.38 \
    Qwen2.5-VL-7B, testmini (304)
  </Card>

  <Card title="Reported" icon="book">
    **≈25%** \
    Qwen2.5-VL-7B, full test
  </Card>

  <Card title="Difference" icon="scale-balanced">
    **≈3 pts** \
    within sampling error and harness spread
  </Card>
</CardGroup>

<Note>
  Mill's `mathvision_acc` is computed by an exact port of the official metric, so the score reflects the model and prompt, not the grader. Mill measures **22.04% ± 2.38%** for Qwen2.5-VL-7B on the official **testmini** (304) split. The published full-test number is **≈25%** ([Qwen2.5-VL Technical Report](https://arxiv.org/abs/2502.13923); reported values span \~24–26% across harnesses depending on prompt and image resolution). Because testmini has only 304 samples, its score carries a sampling standard error of \~±2.4 pp on top of that spread, so the two numbers are statistically consistent (the gap is \~1.2 SE). The full 3,040-sample `mathvision` task runs the same benchmark with a smaller sampling error, at a higher compute cost.
</Note>
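
The sampling figures quoted above can be checked with a few lines of arithmetic. This sketch assumes the ± value is a binomial standard error and that 22.04% corresponds to 67 correct answers out of 304; that count is inferred from the reported accuracy, not taken from Mill's output.

```python
import math


def binomial_se(correct: int, total: int) -> float:
    """Standard error of an accuracy estimated from `total` independent samples."""
    p = correct / total
    return math.sqrt(p * (1 - p) / total)


# Assumption: 67/304 is the count behind the reported 22.04% on testmini
correct, total = 67, 304
acc = correct / total
se = binomial_se(correct, total)

reported = 0.25  # approximate published full-test accuracy
gap_in_se = (reported - acc) / se

print(f"acc = {acc:.2%}, SE = {se:.2%}, gap = {gap_in_se:.1f} SE")
# prints: acc = 22.04%, SE = 2.38%, gap = 1.2 SE
```

A gap of about 1.2 standard errors is well inside the usual two-SE band, which is why the testmini score and the published figure are treated as consistent.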

### Per-model results

| Model                         | Split            | Mill (`mathvision_acc`) | Reference        | Source                                                     |
| ----------------------------- | ---------------- | ----------------------- | ---------------- | ---------------------------------------------------------- |
| `Qwen/Qwen2.5-VL-7B-Instruct` | `testmini` (304) | **22.04%** ± 2.38       | ≈25% (full test) | [Qwen2.5-VL Tech Report](https://arxiv.org/abs/2502.13923) |
