> ## Documentation Index
> Fetch the complete documentation index at: https://pymill.com/llms.txt
> Use this file to discover all available pages before exploring further.

# Vision Evaluation

> Evaluate multimodal models on image and video benchmarks.

## Overview

Mill uses the `ChatMessages` protocol to pass multimodal inputs — images, video frames, or audio — alongside text. The model backend handles format conversion automatically.

## Built-in vision benchmarks

Mill ships three image benchmarks, each runnable by CLIP-style encoders and vision-language models (see [Reproducibility](/docs/reproducibility/overview) for results):

| Benchmark  | Renderings                              | Run it                                                                  |
| ---------- | --------------------------------------- | ----------------------------------------------------------------------- |
| `cifar10`  | zero-shot (CLIP) · generative MCQ (VLM) | `mill eval "clip[path=ViT-B-32,pretrained=laion2b_s34b_b79k]" cifar10`  |
| `imagenet` | zero-shot (CLIP) · generative MCQ (VLM) | `mill eval "clip[path=ViT-B-32,pretrained=laion2b_s34b_b79k]" imagenet` |
| `mmmu_pro` | generative CoT (VLM) · zero-shot (CLIP) | `mill eval "Qwen/Qwen3-VL-2B-Instruct[dtype=bfloat16]" mmmu_pro`        |

Each benchmark auto-selects the rendering your model supports. To add your own multimodal task, see [How multimodal tasks work](#how-multimodal-tasks-work) and register it with `--task_paths`.

## Using a model config file

Multimodal models need their `modalities` declared as a list — which can't be expressed inline in brackets — so configure them with a Python config file (opencompass style):

```bash theme={null}
mill --output_dir ./results eval \
  mill/models/configs/qwen/qwen2_5_vl_7b.py my_vqa \
  --task_paths ./my-tasks
```

A config file exports a top-level `model` dict (mirroring `TransformersModel` keyword arguments, plus optional `abbr` and `run_cfg`). The bundled Qwen2.5-VL config:

```python theme={null}
# mill/models/configs/qwen/qwen2_5_vl_7b.py
from mill.models.transformers import TransformersModel

model = dict(
    type=TransformersModel,
    abbr="qwen2.5-vl-7b",
    path="Qwen/Qwen2.5-VL-7B-Instruct",
    modalities=["text", "image", "video"],
    dtype="bfloat16",
    max_context_length=32768,
    run_cfg=dict(num_gpus=1, batch_size=4),
)
```

### Available config families

| Family     | Path                            |
| ---------- | ------------------------------- |
| Qwen2.5-VL | `mill/models/configs/qwen/`     |
| InternVL   | `mill/models/configs/internvl/` |
| Llama      | `mill/models/configs/llama/`    |

## How multimodal tasks work

A multimodal task's `prompt_function` (or `doc_to_visual`) returns a `Doc` with the `visuals` field populated. Mill assembles a `ChatMessages` object and passes it to the model:

```python theme={null}
from mill.api.task import Doc

def my_vqa_prompt(row: dict) -> Doc:
    return Doc(
        query=row["question"],
        visuals=[row["image"]],          # PIL.Image, file path, or URL
        target_index=row["answer_idx"],
    )
```

## Video tasks

Install the video extra first:

```bash theme={null}
pip install -e ".[video]"
```

Video tasks work the same way — `Doc.videos` holds paths to video files, and Mill uses `decord` to decode frames before passing them to the model.

## Limiting samples

```bash theme={null}
mill --limit 100 --output_dir ./results eval \
  mill/models/configs/qwen/qwen2_5_vl_7b.py my_vqa \
  --task_paths ./my-tasks
```
