Vision Evaluation

Overview

Mill uses the ChatMessages protocol to pass multimodal inputs — images, video frames, or audio — alongside text. The model backend handles format conversion automatically.

Built-in vision benchmarks

Mill ships three image benchmarks, each runnable by CLIP-style encoders and vision-language models (see Reproducibility for results):

Benchmark	Renderings	Run it
`cifar10`	zero-shot (CLIP) · generative MCQ (VLM)	`mill eval "clip[path=ViT-B-32,pretrained=laion2b_s34b_b79k]" cifar10`
`imagenet`	zero-shot (CLIP) · generative MCQ (VLM)	`mill eval "clip[path=ViT-B-32,pretrained=laion2b_s34b_b79k]" imagenet`
`mmmu_pro`	generative CoT (VLM) · zero-shot (CLIP)	`mill eval "Qwen/Qwen3-VL-2B-Instruct[dtype=bfloat16]" mmmu_pro`

Each benchmark auto-selects the rendering your model supports. To add your own multimodal task, see How multimodal tasks work and register it with --task_paths.

Using a model config file

Multimodal models need their modalities declared as a list — which can’t be expressed inline in brackets — so configure them with a Python config file (opencompass style):

mill --output_dir ./results eval \
  mill/models/configs/qwen/qwen2_5_vl_7b.py my_vqa \
  --task_paths ./my-tasks

A config file exports a top-level model dict (mirroring TransformersModel keyword arguments, plus optional abbr and run_cfg). The bundled Qwen2.5-VL config:

# mill/models/configs/qwen/qwen2_5_vl_7b.py
from mill.models.transformers import TransformersModel

model = dict(
    type=TransformersModel,
    abbr="qwen2.5-vl-7b",
    path="Qwen/Qwen2.5-VL-7B-Instruct",
    modalities=["text", "image", "video"],
    dtype="bfloat16",
    max_context_length=32768,
    run_cfg=dict(num_gpus=1, batch_size=4),
)

Available config families

Family	Path
Qwen2.5-VL	`mill/models/configs/qwen/`
InternVL	`mill/models/configs/internvl/`
Llama	`mill/models/configs/llama/`

How multimodal tasks work

A multimodal task’s prompt_function (or doc_to_visual) returns a Doc with the visuals field populated. Mill assembles a ChatMessages object and passes it to the model:

from mill.api.task import Doc

def my_vqa_prompt(row: dict) -> Doc:
    return Doc(
        query=row["question"],
        visuals=[row["image"]],          # PIL.Image, file path, or URL
        target_index=row["answer_idx"],
    )

Video tasks

Install the video extra first:

pip install -e ".[video]"

Video tasks work the same way — Doc.videos holds paths to video files, and Mill uses decord to decode frames before passing them to the model.

Limiting samples

mill --limit 100 --output_dir ./results eval \
  mill/models/configs/qwen/qwen2_5_vl_7b.py my_vqa \
  --task_paths ./my-tasks

​Overview

​Built-in vision benchmarks

​Using a model config file

​Available config families

​How multimodal tasks work

​Video tasks

​Limiting samples