Skip to main content

Overview

Mill uses the ChatMessages protocol to pass multimodal inputs — images, video frames, or audio — alongside text. The model backend handles format conversion automatically.

Built-in vision benchmarks

Mill ships three image benchmarks, each runnable by CLIP-style encoders and vision-language models (see Reproducibility for results):
BenchmarkRenderingsRun it
cifar10zero-shot (CLIP) · generative MCQ (VLM)mill eval "clip[path=ViT-B-32,pretrained=laion2b_s34b_b79k]" cifar10
imagenetzero-shot (CLIP) · generative MCQ (VLM)mill eval "clip[path=ViT-B-32,pretrained=laion2b_s34b_b79k]" imagenet
mmmu_progenerative CoT (VLM) · zero-shot (CLIP)mill eval "Qwen/Qwen3-VL-2B-Instruct[dtype=bfloat16]" mmmu_pro
Each benchmark auto-selects the rendering your model supports. To add your own multimodal task, see How multimodal tasks work and register it with --task_paths.

Using a model config file

Multimodal models need their modalities declared as a list — which can’t be expressed inline in brackets — so configure them with a Python config file (opencompass style):
mill --output_dir ./results eval \
  mill/models/configs/qwen/qwen2_5_vl_7b.py my_vqa \
  --task_paths ./my-tasks
A config file exports a top-level model dict (mirroring TransformersModel keyword arguments, plus optional abbr and run_cfg). The bundled Qwen2.5-VL config:
# mill/models/configs/qwen/qwen2_5_vl_7b.py
from mill.models.transformers import TransformersModel

model = dict(
    type=TransformersModel,
    abbr="qwen2.5-vl-7b",
    path="Qwen/Qwen2.5-VL-7B-Instruct",
    modalities=["text", "image", "video"],
    dtype="bfloat16",
    max_context_length=32768,
    run_cfg=dict(num_gpus=1, batch_size=4),
)

Available config families

FamilyPath
Qwen2.5-VLmill/models/configs/qwen/
InternVLmill/models/configs/internvl/
Llamamill/models/configs/llama/

How multimodal tasks work

A multimodal task’s prompt_function (or doc_to_visual) returns a Doc with the visuals field populated. Mill assembles a ChatMessages object and passes it to the model:
from mill.api.task import Doc

def my_vqa_prompt(row: dict) -> Doc:
    return Doc(
        query=row["question"],
        visuals=[row["image"]],          # PIL.Image, file path, or URL
        target_index=row["answer_idx"],
    )

Video tasks

Install the video extra first:
pip install -e ".[video]"
Video tasks work the same way — Doc.videos holds paths to video files, and Mill uses decord to decode frames before passing them to the model.

Limiting samples

mill --limit 100 --output_dir ./results eval \
  mill/models/configs/qwen/qwen2_5_vl_7b.py my_vqa \
  --task_paths ./my-tasks