Overview
Mill uses theChatMessages protocol to pass multimodal inputs — images, video frames, or audio — alongside text. The model backend handles format conversion automatically.
Built-in vision benchmarks
Mill ships three image benchmarks, each runnable by CLIP-style encoders and vision-language models (see Reproducibility for results):| Benchmark | Renderings | Run it |
|---|---|---|
cifar10 | zero-shot (CLIP) · generative MCQ (VLM) | mill eval "clip[path=ViT-B-32,pretrained=laion2b_s34b_b79k]" cifar10 |
imagenet | zero-shot (CLIP) · generative MCQ (VLM) | mill eval "clip[path=ViT-B-32,pretrained=laion2b_s34b_b79k]" imagenet |
mmmu_pro | generative CoT (VLM) · zero-shot (CLIP) | mill eval "Qwen/Qwen3-VL-2B-Instruct[dtype=bfloat16]" mmmu_pro |
--task_paths.
Using a model config file
Multimodal models need theirmodalities declared as a list — which can’t be expressed inline in brackets — so configure them with a Python config file (opencompass style):
model dict (mirroring TransformersModel keyword arguments, plus optional abbr and run_cfg). The bundled Qwen2.5-VL config:
Available config families
| Family | Path |
|---|---|
| Qwen2.5-VL | mill/models/configs/qwen/ |
| InternVL | mill/models/configs/internvl/ |
| Llama | mill/models/configs/llama/ |
How multimodal tasks work
A multimodal task’sprompt_function (or doc_to_visual) returns a Doc with the visuals field populated. Mill assembles a ChatMessages object and passes it to the model:
Video tasks
Install the video extra first:Doc.videos holds paths to video files, and Mill uses decord to decode frames before passing them to the model.