Welcome to Mill
Mill is a unified multi-modal evaluation framework that gives you one tool for running text, image, video, and audio benchmarks. It combines the best ideas from the existing evaluation ecosystem (output caching, a rich ChatMessages protocol, distributed SLURM scheduling, and a composable metric registry) into a single, consistent interface.

Quickstart
Run your first evaluation in minutes.
Installation
Install Mill and optional backend extras.
Task Reference
Browse supported benchmarks and task formats.
Model Reference
Configure local HF models, vLLM, and API backends.
Design philosophy
Mill borrows proven ideas from across the evaluation landscape:

| Feature | Borrowed from |
|---|---|
| Output caching (Feather, skip completed jobs) | unibench |
| Multimodal ChatMessages protocol | lmms-eval |
| Python-class task format | lighteval |
| SLURM distributed scheduling | oellm-evals |
| Per-family model config files | opencompass |
| Bootstrap CI + metric registry | lighteval |
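
To make the "Bootstrap CI + metric registry" row concrete, here is a minimal sketch of how the two ideas fit together: metrics are registered by name, and a percentile bootstrap resamples per-sample scores to estimate a confidence interval. This is an illustration only, not Mill's actual API. The names `METRICS`, `register_metric`, and `bootstrap_ci` are hypothetical, and the sketch uses only the Python standard library.

```python
import random
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical registry (not Mill's API): metric name -> aggregation function.
METRICS: Dict[str, Callable[[Sequence[float]], float]] = {}


def register_metric(name: str):
    """Decorator that stores a metric function under the given name."""
    def decorator(fn: Callable[[Sequence[float]], float]):
        METRICS[name] = fn
        return fn
    return decorator


@register_metric("accuracy")
def accuracy(scores: Sequence[float]) -> float:
    # Each score is 1.0 for a correct sample and 0.0 for an incorrect one.
    return sum(scores) / len(scores)


def bootstrap_ci(
    scores: Sequence[float],
    metric_name: str,
    n_resamples: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for a registered metric."""
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    metric = METRICS[metric_name]
    n = len(scores)
    # Resample with replacement, recompute the metric, and sort the results.
    stats = sorted(
        metric([scores[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


if __name__ == "__main__":
    per_sample = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
    point = METRICS["accuracy"](per_sample)
    low, high = bootstrap_ci(per_sample, "accuracy")
    print(f"accuracy = {point:.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```

Looking metrics up by name keeps the interval code independent of any particular metric, so a new metric only needs to be registered to get confidence intervals.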
Supported modalities
Text
MMLU and MMLU-Pro built in
Image
CIFAR-10, ImageNet, and MMMU-Pro built in (CLIP, timm, and VLMs)
Video
Custom tasks via decord
Audio
Coming soon