Welcome to Mill

Mill is a unified multi-modal evaluation framework: one tool for running text, image, and video benchmarks, with audio support planned. It combines ideas from the existing evaluation ecosystem (output caching, a rich ChatMessages protocol, distributed SLURM scheduling, and a composable metric registry) in a single, consistent interface.

Quickstart

Run your first evaluation in minutes.

Installation

Install Mill and optional backend extras.

Task Reference

Browse supported benchmarks and task formats.

Model Reference

Configure local HF models, vLLM, and API backends.

Design philosophy

Mill borrows proven ideas from across the evaluation landscape:
| Feature | Borrowed from |
| --- | --- |
| Output caching (Feather, skip completed jobs) | unibench |
| Multimodal ChatMessages protocol | lmms-eval |
| Python-class task format | lighteval |
| SLURM distributed scheduling | oellm-evals |
| Per-family model config files | opencompass |
| Bootstrap CI + metric registry | lighteval |
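
To make the "Bootstrap CI + metric registry" idea concrete, here is a minimal sketch of a percentile bootstrap confidence interval over per-example scores, paired with a tiny decorator-based registry. It is an illustration only, not Mill's implementation: the names `METRICS`, `register_metric`, `accuracy`, and `bootstrap_ci`, along with the default of 1000 resamples, are assumptions made for this example.

```python
import random
from statistics import mean

# Hypothetical registry: maps a metric name to a scoring function.
# Mill's real registry may look quite different.
METRICS = {}


def register_metric(name):
    """Decorator that stores a metric function under `name`."""
    def decorator(fn):
        METRICS[name] = fn
        return fn
    return decorator


@register_metric("accuracy")
def accuracy(predictions, references):
    """Per-example 0/1 scores: 1 where the prediction matches the reference."""
    return [int(p == r) for p, r in zip(predictions, references)]


def bootstrap_ci(scores, n_resamples=1000, confidence=0.95, seed=0):
    """Percentile bootstrap CI for the mean of per-example scores.

    Returns (point_estimate, lower_bound, upper_bound).
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    rng = random.Random(seed)  # seeded so repeated runs give the same interval
    n = len(scores)
    # Resample with replacement and record the mean of each resample.
    resampled_means = sorted(
        mean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    alpha = 1.0 - confidence
    lower_idx = int((alpha / 2) * n_resamples)
    upper_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return mean(scores), resampled_means[lower_idx], resampled_means[upper_idx]


if __name__ == "__main__":
    preds = ["A", "B", "C", "D"] * 25
    refs = ["A", "B", "C", "A"] * 25
    scores = METRICS["accuracy"](preds, refs)
    point, lower, upper = bootstrap_ci(scores)
    print(f"accuracy = {point:.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```

Keeping the metric (which produces per-example scores) separate from the interval estimator means the same bootstrap routine can be reused for any registered metric that returns one score per example.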

Supported modalities

Text

MMLU and MMLU-Pro built in

Image

CIFAR-10, ImageNet, and MMMU-Pro built in (CLIP, timm, and VLMs)

Video

Custom tasks via decord

Audio

Coming soon

Quick example

# Text evaluation — local HF model
mill --output_dir ./results eval \
     "meta-llama/Meta-Llama-3-8B-Instruct[dtype=bfloat16,batch_size=8]" mmlu,mmlu_pro

# Chain-of-thought benchmark — instruction-tuned model config file
mill --output_dir ./results eval \
     mill/models/configs/qwen/qwen2_5_7b_instruct.py mmlu_pro

# API model (OpenAI / Anthropic) — generative tasks only
mill --output_dir ./results eval "litellm[model=gpt-4o]" mmlu_pro

# Vision — CLIP zero-shot image classification
mill --output_dir ./results eval \
     "clip[path=ViT-B-32,pretrained=laion2b_s34b_b79k]" cifar10
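
The commands above pass each model as a name followed by optional bracketed `key=value` options, such as `"clip[path=ViT-B-32,pretrained=laion2b_s34b_b79k]"`. How Mill parses these strings is not shown on this page. The sketch below is one plausible way to split such a spec into a name and an options dictionary. The function `parse_model_spec` is hypothetical, and it deliberately leaves every value as a string rather than guessing at types.

```python
import re

# A name without brackets, optionally followed by one [key=value,...] group.
_SPEC_RE = re.compile(r"^(?P<name>[^\[\]]+?)(?:\[(?P<args>[^\[\]]*)\])?$")


def parse_model_spec(spec):
    """Split 'name[k1=v1,k2=v2]' into (name, {'k1': 'v1', 'k2': 'v2'}).

    Hypothetical helper for illustration; values are kept as strings.
    """
    match = _SPEC_RE.match(spec.strip())
    if match is None:
        raise ValueError(f"malformed model spec: {spec!r}")
    name = match.group("name")
    options = {}
    raw_args = match.group("args")
    if raw_args:
        for item in raw_args.split(","):
            key, sep, value = item.partition("=")
            if not sep or not key.strip():
                raise ValueError(f"expected key=value, got {item!r}")
            options[key.strip()] = value.strip()
    return name, options


if __name__ == "__main__":
    print(parse_model_spec(
        "meta-llama/Meta-Llama-3-8B-Instruct[dtype=bfloat16,batch_size=8]"
    ))
    print(parse_model_spec("litellm[model=gpt-4o]"))
```

A spec with no brackets, such as a path to a model config file, comes back with an empty options dictionary, so a caller can treat both forms uniformly.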