Introduction

Welcome to Mill

Mill is a unified multi-modal evaluation framework: one tool for running text, image, and video benchmarks, with audio support planned. It combines ideas from existing evaluation frameworks (output caching, a rich `ChatMessages` protocol, distributed SLURM scheduling, and a composable metric registry) into a single, consistent interface.

Quickstart

Run your first evaluation in minutes.

Installation

Install Mill and optional backend extras.

Task Reference

Browse supported benchmarks and task formats.

Model Reference

Configure local HF models, vLLM, and API backends.

Design philosophy

Mill borrows proven ideas from across the evaluation landscape:

| Feature | Borrowed from |
| --- | --- |
| Output caching (Feather, skip completed jobs) | unibench |
| Multimodal `ChatMessages` protocol | lmms-eval |
| Python-class task format | lighteval |
| SLURM distributed scheduling | oellm-evals |
| Per-family model config files | opencompass |
| Bootstrap CI + metric registry | lighteval |
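
To make the "Bootstrap CI + metric registry" row concrete, here is a minimal sketch of the general pattern: a name-to-function registry filled by a decorator, plus a percentile bootstrap over per-example scores. This is not Mill's implementation. `METRICS`, `register_metric`, `exact_match`, and `bootstrap_ci` are hypothetical names chosen for illustration, and the real registry and resampling defaults may differ.

```python
import random
from statistics import mean
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical registry: metric name -> per-example scoring function.
METRICS: Dict[str, Callable[[str, str], float]] = {}


def register_metric(name: str):
    """Decorator that stores a scoring function under `name`."""
    def decorator(fn: Callable[[str, str], float]):
        METRICS[name] = fn
        return fn
    return decorator


@register_metric("exact_match")
def exact_match(prediction: str, reference: str) -> float:
    """Score 1.0 when the stripped strings are identical, else 0.0."""
    return float(prediction.strip() == reference.strip())


def bootstrap_ci(
    scores: Sequence[float],
    n_resamples: int = 1000,
    confidence: float = 0.95,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean score."""
    if not scores:
        raise ValueError("scores must be non-empty")
    rng = random.Random(seed)  # seeded so repeated runs agree
    n = len(scores)
    # Resample with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(scores, k=n)) for _ in range(n_resamples))
    alpha = (1.0 - confidence) / 2.0
    low = means[int(alpha * n_resamples)]
    high = means[min(n_resamples - 1, int((1.0 - alpha) * n_resamples))]
    return low, high


# Usage: score a few predictions, then put an interval around the mean.
predictions = ["A", "B", "C", "D"]
references = ["A", "B", "D", "D"]
scores = [METRICS["exact_match"](p, r) for p, r in zip(predictions, references)]
low, high = bootstrap_ci(scores)
```

Looking metrics up by name keeps task definitions decoupled from scoring code, and fixing the seed makes the reported interval reproducible across runs.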

Supported modalities

Text

MMLU and MMLU-Pro built in

Image

CIFAR-10, ImageNet, and MMMU-Pro built in (CLIP, timm, and VLMs)

Video

Custom tasks via decord

Audio

Coming soon
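
For the video modality listed above, a custom task typically needs to turn a clip into a fixed number of frames. The sketch below shows one common approach, uniform sampling at segment midpoints, using decord's `VideoReader`. The helper names `uniform_frame_indices` and `load_frames` are hypothetical and not part of Mill; how Mill's own video tasks sample frames is not specified here.

```python
from typing import List


def uniform_frame_indices(total_frames: int, num_samples: int) -> List[int]:
    """Pick evenly spaced frame indices, one from the middle of each segment."""
    if total_frames <= 0 or num_samples <= 0:
        raise ValueError("total_frames and num_samples must be positive")
    # Never request more frames than the clip contains.
    num_samples = min(num_samples, total_frames)
    step = total_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


def load_frames(path: str, num_samples: int = 8):
    """Decode `num_samples` uniformly spaced frames as a NumPy array."""
    # Imported lazily so the index helper works without decord installed.
    from decord import VideoReader, cpu

    reader = VideoReader(path, ctx=cpu(0))
    indices = uniform_frame_indices(len(reader), num_samples)
    # get_batch returns frames stacked as (num_frames, height, width, channels).
    return reader.get_batch(indices).asnumpy()
```

Sampling at segment midpoints avoids always picking the first and last frames, which are often black or title frames.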

Quick example

```bash
# Text evaluation: local HF model
mill --output_dir ./results eval \
     "meta-llama/Meta-Llama-3-8B-Instruct[dtype=bfloat16,batch_size=8]" mmlu,mmlu_pro

# Chain-of-thought benchmark: instruction-tuned model config file
mill --output_dir ./results eval \
     mill/models/configs/qwen/qwen2_5_7b_instruct.py mmlu_pro

# API model (OpenAI / Anthropic): generative tasks only
mill --output_dir ./results eval "litellm[model=gpt-4o]" mmlu_pro

# Vision: CLIP zero-shot image classification
mill --output_dir ./results eval \
     "clip[path=ViT-B-32,pretrained=laion2b_s34b_b79k]" cifar10
```
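
The model arguments in the commands above follow a `name[key=value,...]` shape. As an illustration of how such a string can be read, here is a small parser sketch. It is an assumption about the syntax inferred from the examples, not Mill's actual parser; `parse_model_spec` is a hypothetical name, and values are left as strings rather than cast to types.

```python
import re
from typing import Dict, Tuple

# A name, optionally followed by a bracketed, comma-separated option list.
_SPEC_RE = re.compile(r"^(?P<name>[^\[\]]+)(?:\[(?P<args>.*)\])?$")


def parse_model_spec(spec: str) -> Tuple[str, Dict[str, str]]:
    """Split 'name[key=value,...]' into the name and a dict of options."""
    match = _SPEC_RE.match(spec.strip())
    if match is None:
        raise ValueError(f"malformed model spec: {spec!r}")
    name = match.group("name")
    options: Dict[str, str] = {}
    args = match.group("args")
    if args:
        for item in args.split(","):
            key, sep, value = item.partition("=")
            if not sep or not key.strip():
                raise ValueError(f"expected key=value, got {item!r}")
            options[key.strip()] = value.strip()
    return name, options


# Usage with one of the specs shown above.
name, options = parse_model_spec(
    "meta-llama/Meta-Llama-3-8B-Instruct[dtype=bfloat16,batch_size=8]"
)
```

A spec without brackets, such as a config file path, comes back with an empty options dict.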
