Changelog - Mill

Mill is in alpha. Releases are tracked below, newest first; APIs and task configs may still change between 0.x versions.

2026-06-30

alpha

Audio evaluation

Mill now evaluates audio models alongside text and image, through the same interface.

Benchmarks

Clotho-AQA — single-word audio question answering, graded by case/punctuation-insensitive exact match (mirrors the lmms-eval clotho_aqa_test setup). See Reproducibility → Clotho-AQA.
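As a rough illustration of what case/punctuation-insensitive exact match means, the sketch below normalizes both strings before comparing them. The function names are hypothetical, and the normalization rules (lowercasing, dropping ASCII punctuation, collapsing whitespace) are an assumption; Mill's actual grader may differ in detail.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text: str) -> str:
    """Lowercase, remove punctuation, and collapse runs of whitespace."""
    return " ".join(text.lower().translate(_PUNCT_TABLE).split())


def exact_match(prediction: str, reference: str) -> bool:
    """Hypothetical grader: True when the normalized strings are identical."""
    return normalize(prediction) == normalize(reference)


print(exact_match("Yes.", "yes"))  # True
print(exact_match("no", "yes"))    # False
```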

Model backends

The HuggingFace backend (hf) now loads audio-language models (e.g. Qwen/Qwen2-Audio-7B-Instruct). It resolves audio architectures that the image auto-class doesn’t cover, infers the audio modality from the processor, and resamples/down-mixes waveforms to the model’s expected sample rate.
Audio columns are decoded with soundfile/librosa instead of torchcodec, so audio tasks run without a system FFmpeg.
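To make "resamples/down-mixes" concrete, here is a minimal pure-Python sketch of both steps: averaging channels to mono, then changing the sample rate by linear interpolation. The function names are hypothetical, and linear interpolation is a simplification chosen for readability; it applies no anti-aliasing filter, so a real pipeline would more likely rely on a dedicated resampler such as the one in librosa.

```python
def downmix_to_mono(frames):
    """Average the channels of each frame.

    `frames` is a list of per-frame channel tuples, e.g. [(left, right), ...].
    """
    return [sum(frame) / len(frame) for frame in frames]


def resample_linear(samples, src_rate, dst_rate):
    """Resample a mono signal by linear interpolation (illustrative only)."""
    if src_rate == dst_rate or len(samples) < 2:
        return list(samples)
    n_out = max(1, round(len(samples) * dst_rate / src_rate))
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate   # fractional index into the source
        lo = min(int(pos), last)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out


stereo = [(0.0, 2.0), (1.0, 3.0), (2.0, 4.0), (3.0, 5.0)]
mono = downmix_to_mono(stereo)        # [1.0, 2.0, 3.0, 4.0]
halved = resample_linear(mono, 4, 2)  # [1.0, 3.0]
```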

Task API

Doc.audios now flows through the multimodal context builder; set input_modalities=["audio", "text"] to require an audio-capable model.

2026-06-30

alpha

Vision benchmarks

Vision evaluation

Mill now evaluates image models alongside text, through the same interface.

Benchmarks

CIFAR-10 and ImageNet-1k — zero-shot image classification, with a generative multiple-choice rendering for vision-language models.
MMMU-Pro (standard, 10 options) — multimodal chain-of-thought MCQ, with a CLIP zero-shot rendering for image–text encoders.

Each vision benchmark ships two renderings of the same data and auto-selects the one your model supports (pick_variant_by_model).
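The selection idea can be sketched as follows. This is not Mill's pick_variant_by_model; the Variant class, the pick_variant function, and the variant names are hypothetical, and the sketch assumes a model advertises the set of task types it supports.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    name: str
    task_type: str  # "generative" or "classification" (assumed labels)


def pick_variant(variants, supported_task_types):
    """Return the first rendering whose task type the model supports."""
    for variant in variants:
        if variant.task_type in supported_task_types:
            return variant
    raise ValueError("no rendering of this benchmark is supported by the model")


# Two renderings of the same data (names are made up for illustration).
cifar10 = [
    Variant("cifar10_zero_shot", "classification"),
    Variant("cifar10_mcq", "generative"),
]

chosen = pick_variant(cifar10, {"generative"})
print(chosen.name)  # cifar10_mcq
```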

Model backends

open_clip (clip) — CLIP-style zero-shot image classification / retrieval.
timm — supervised vision classification over a fixed pretrained head.
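For readers unfamiliar with CLIP-style zero-shot classification, the core step is choosing the class whose text embedding is closest to the image embedding. The sketch below uses toy two-dimensional vectors and hypothetical function names; it does not call open_clip, and real embeddings would come from the model's image and text encoders.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def zero_shot_classify(image_embedding, class_embeddings):
    """Return the label whose text embedding is most similar to the image."""
    return max(
        class_embeddings,
        key=lambda label: cosine(image_embedding, class_embeddings[label]),
    )


# Toy embeddings standing in for encoder outputs.
classes = {"cat": [0.9, 0.2], "dog": [0.1, 1.0]}
print(zero_shot_classify([1.0, 0.1], classes))  # cat
```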

Task API

task_type is now the primary axis (generative vs. classification families); input_modalities lets a task require a capability (e.g. image) and reject models that lack it.
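The capability check described here amounts to comparing a task's required modalities against what a model supports. A minimal sketch, assuming modalities are plain strings and using a hypothetical check_modalities helper (Mill's own validation may be structured differently):

```python
def check_modalities(required, supported):
    """Raise if the model lacks any modality the task requires."""
    missing = [m for m in required if m not in supported]
    if missing:
        raise ValueError(f"model lacks required modalities: {missing}")


# A vision-language model passes a task that needs image and text input.
check_modalities(["image", "text"], {"image", "text"})

# A text-only model is rejected for the same task.
try:
    check_modalities(["image", "text"], {"text"})
except ValueError as err:
    print(err)
```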

Reproducibility

CLIP ViT-B-32/laion2b_s34b_b79k reproduces 93.56% zero-shot accuracy on CIFAR-10. See Reproducibility.

2026-06-20

alpha

v0.1.0

v0.1.0 — Initial alpha release

The first public release of Mill — a unified multi-modal evaluation framework that runs text, image, video, and audio benchmarks through one consistent interface.

Evaluation

mill eval — run a model on one or more tasks or benchmarks locally.
mill collect — aggregate results into a long-format table with bootstrap standard errors.
mill schedule — distribute a (models × tasks × n-shots) sweep across a SLURM cluster.
mill ls — interactive TUI browser for benchmarks and tasks.
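The bootstrap standard error that mill collect reports can be understood through a small sketch: resample the per-example scores with replacement, recompute the mean each time, and take the standard deviation of those means. The function name, the number of resamples, and the seed are assumptions for illustration, not Mill's defaults.

```python
import random
import statistics


def bootstrap_se(scores, n_resamples=1000, seed=0):
    """Estimate the standard error of the mean score by bootstrapping."""
    rng = random.Random(seed)
    n = len(scores)
    means = [
        statistics.fmean(rng.choices(scores, k=n))  # resample with replacement
        for _ in range(n_resamples)
    ]
    return statistics.stdev(means)


# 200 per-example correctness scores with 62.5% accuracy.
scores = [1, 0, 1, 1, 0, 1, 0, 1] * 25
print(f"accuracy = {statistics.fmean(scores):.3f}, SE = {bootstrap_se(scores):.3f}")
```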

Model backends

HuggingFace Transformers (text and multimodal), vLLM, and LiteLLM (OpenAI, Anthropic, and 100+ API providers).

Benchmarks

MMLU — 57 subjects, 5-shot log-probability scoring.
MMLU-Pro — 10-option, generative chain-of-thought with answer-letter extraction.
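Answer-letter extraction for a 10-option chain-of-thought task can be sketched with a regular expression that looks for the final "answer is (X)" statement. The pattern and function name below are assumptions for illustration; the source does not specify the pattern Mill uses.

```python
import re

# Match "answer is C" or "answer is (C)" for the ten options A through J.
# The phrase is case-insensitive; the letter must be uppercase and stand alone.
ANSWER_PATTERN = re.compile(r"(?i:answer is)\s*\(?([A-J])\b")


def extract_answer_letter(completion):
    """Return the last stated answer letter, or None if none is found."""
    matches = ANSWER_PATTERN.findall(completion)
    return matches[-1] if matches else None


print(extract_answer_letter("2 + 2 = 4, so the answer is (C)."))  # C
print(extract_answer_letter("I am not sure."))                    # None
```

Taking the last match rather than the first lets a model revise its answer during the chain of thought.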

Output & metrics

Feather output caching — completed (model, task, n-shot) jobs are skipped on re-run.
Composable metric registry with bootstrap confidence intervals.
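Skipping completed jobs on re-run can be pictured as a check for an existing output file per (model, task, n-shot) combination. The directory layout, file naming, and helper names below are hypothetical, and the sketch only touches an empty placeholder file rather than writing real Feather data.

```python
import tempfile
from itertools import product
from pathlib import Path


def output_path(root, model, task, n_shot):
    """Hypothetical location of the cached output for one job."""
    safe_model = model.replace("/", "__")  # keep model ids filesystem-safe
    return Path(root) / safe_model / f"{task}_{n_shot}shot.feather"


def pending_jobs(root, models, tasks, n_shots):
    """List the (model, task, n_shot) jobs that have no cached output yet."""
    return [
        job
        for job in product(models, tasks, n_shots)
        if not output_path(root, *job).exists()
    ]


with tempfile.TemporaryDirectory() as root:
    # Pretend one job already finished on an earlier run.
    done = output_path(root, "org/model-a", "mmlu", 5)
    done.parent.mkdir(parents=True)
    done.touch()
    remaining = pending_jobs(root, ["org/model-a"], ["mmlu", "mmlu_pro"], [0, 5])

print(len(remaining))  # 3
```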

Reproducibility

Validated MMLU for Qwen/Qwen3-0.6B-Base at 53.78% ± 1.53, within one standard error of the Qwen3 Technical Report’s 52.81% (a gap of 0.97 points). See Reproducibility.
