0.x versions.
Audio evaluation
Mill now evaluates audio models alongside text and image, through the same interface.
Benchmarks
- Clotho-AQA: single-word audio question answering, graded by case- and punctuation-insensitive exact match (mirrors the lmms-eval clotho_aqa_test setup). See Reproducibility → Clotho-AQA.
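As a rough illustration of case- and punctuation-insensitive exact match, the grading rule can be sketched as below. The function names are hypothetical, and the exact normalization used by Mill or lmms-eval may differ (for example in how Unicode punctuation is handled); this version only strips ASCII punctuation.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(answer: str) -> str:
    """Lower-case, drop ASCII punctuation, and collapse whitespace."""
    return " ".join(answer.lower().translate(_PUNCT_TABLE).split())


def exact_match(prediction: str, reference: str) -> bool:
    """Hypothetical grader: True when the normalized strings are identical."""
    return normalize(prediction) == normalize(reference)


print(exact_match("Yes.", "yes"))    # True
print(exact_match("Birds", "bird"))  # False
```

Exact match after normalization is strict by design: a plural or a synonym still counts as wrong, which suits single-word answers.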
Model backends
- The HuggingFace backend (hf) now loads audio-language models (e.g. Qwen/Qwen2-Audio-7B-Instruct): it resolves audio architectures the image auto-class doesn’t cover, infers the audio modality from the processor, and resamples/down-mixes waveforms to the model’s expected rate.
- Audio columns are decoded with soundfile/librosa instead of torchcodec, so audio tasks run without a system FFmpeg.
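To make the down-mix and resample step concrete, here is a minimal standard-library sketch. It is not the hf backend's code: the function names are hypothetical, and plain linear interpolation stands in for a proper band-limited resampler such as the one librosa provides, so it applies no anti-aliasing filter.

```python
def downmix_to_mono(channels: list[list[float]]) -> list[float]:
    """Average equally long channels sample by sample."""
    n = len(channels)
    return [sum(frame) / n for frame in zip(*channels)]


def resample_linear(samples: list[float], src_rate: int, dst_rate: int) -> list[float]:
    """Resample by linear interpolation (illustrative only, no anti-aliasing)."""
    if src_rate == dst_rate or not samples:
        return list(samples)
    out_len = max(1, round(len(samples) * dst_rate / src_rate))
    out = []
    for i in range(out_len):
        pos = i * src_rate / dst_rate        # fractional index into the source
        lo = min(int(pos), len(samples) - 1)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out


# Two channels at a toy rate of 4 Hz, reduced to mono at 2 Hz.
stereo = [[0.0, 0.5, 1.0, 1.5], [0.0, 0.0, 0.0, 0.0]]
mono = downmix_to_mono(stereo)
print(mono)                         # [0.0, 0.25, 0.5, 0.75]
print(resample_linear(mono, 4, 2))  # [0.0, 0.5]
```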
Task API
- Doc.audios now flows through the multimodal context builder; set input_modalities=["audio", "text"] to require an audio-capable model.
Vision evaluation
Mill now evaluates image models alongside text, through the same interface.
Benchmarks
- CIFAR-10 and ImageNet-1k — zero-shot image classification, with a generative multiple-choice rendering for vision-language models.
- MMMU-Pro (standard, 10 options): multimodal chain-of-thought MCQ, with a CLIP zero-shot rendering for image–text encoders (pick_variant_by_model).
Model backends
- open_clip (clip): CLIP-style zero-shot image classification / retrieval.
- timm: supervised vision classification over a fixed pretrained head.
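For readers unfamiliar with CLIP-style zero-shot classification, the idea is to embed the image and one text prompt per class, then pick the class whose prompt embedding is most similar to the image embedding. The sketch below uses toy 3-dimensional vectors in place of real encoder outputs; the function names are hypothetical and nothing here calls open_clip.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_emb, text_embs):
    """Return the label whose prompt embedding is closest to the image."""
    return max(text_embs, key=lambda label: cosine(image_emb, text_embs[label]))


# Toy embeddings standing in for encoded prompts such as "a photo of a cat".
text_embs = {
    "cat": [1.0, 0.0, 0.0],
    "dog": [0.0, 1.0, 0.0],
    "ship": [0.0, 0.0, 1.0],
}
print(zero_shot_classify([0.2, 0.9, 0.1], text_embs))  # dog
```

A timm classifier, by contrast, needs no prompts: its label set is fixed by the pretrained head.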
Task API
- task_type is now the primary axis (generative vs. classification families); input_modalities lets a task require a capability (e.g. image) and reject models that lack it.
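One way to picture the capability check is as a set difference between what a task requires and what a model supports. The classes below are hypothetical stand-ins, not Mill's actual Task or model types; only the field names task_type and input_modalities come from the notes above.

```python
from dataclasses import dataclass, field


@dataclass
class Task:  # hypothetical stand-in for a task definition
    name: str
    task_type: str
    input_modalities: list = field(default_factory=lambda: ["text"])


@dataclass
class Model:  # hypothetical stand-in for a loaded model
    name: str
    modalities: set


def check_compatible(task: Task, model: Model) -> None:
    """Raise ValueError if the model lacks a modality the task requires."""
    missing = set(task.input_modalities) - model.modalities
    if missing:
        raise ValueError(
            f"{model.name} lacks {sorted(missing)} required by {task.name}"
        )


audio_task = Task("audio_qa", "generative", ["audio", "text"])
check_compatible(audio_task, Model("audio-lm", {"audio", "text"}))  # passes
try:
    check_compatible(audio_task, Model("text-lm", {"text"}))
except ValueError as err:
    print(err)  # text-lm lacks ['audio'] required by audio_qa
```

Rejecting the pairing up front avoids spending compute on a run that could only fail or produce meaningless scores.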
Reproducibility
- CLIP ViT-B-32/laion2b_s34b_b79k reproduces 93.56% zero-shot accuracy on CIFAR-10. See Reproducibility.
v0.1.0 — Initial alpha release
The first public release of Mill, a unified multi-modal evaluation framework that runs text, image, video, and audio benchmarks through one consistent interface.
Evaluation
- mill eval: run a model on one or more tasks or benchmarks locally.
- mill collect: aggregate results into a long-format table with bootstrap standard errors.
- mill schedule: distribute a (models × tasks × n-shots) sweep across a SLURM cluster.
- mill ls: interactive TUI browser for benchmarks and tasks.
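The (models × tasks × n-shots) sweep is a Cartesian product, so its size grows multiplicatively. The sketch below enumerates such a grid and optionally drops jobs that are already complete; the function name, the job tuple layout, and the placeholder model and task names are assumptions for illustration, not how mill schedule is implemented.

```python
from itertools import product


def sweep_jobs(models, tasks, n_shots, done=frozenset()):
    """Enumerate every (model, task, n_shot) job not already in `done`."""
    return [job for job in product(models, tasks, n_shots) if job not in done]


models = ["model-a", "model-b"]   # placeholder names
tasks = ["task-x", "task-y"]
n_shots = [0, 5]

jobs = sweep_jobs(models, tasks, n_shots)
print(len(jobs))  # 8

# A job finished in an earlier run is left out of the next sweep.
remaining = sweep_jobs(models, tasks, n_shots, done={("model-a", "task-x", 5)})
print(len(remaining))  # 7
```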
Model backends
- HuggingFace Transformers (text and multimodal), vLLM, and LiteLLM (OpenAI, Anthropic, and 100+ API providers).
Benchmarks
- MMLU — 57 subjects, 5-shot log-probability scoring.
- MMLU-Pro — 10-option, generative chain-of-thought with answer-letter extraction.
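Answer-letter extraction from a chain-of-thought completion is commonly done with a regular expression. The pattern below is an assumption for illustration, not Mill's extractor: it looks for "answer is" or "answer:" followed by an upper-case letter from A to J (ten options) and keeps the last match, on the reasoning that a model may revise its answer while thinking.

```python
import re

# "answer is (C)" / "Answer: C". The prefix is case-insensitive, but the
# letter itself must be upper-case A-J and must not start a longer word.
_ANSWER_RE = re.compile(r"(?i:answer\s*(?:is|:)\s*)\(?([A-J])\)?(?![A-Za-z])")


def extract_answer_letter(completion: str):
    """Return the last stated answer letter, or None if none is found."""
    matches = _ANSWER_RE.findall(completion)
    return matches[-1] if matches else None


print(extract_answer_letter("Let's think step by step. The answer is (C)."))  # C
print(extract_answer_letter("The answer is A. Wait, the answer is (D)."))     # D
print(extract_answer_letter("I am not sure."))                                # None
```

Completions that never state a letter return None and would typically be scored as incorrect.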
Output & metrics
- Feather output caching: completed (model, task, n-shot) jobs are skipped on re-run.
- Composable metric registry with bootstrap confidence intervals.
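As a reminder of what a bootstrap interval involves, the sketch below resamples per-example scores with replacement, recomputes the mean each time, and reads the standard error and a 95% percentile interval off the resulting distribution. The function name, the resample count, and the choice of the percentile method are assumptions; Mill's metric registry may compute its intervals differently.

```python
import random
import statistics


def bootstrap_mean(scores, n_resamples=1000, seed=0):
    """Return (mean, standard error, (low, high) 95% percentile interval)."""
    rng = random.Random(seed)  # seeded so repeated runs give the same numbers
    n = len(scores)
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    std_err = statistics.stdev(means)
    low = means[int(0.025 * n_resamples)]
    high = means[int(0.975 * n_resamples) - 1]
    return statistics.fmean(scores), std_err, (low, high)


# 70 correct and 30 incorrect answers, scored 1 and 0.
scores = [1] * 70 + [0] * 30
mean, std_err, (low, high) = bootstrap_mean(scores)
print(f"accuracy={mean:.3f} se={std_err:.3f} ci=({low:.3f}, {high:.3f})")
```

For a plain accuracy like this, the bootstrap standard error should land close to the analytic value sqrt(p(1-p)/n), about 0.046 here; the bootstrap earns its keep on metrics without a simple closed form.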
Reproducibility
- Validated MMLU for Qwen/Qwen3-0.6B-Base at 53.78% ± 1.53, within one standard error of the Qwen3 Technical Report’s 52.81%. See Reproducibility.
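The "within one standard error" claim can be checked directly from the numbers quoted above, assuming the ± 1.53 is a standard error in percentage points. The helper name is hypothetical.

```python
def within_k_standard_errors(measured, reference, std_err, k=1.0):
    """True when |measured - reference| is at most k standard errors."""
    return abs(measured - reference) <= k * std_err


# Gap is about 0.97 percentage points, smaller than the 1.53 standard error.
print(within_k_standard_errors(53.78, 52.81, 1.53))  # True
```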