Mill is in alpha. Releases are tracked below, newest first; APIs and task configs may still change between 0.x versions.
2026-06-30
alpha
Audio evaluation

Mill now evaluates audio models alongside text and image, through the same interface.

Benchmarks

  • Clotho-AQA — single-word audio question answering, graded by case/punctuation-insensitive exact match (mirrors the lmms-eval clotho_aqa_test setup). See Reproducibility → Clotho-AQA.
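
As a rough illustration of what case/punctuation-insensitive exact match means, here is a minimal sketch. The helper names (normalize, exact_match) are hypothetical, and the exact normalization used by Mill or lmms-eval may differ, for example in how whitespace or non-ASCII punctuation is treated.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(answer: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    return " ".join(answer.lower().translate(_PUNCT_TABLE).split())


def exact_match(prediction: str, reference: str) -> bool:
    """Return True when both strings are identical after normalization."""
    return normalize(prediction) == normalize(reference)


print(exact_match("Yes.", "yes"))   # True
print(exact_match("A dog", "dog"))  # False
```

Because the match is exact after normalization, a longer answer that merely contains the reference word still scores as incorrect.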

Model backends

  • The HuggingFace backend (hf) now loads audio-language models (e.g. Qwen/Qwen2-Audio-7B-Instruct): it resolves audio architectures the image auto-class doesn’t cover, infers the audio modality from the processor, and resamples waveforms to the model’s expected sampling rate, down-mixing multi-channel audio where needed.
  • Audio columns are decoded with soundfile/librosa instead of torchcodec, so audio tasks run without a system FFmpeg.
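
To make the resample/down-mix step concrete, the sketch below uses only the standard library. It assumes the target is mono and substitutes plain linear interpolation for a proper band-limited resampler, so it is an illustration of the idea rather than what the hf backend does; both function names are hypothetical.

```python
from typing import List, Sequence


def downmix_to_mono(channels: Sequence[Sequence[float]]) -> List[float]:
    """Average equal-length channels sample by sample."""
    n = len(channels)
    return [sum(frame) / n for frame in zip(*channels)]


def resample_linear(samples: Sequence[float], src_rate: int, dst_rate: int) -> List[float]:
    """Resample by linear interpolation (no anti-aliasing filter)."""
    if src_rate == dst_rate or len(samples) < 2:
        return list(samples)
    out_len = int(len(samples) * dst_rate / src_rate)
    last = len(samples) - 1
    out = []
    for i in range(out_len):
        pos = i * src_rate / dst_rate      # fractional index into the source
        lo = min(int(pos), last)
        hi = min(lo + 1, last)             # clamp at the final sample
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out


stereo = [[0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 2.0, 3.0]]
mono = downmix_to_mono(stereo)
print(resample_linear(mono, src_rate=4, dst_rate=2))  # [0.0, 2.0]
```

In practice a filtered resampler such as librosa.resample is the safer choice when down-sampling, since linear interpolation alone can alias high frequencies.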

Task API

  • Doc.audios now flows through the multimodal context builder; set input_modalities=["audio", "text"] to require an audio-capable model.
2026-06-30
alpha
Vision benchmarks

Vision evaluation

Mill now evaluates image models alongside text, through the same interface.

Benchmarks

  • CIFAR-10 and ImageNet-1k — zero-shot image classification, with a generative multiple-choice rendering for vision-language models.
  • MMMU-Pro (standard, 10 options) — multimodal chain-of-thought MCQ, with a CLIP zero-shot rendering for image–text encoders.

Each vision benchmark ships two renderings of the same data and auto-selects the one your model supports (pick_variant_by_model).
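
The selection idea can be sketched as follows. This is not Mill's pick_variant_by_model; the Variant class, the pick_variant function, and the variant names are hypothetical and only show one plausible way to choose between two renderings based on what a model supports.

```python
from dataclasses import dataclass
from typing import Sequence, Set


@dataclass(frozen=True)
class Variant:
    name: str        # hypothetical rendering name
    task_type: str   # "generative" or "classification"


def pick_variant(variants: Sequence[Variant], supported_task_types: Set[str]) -> Variant:
    """Return the first rendering whose task type the model supports."""
    for variant in variants:
        if variant.task_type in supported_task_types:
            return variant
    raise ValueError("model supports none of the available renderings")


cifar10 = [
    Variant("cifar10_zero_shot", "classification"),
    Variant("cifar10_generative_mcq", "generative"),
]

# A vision-language model that only generates text gets the MCQ rendering.
print(pick_variant(cifar10, {"generative"}).name)  # cifar10_generative_mcq
```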

Model backends

  • open_clip (clip) — CLIP-style zero-shot image classification / retrieval.
  • timm — supervised vision classification over a fixed pretrained head.
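
For readers unfamiliar with CLIP-style zero-shot classification, the scoring step reduces to a cosine-similarity argmax between one image embedding and one text embedding per class. The sketch below uses toy vectors and hypothetical function names; in a real run the embeddings would come from the model's image and text encoders, with the class names typically wrapped in a prompt template.

```python
import math
from typing import Dict, List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_classify(image_emb: Sequence[float],
                       class_text_embs: Dict[str, Sequence[float]]) -> str:
    """Return the label whose text embedding is most similar to the image."""
    img = l2_normalize(image_emb)
    scores = {
        label: sum(a * b for a, b in zip(img, l2_normalize(emb)))
        for label, emb in class_text_embs.items()
    }
    return max(scores, key=scores.get)


# Toy 2-d embeddings standing in for encoder outputs.
text_embs = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
print(zero_shot_classify([0.9, 0.1], text_embs))  # cat
```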

Task API

  • task_type is now the primary axis (generative vs. classification families); input_modalities lets a task require a capability (e.g. image) and reject models that lack it.
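
A capability check of this kind can be pictured as a set comparison between what a task requires and what a model supports. The function below is a hypothetical sketch, not Mill's API; only the input_modalities name and the example modality strings come from these notes.

```python
from typing import Iterable


def check_modalities(required: Iterable[str], supported: Iterable[str]) -> None:
    """Raise if the model lacks any modality the task requires."""
    missing = sorted(set(required) - set(supported))
    if missing:
        raise ValueError(f"model lacks required modalities: {missing}")


# A task declaring input_modalities=["image", "text"] accepts this model...
check_modalities(["image", "text"], ["text", "image", "audio"])

# ...while a text-only model is rejected.
try:
    check_modalities(["image", "text"], ["text"])
except ValueError as err:
    print(err)  # model lacks required modalities: ['image']
```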

Reproducibility

  • CLIP ViT-B-32/laion2b_s34b_b79k reproduces 93.56% zero-shot on CIFAR-10. See Reproducibility.
2026-06-20
alpha
v0.1.0

v0.1.0 — Initial alpha release

The first public release of Mill — a unified multi-modal evaluation framework that runs text, image, video, and audio benchmarks through one consistent interface.

Evaluation

  • mill eval — run a model on one or more tasks or benchmarks locally.
  • mill collect — aggregate results into a long-format table with bootstrap standard errors.
  • mill schedule — distribute a (models × tasks × n-shots) sweep across a SLURM cluster.
  • mill ls — interactive TUI browser for benchmarks and tasks.
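
The size of a (models × tasks × n-shots) sweep is simply the Cartesian product of the three lists. The sketch below enumerates that grid and drops jobs that are already complete, mirroring the skip-on-re-run behaviour described under Output & metrics. The task identifiers and the contents of the completed set are hypothetical, and nothing here reflects how mill schedule submits work to SLURM.

```python
from itertools import product

models = ["Qwen/Qwen3-0.6B-Base"]
tasks = ["mmlu", "mmlu_pro"]      # hypothetical task identifiers
n_shots = [0, 5]

# Jobs finished in an earlier run (hypothetical cache contents).
completed = {("Qwen/Qwen3-0.6B-Base", "mmlu", 5)}

all_jobs = list(product(models, tasks, n_shots))
pending = [job for job in all_jobs if job not in completed]

print(len(all_jobs), len(pending))  # 4 3
```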

Model backends

  • HuggingFace Transformers (text and multimodal), vLLM, and LiteLLM (OpenAI, Anthropic, and 100+ API providers).

Benchmarks

  • MMLU — 57 subjects, 5-shot log-probability scoring.
  • MMLU-Pro — 10-option, generative chain-of-thought with answer-letter extraction.
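
The two scoring styles differ in what they read from the model. Log-probability scoring picks the option the model assigns the highest likelihood, while the generative style parses a letter out of free-form reasoning. The sketch below shows both under stated assumptions: the function names are hypothetical, and the "answer is (X)" pattern is a common convention rather than a confirmed detail of Mill's extractor.

```python
import re
from typing import Dict, Optional


def pick_by_logprob(option_logprobs: Dict[str, float]) -> str:
    """Return the option letter with the highest log-probability."""
    return max(option_logprobs, key=option_logprobs.get)


# Matches "answer is C" or "answer is (C)" for the ten letters A-J.
# The trailing \b stops ordinary words such as "Because" from matching.
_ANSWER_RE = re.compile(r"(?i:answer is)\s*\(?([A-J])\b")


def extract_letter(completion: str) -> Optional[str]:
    """Return the last answer letter stated in a completion, if any."""
    matches = _ANSWER_RE.findall(completion)
    return matches[-1] if matches else None


print(pick_by_logprob({"A": -2.3, "B": -0.4, "C": -1.9, "D": -3.1}))  # B
print(extract_letter("Step by step... so the answer is (C)."))        # C
```

Taking the last match rather than the first is a design choice: chain-of-thought output sometimes mentions candidate answers before settling on a final one.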

Output & metrics

  • Feather output caching — completed (model, task, n-shot) jobs are skipped on re-run.
  • Composable metric registry with bootstrap confidence intervals.
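
As a reference point for the bootstrap figures, here is a minimal sketch of a nonparametric bootstrap over per-example correctness, producing a standard error and a 95% percentile interval. The function name, the resample count, and the choice of a percentile interval are assumptions; Mill's metric registry may compute these differently.

```python
import random
import statistics
from typing import Sequence, Tuple


def bootstrap_accuracy(correct: Sequence[int],
                       n_resamples: int = 1000,
                       seed: int = 0) -> Tuple[float, float, Tuple[float, float]]:
    """Return (accuracy, bootstrap standard error, 95% percentile interval)."""
    rng = random.Random(seed)
    n = len(correct)
    means = []
    for _ in range(n_resamples):
        # Resample the examples with replacement and rescore.
        sample = rng.choices(correct, k=n)
        means.append(sum(sample) / n)
    means.sort()
    point = sum(correct) / n
    std_err = statistics.stdev(means)
    lower = means[int(0.025 * n_resamples)]
    upper = means[int(0.975 * n_resamples) - 1]
    return point, std_err, (lower, upper)


scores = [1] * 70 + [0] * 30   # 70 correct out of 100 examples
acc, se, (lo, hi) = bootstrap_accuracy(scores)
print(f"accuracy={acc:.2f} se={se:.3f} ci=({lo:.2f}, {hi:.2f})")
```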

Reproducibility

  • Validated MMLU for Qwen/Qwen3-0.6B-Base at 53.78% ± 1.53, within one standard error of the Qwen3 Technical Report’s 52.81%. See Reproducibility.
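
The "within one standard error" claim can be checked with simple arithmetic, assuming the ± 1.53 figure is one standard error expressed in percentage points:

```python
mill_acc = 53.78   # Mill's MMLU accuracy (%), from the note above
mill_se = 1.53     # assumed: one standard error, in percentage points
reported = 52.81   # figure cited from the Qwen3 Technical Report (%)

gap = abs(mill_acc - reported)
print(f"gap = {gap:.2f} pp, within one SE: {gap <= mill_se}")
# gap = 0.97 pp, within one SE: True
```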