# Mill

> Unified multi-modal evaluation framework for text, image, video, and audio benchmarks.

## Docs

- [Changelog](https://pymill.com/docs/changelog.md): New features, improvements, and fixes in Mill — newest first.
- [Add a benchmark](https://pymill.com/docs/contributing/add-a-benchmark.md): Port a new benchmark into Mill — guided, validated, and documented.
- [Add a model backend](https://pymill.com/docs/contributing/add-a-model.md): Wire a new model backend into Mill — pick the interface, register, document.
- [Distributed Scheduling](https://pymill.com/docs/guides/distributed.md): Scale evaluations across a SLURM cluster with `mill schedule`.
- [Text Evaluation](https://pymill.com/docs/guides/text-evaluation.md): Run text benchmarks with local HF models, vLLM, or API backends.
- [Vision Evaluation](https://pymill.com/docs/guides/vision-evaluation.md): Evaluate multimodal models on image and video benchmarks.
- [Installation](https://pymill.com/docs/installation.md): Install Mill and optional backend extras.
- [Introduction](https://pymill.com/docs/introduction.md): Mill — a unified multi-modal evaluation framework for text, image, video, and audio benchmarks.
- [Quickstart](https://pymill.com/docs/quickstart.md): Run your first evaluation in under five minutes.
- [CLI Reference](https://pymill.com/docs/reference/cli.md): Complete flag documentation for the `mill` command.
- [Models](https://pymill.com/docs/reference/models.md): Configure HuggingFace, vLLM, and API model backends.
- [Output Types](https://pymill.com/docs/reference/output-types.md): The three OutputType values and how Mill queries the model for each.
- [Tasks](https://pymill.com/docs/reference/tasks.md): Built-in benchmarks and how to write custom tasks.
- [CIFAR-10](https://pymill.com/docs/reproducibility/cifar10.md): CIFAR-10 zero-shot image classification reproduced with Mill (CLIP and vision-language models).
- [Clotho-AQA](https://pymill.com/docs/reproducibility/clotho_aqa.md): Clotho-AQA single-word audio question answering reproduced with Mill.
- [ImageNet](https://pymill.com/docs/reproducibility/imagenet.md): ImageNet-1k reproduced with Mill — zero-shot image classification (CLIP) and generative MCQ (VLMs).
- [MMLU](https://pymill.com/docs/reproducibility/mmlu.md): MMLU reproduced with Mill using Qwen3-0.6B-Base, compared against the Qwen3 Technical Report.
- [MMLU-Pro](https://pymill.com/docs/reproducibility/mmlu-pro.md): MMLU-Pro reproduced with Mill — generative chain-of-thought, 10-option multiple choice.
- [MMMU-Pro](https://pymill.com/docs/reproducibility/mmmu_pro.md): MMMU-Pro (standard, 10 options) multimodal multiple-choice reproduced with Mill.
- [Overview](https://pymill.com/docs/reproducibility/overview.md): How Mill reproduces published benchmark numbers, and a template for adding new ones.