
Prerequisites

  • Python 3.10+
  • A GPU (for local HF/vLLM models) or an API key (for LiteLLM backends)
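
The prerequisites above can be checked with a short script before installing. This is a minimal sketch, not part of Mill: the function name `check_prerequisites` is hypothetical, the presence of `nvidia-smi` on the PATH is only a rough proxy for a usable NVIDIA GPU, and the environment variable names are assumed examples of provider keys that LiteLLM commonly reads. Adjust them for the backend you actually use.

```python
import os
import shutil
import sys

# Assumption: example provider key variables; replace with the ones your backend needs.
API_KEY_VARS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY")


def check_prerequisites(environ=None, version=None):
    """Return a dict of rough prerequisite checks (hypothetical helper, not a Mill API)."""
    environ = os.environ if environ is None else environ
    version = sys.version_info if version is None else version
    return {
        # Python 3.10 or newer is required.
        "python_ok": tuple(version[:2]) >= (3, 10),
        # Heuristic only: nvidia-smi on PATH suggests an NVIDIA driver is installed.
        "gpu_tool_found": shutil.which("nvidia-smi") is not None,
        # True if any of the assumed API key variables is set and non-empty.
        "api_key_found": any(environ.get(name) for name in API_KEY_VARS),
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'yes' if ok else 'no'}")
```

You need `python_ok` plus at least one of the other two checks, depending on whether you plan to run local models or API-backed ones.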

1. Install

Install straight from GitHub — no clone required:
pip install "mill-eval @ git+https://github.com/haideraltahan/Mill.git"

2. Run a text evaluation

mill eval "meta-llama/Meta-Llama-3-8B-Instruct[dtype=bfloat16,batch_size=8]" mmlu \
  --output_dir ./results
Mill streams progress to your terminal and writes a Feather file to ./results/ when done.
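
If you launch evaluations from a script, the command above can be assembled programmatically. The sketch below only mirrors the syntax shown in this step (a model name followed by bracketed `key=value` options, then the task name and `--output_dir`). The helpers `model_spec` and `eval_command` are hypothetical names, not part of Mill, and returning a bare model name when no options are given is an assumption rather than documented behavior.

```python
import shlex


def model_spec(name, **options):
    """Format 'name[key=value,...]' as in the command shown above (hypothetical helper)."""
    if not options:
        # Assumption: a model name without bracketed options is passed through unchanged.
        return name
    joined = ",".join(f"{key}={value}" for key, value in options.items())
    return f"{name}[{joined}]"


def eval_command(model, task, output_dir):
    """Build the argument list for the `mill eval` invocation shown above."""
    return ["mill", "eval", model, task, "--output_dir", output_dir]


if __name__ == "__main__":
    spec = model_spec(
        "meta-llama/Meta-Llama-3-8B-Instruct", dtype="bfloat16", batch_size=8
    )
    # Print a shell-quoted command line; pass the list to subprocess.run to execute it.
    print(shlex.join(eval_command(spec, "mmlu", "./results")))
```

Building an argument list rather than a single string avoids shell-quoting problems with the square brackets in the model specification.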

3. View results

mill --output_dir ./results collect --metric acc
The collect command renders a table of scores in your terminal. Pass --metric to choose which metric to show — MMLU reports acc:
Mill results — performance (acc)
┌─────────────────────────────────────┬────────┐
│ model                               │ mmlu   │
├─────────────────────────────────────┼────────┤
│ meta-llama/Meta-Llama-3-8B-Instruct │ 0.6398 │
└─────────────────────────────────────┴────────┘
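
To inspect the raw results rather than the rendered table, the Feather output can be loaded directly. This is a sketch under stated assumptions: it assumes the result files use a `.feather` extension somewhere under the output directory, and it makes no claim about their column layout, so it only prints each file's shape and column names. `find_feather_files` is a hypothetical helper; `pandas.read_feather` requires `pyarrow` to be installed.

```python
from pathlib import Path


def find_feather_files(output_dir):
    """Return Feather files under output_dir, sorted by path.

    Assumption: result files carry a .feather extension.
    """
    return sorted(Path(output_dir).rglob("*.feather"))


if __name__ == "__main__":
    import pandas as pd  # reading Feather files also requires pyarrow

    for path in find_feather_files("./results"):
        frame = pd.read_feather(path)
        # The schema is not documented here, so just report what each file contains.
        print(path, frame.shape, list(frame.columns))
```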

4. Browse available tasks

mill ls
This opens a full-screen TUI browser. Use ↑ ↓ to navigate, Tab to switch between Benchmarks and Tasks, and Enter to copy a task name to your clipboard.

Next steps

  • Text evaluation guide: few-shot, custom metrics, and n-shot sweeps.
  • Vision evaluation guide: multimodal models with image/video inputs.
  • Distributed scheduling: scale across a SLURM cluster.
  • CLI reference: full flag documentation.