CLI Reference

Global flags

These flags apply to all subcommands and must be placed before the subcommand name:

mill [GLOBAL FLAGS] <subcommand> [SUBCOMMAND FLAGS]

Flag	Default	Description
`--output_dir`	`./mill_results`	Directory where Feather result files are written
`--cache_dir`	`~/.cache/mill`	Directory for `clusters.yaml`, SLURM job CSVs, and logs
`--limit`	—	Cap samples per task (useful for smoke tests)

mill eval

Run evaluation locally on one or more models.

mill [GLOBAL] eval <models> <tasks> [--task_paths DIR,...] [--seed N]

Argument	Default	Description
`models`	required	Comma-separated model spec(s). Each spec is a HF model ID (`meta-llama/Meta-Llama-3-8B-Instruct`), a backend name (`hf`, `vllm`, `clip`, `litellm`, `timm`), or a path to a Python config file — optionally with inline args in brackets
`tasks`	required	Comma-separated task or benchmark names (`mmlu,mmlu_pro`)
`--task_paths`	—	Comma-separated extra directories to scan for custom tasks
`--seed`	`42`	Seed for all randomness (option shuffles, few-shot sampling, random-guess fallbacks)

Inline model args

Pass model arguments inline in brackets, key=value separated by commas — there is no --model_args flag. Quote the spec so your shell doesn’t expand the brackets:

"clip[path=ViT-B-32,pretrained=laion2b_s34b_b79k]"
"Qwen/Qwen3-0.6B-Base[dtype=bfloat16,batch_size=8]"
"litellm[model=gpt-4o]"

A list-valued arg (e.g. modalities) can’t be expressed inline — use a Python config file for those.

Examples

# Local HF model
mill --output_dir ./results eval "meta-llama/Meta-Llama-3-8B-Instruct[dtype=bfloat16,batch_size=8]" mmlu

# Python config file (instruction-tuned model + chain-of-thought benchmark)
mill --output_dir ./results eval mill/models/configs/qwen/qwen2_5_7b_instruct.py mmlu_pro

# CLIP zero-shot vision benchmark
mill --output_dir ./results eval "clip[path=ViT-B-32,pretrained=laion2b_s34b_b79k]" cifar10

# API model — generative tasks only (no log-prob), so use mmlu_pro not mmlu
mill --output_dir ./results eval "litellm[model=gpt-4o]" mmlu_pro

# Smoke test — first 50 samples only
mill --output_dir ./results --limit 50 eval meta-llama/Meta-Llama-3-8B-Instruct mmlu

mill schedule

Generate and submit a SLURM job array.

mill [GLOBAL] schedule <models> <tasks> [FLAGS]

Argument	Default	Description
`models`	required	Comma-separated model IDs or list
`tasks`	required	Comma-separated task names
`--n_shots`	`0`	Comma-separated n-shot values to sweep
`--cluster`	`auto`	Cluster name from `clusters.yaml`, or `auto` for hostname detection
`--local`	false	Run all jobs sequentially (no SLURM)
`--dry_run`	false	Print job table without submitting
`--venv_path`	—	Virtual environment to activate in SLURM workers
`--extra_task_paths`	—	Extra task directories for SLURM workers
`--minutes_per_eval`	`0`	Walltime budget per `(model, task, n_shot)` eval (sizes the array and time limit). `0` = built-in default; raise it for heavy generative tasks like `mmlu_pro`

mill collect

Display aggregated performance and check for missing jobs. Results are read from the long-format aggregate.csv (one row per model, task, n_shot, metric with performance and stderr columns), so scores compare across any benchmark.

mill [GLOBAL] collect [--models M,...] [--tasks T,...] [--metric METRIC] [--n_shots N,...]

Flag	Default	Description
`--models`	all	Filter to specific model(s)
`--tasks`	all	Filter to specific task(s)
`--metric`	all	Restrict the table to one metric name (e.g. `acc`, `exact_match`); default shows every metric’s `performance`
`--check`	true	Print missing `(model, task, n_shot)` combinations
`--n_shots`	`0`	n-shot values to check for completeness

mill ls

Open the interactive TUI browser for benchmarks and tasks.

mill ls

Key	Action
`↑` / `↓`	Navigate list
`Tab` / `Shift+Tab` / `←` / `→`	Switch tabs
`Shift+↑` / `Shift+↓`	Scroll detail panel
Type	Filter (fuzzy search)
`Enter`	Copy selected name to clipboard and exit
`Escape` / `Ctrl+C`	Exit

​Global flags

​mill eval

​Inline model args

​Examples

​mill schedule

​mill collect

​mill ls