Global flags
These flags apply to all subcommands and must be placed before the subcommand name:

| Flag | Default | Description |
|---|---|---|
| --output_dir | ./mill_results | Directory where Feather result files are written |
| --cache_dir | ~/.cache/mill | Directory for clusters.yaml, SLURM job CSVs, and logs |
| --limit | — | Cap on samples per task (useful for smoke tests) |
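
Because global flags must come before the subcommand, scripts that assemble mill invocations need to order their arguments accordingly. The sketch below shows one way to do that in Python. The helper name `build_mill_argv` is hypothetical, and the sketch assumes each flag takes its value as a separate token (`--flag value`), which the table above does not confirm.

```python
import shlex


def build_mill_argv(subcommand, positionals, global_flags=None, sub_flags=None):
    """Assemble an argv list with global flags placed before the subcommand.

    Hypothetical helper for illustration; not part of mill itself.
    Assumes flags are passed as separate `--name value` tokens.
    """
    argv = ["mill"]
    # Global flags go first, ahead of the subcommand name.
    for name, value in (global_flags or {}).items():
        argv += [f"--{name}", str(value)]
    argv.append(subcommand)
    argv += list(positionals)
    # Subcommand-specific flags follow the positional arguments.
    for name, value in (sub_flags or {}).items():
        argv += [f"--{name}", str(value)]
    return argv


argv = build_mill_argv(
    "eval",
    ["meta-llama/Meta-Llama-3-8B-Instruct", "mmlu,mmlu_pro"],
    global_flags={"output_dir": "./mill_results", "limit": 10},
    sub_flags={"seed": 42},
)
# shlex.join renders the list as a single shell-safe command line.
print(shlex.join(argv))
```

The resulting list can also be passed directly to `subprocess.run`, which avoids shell quoting issues altogether.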
mill eval
Run evaluation locally on one or more models.

| Argument | Default | Description |
|---|---|---|
| models | required | Comma-separated model spec(s). Each spec is an HF model ID (meta-llama/Meta-Llama-3-8B-Instruct), a backend name (hf, vllm, clip, litellm, timm), or a path to a Python config file, optionally followed by inline args in brackets |
| tasks | required | Comma-separated task or benchmark names (mmlu,mmlu_pro) |
| --task_paths | — | Comma-separated extra directories to scan for custom tasks |
| --seed | 42 | Seed for all randomness (option shuffles, few-shot sampling, random-guess fallbacks) |
Inline model args
Pass model arguments inline in brackets as key=value pairs separated by commas; there is no --model_args flag. Quote the spec so your shell doesn’t expand the brackets.
Values that can’t be expressed inline (such as modalities) belong in a Python config file.
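
To illustrate the bracket syntax and why quoting matters, here is a small Python sketch that formats a spec and shell-quotes it. The helper `format_spec` and the argument names `dtype` and `batch_size` are hypothetical, chosen only for illustration. Check the backend you use for the arguments it actually accepts.

```python
import shlex


def format_spec(model, **model_args):
    """Render `model[key=value,...]`, or just `model` when no args are given.

    Hypothetical helper; the argument names used below are placeholders.
    """
    if not model_args:
        return model
    inline = ",".join(f"{key}={value}" for key, value in model_args.items())
    return f"{model}[{inline}]"


spec = format_spec(
    "meta-llama/Meta-Llama-3-8B-Instruct", dtype="bfloat16", batch_size=8
)
# shlex.quote wraps the spec in single quotes because `[` and `]` are
# glob characters that an unquoted shell word could expand.
print("mill eval", shlex.quote(spec), "mmlu")
```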
Examples
mill schedule
Generate and submit a SLURM job array.

| Argument | Default | Description |
|---|---|---|
| models | required | Comma-separated model IDs or a list |
| tasks | required | Comma-separated task names |
| --n_shots | 0 | Comma-separated n-shot values to sweep |
| --cluster | auto | Cluster name from clusters.yaml, or auto for hostname detection |
| --local | false | Run all jobs sequentially (no SLURM) |
| --dry_run | false | Print job table without submitting |
| --venv_path | — | Virtual environment to activate in SLURM workers |
| --extra_task_paths | — | Extra task directories for SLURM workers |
| --minutes_per_eval | 0 | Walltime budget in minutes per (model, task, n_shot) eval; used to size the job array and its time limit. 0 uses the built-in default; raise it for heavy generative tasks like mmlu_pro |
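
Since the walltime budget applies per (model, task, n_shot) combination, the number of evals in a sweep is the product of the three list lengths. A rough Python sketch of that arithmetic follows. The function name, the model names, and the 30-minute figure are illustrative assumptions, and how mill actually packs evals into array jobs is not specified here.

```python
from itertools import product


def sweep_grid(models, tasks, n_shots):
    """Enumerate every (model, task, n_shot) combination in a sweep.

    Each argument is a comma-separated string, mirroring the CLI arguments.
    """
    return list(
        product(
            models.split(","),
            tasks.split(","),
            (int(n) for n in n_shots.split(",")),
        )
    )


grid = sweep_grid("model-a,model-b", "mmlu,mmlu_pro", "0,5")
minutes_per_eval = 30  # illustrative value for --minutes_per_eval
total_minutes = len(grid) * minutes_per_eval
print(f"{len(grid)} evals, {total_minutes} minutes if run back to back")
```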
mill collect
Display aggregated performance and check for missing jobs. Results are read from the long-format aggregate.csv (one row per model, task, n_shot, and metric, with performance and stderr columns), so scores can be compared across any benchmark.

| Flag | Default | Description |
|---|---|---|
| --models | all | Filter to specific model(s) |
| --tasks | all | Filter to specific task(s) |
| --metric | all | Restrict the table to one metric name (e.g. acc, exact_match); by default, every metric’s performance is shown |
| --check | true | Print missing (model, task, n_shot) combinations |
| --n_shots | 0 | n-shot values to check for completeness |
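
The completeness check can be pictured as a set difference between the expected (model, task, n_shot) grid and the combinations present in aggregate.csv. The Python sketch below assumes the column names `model`, `task`, `n_shot`, `metric`, `performance`, and `stderr`, based on the description above. The rows are placeholder values, not real results, and this is not mill's actual implementation.

```python
import csv
import io
from itertools import product

# Placeholder rows in the assumed long format (not real results).
AGGREGATE_CSV = """model,task,n_shot,metric,performance,stderr
model-a,mmlu,0,acc,0.50,0.01
model-a,mmlu,5,acc,0.55,0.01
model-b,mmlu,0,acc,0.60,0.01
"""


def missing_combinations(csv_text, models, tasks, n_shots):
    """Return expected (model, task, n_shot) triples absent from the CSV."""
    rows = csv.DictReader(io.StringIO(csv_text))
    present = {(r["model"], r["task"], int(r["n_shot"])) for r in rows}
    expected = set(product(models, tasks, n_shots))
    return sorted(expected - present)


missing = missing_combinations(
    AGGREGATE_CSV, ["model-a", "model-b"], ["mmlu"], [0, 5]
)
print(missing)
```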
mill ls
Open the interactive TUI browser for benchmarks and tasks.

| Key | Action |
|---|---|
| ↑ / ↓ | Navigate list |
| Tab / Shift+Tab / ← / → | Switch tabs |
| Shift+↑ / Shift+↓ | Scroll detail panel |
| Type | Filter (fuzzy search) |
| Enter | Copy selected name to clipboard and exit |
| Escape / Ctrl+C | Exit |