Global flags

These flags apply to all subcommands and must be placed before the subcommand name:
mill [GLOBAL FLAGS] <subcommand> [SUBCOMMAND FLAGS]
| Flag | Default | Description |
|---|---|---|
| --output_dir | ./mill_results | Directory where Feather result files are written |
| --cache_dir | ~/.cache/mill | Directory for clusters.yaml, SLURM job CSVs, and logs |
| --limit | | Cap samples per task (useful for smoke tests) |
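The ordering rule above (global flags first, then the subcommand, then its own arguments) can be sketched as a small command builder. This is only an illustration for scripting around the CLI: `build_mill_command` is a hypothetical helper, not part of mill, and it uses only flags listed on this page.

```python
import shlex


def build_mill_command(subcommand, positional, global_flags=None, subcommand_flags=None):
    """Assemble argv as: mill [GLOBAL FLAGS] <subcommand> <positional> [SUBCOMMAND FLAGS].

    Hypothetical helper for illustration; mill itself does not ship this function.
    """
    argv = ["mill"]
    # Global flags must come before the subcommand name.
    for name, value in (global_flags or {}).items():
        argv += [f"--{name}", str(value)]
    argv.append(subcommand)
    argv += list(positional)
    # Subcommand-specific flags follow the positional arguments.
    for name, value in (subcommand_flags or {}).items():
        argv += [f"--{name}", str(value)]
    return argv


argv = build_mill_command(
    "eval",
    ["meta-llama/Meta-Llama-3-8B-Instruct", "mmlu"],
    global_flags={"output_dir": "./results", "limit": 50},
    subcommand_flags={"seed": 42},
)
print(shlex.join(argv))
```

The resulting list can be passed to `subprocess.run(argv, check=True)` if mill is installed; building a list rather than a string avoids shell quoting problems with bracketed model specs.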

mill eval

Run evaluation locally on one or more models.
mill [GLOBAL] eval <models> <tasks> [--task_paths DIR,...] [--seed N]
| Argument | Default | Description |
|---|---|---|
| models | required | Comma-separated model spec(s). Each spec is a HF model ID (meta-llama/Meta-Llama-3-8B-Instruct), a backend name (hf, vllm, clip, litellm, timm), or a path to a Python config file, optionally with inline args in brackets |
| tasks | required | Comma-separated task or benchmark names (mmlu,mmlu_pro) |
| --task_paths | | Comma-separated extra directories to scan for custom tasks |
| --seed | 42 | Seed for all randomness (option shuffles, few-shot sampling, random-guess fallbacks) |

Inline model args

Pass model arguments inline in brackets as comma-separated key=value pairs; there is no --model_args flag. Quote the spec so your shell doesn’t expand the brackets:
"clip[path=ViT-B-32,pretrained=laion2b_s34b_b79k]"
"Qwen/Qwen3-0.6B-Base[dtype=bfloat16,batch_size=8]"
"litellm[model=gpt-4o]"
List-valued args (e.g. modalities) can’t be expressed inline; use a Python config file for those.
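To make the bracket syntax concrete, here is a minimal sketch of how a spec such as `name[key=value,...]` could be split into a name and an argument dictionary. It is an assumption about the format, not mill's actual parser: `parse_model_spec` is a hypothetical function, and all values are kept as strings. It also shows why a list value is ambiguous inline, since the comma already separates arguments.

```python
def parse_model_spec(spec):
    """Split 'name[key=value,...]' into (name, {key: value}).

    Illustrative only; not mill's real parser. Values are left as strings.
    """
    if not spec.endswith("]") or "[" not in spec:
        return spec, {}  # plain model ID, backend name, or config path
    name, _, arg_text = spec[:-1].partition("[")
    args = {}
    for pair in arg_text.split(","):
        key, sep, value = pair.partition("=")
        if not sep:
            # A bare token such as 'text' in 'modalities=image,text' lands here.
            raise ValueError(f"expected key=value, got {pair!r}")
        args[key.strip()] = value.strip()
    return name, args


print(parse_model_spec("litellm[model=gpt-4o]"))  # ('litellm', {'model': 'gpt-4o'})
print(parse_model_spec("Qwen/Qwen3-0.6B-Base[dtype=bfloat16,batch_size=8]"))
```

In this sketch, `parse_model_spec("m[modalities=image,text]")` raises `ValueError`, because the second comma-separated token has no `=`.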

Examples

# Local HF model
mill --output_dir ./results eval "meta-llama/Meta-Llama-3-8B-Instruct[dtype=bfloat16,batch_size=8]" mmlu

# Python config file (instruction-tuned model + chain-of-thought benchmark)
mill --output_dir ./results eval mill/models/configs/qwen/qwen2_5_7b_instruct.py mmlu_pro

# CLIP zero-shot vision benchmark
mill --output_dir ./results eval "clip[path=ViT-B-32,pretrained=laion2b_s34b_b79k]" cifar10

# API model — generative tasks only (no log-prob), so use mmlu_pro not mmlu
mill --output_dir ./results eval "litellm[model=gpt-4o]" mmlu_pro

# Smoke test — first 50 samples only
mill --output_dir ./results --limit 50 eval meta-llama/Meta-Llama-3-8B-Instruct mmlu

mill schedule

Generate and submit a SLURM job array.
mill [GLOBAL] schedule <models> <tasks> [FLAGS]
| Argument | Default | Description |
|---|---|---|
| models | required | Comma-separated model IDs or list |
| tasks | required | Comma-separated task names |
| --n_shots | 0 | Comma-separated n-shot values to sweep |
| --cluster | auto | Cluster name from clusters.yaml, or auto for hostname detection |
| --local | false | Run all jobs sequentially (no SLURM) |
| --dry_run | false | Print job table without submitting |
| --venv_path | | Virtual environment to activate in SLURM workers |
| --extra_task_paths | | Extra task directories for SLURM workers |
| --minutes_per_eval | 0 | Walltime budget per (model, task, n_shot) eval (sizes the array and time limit). 0 = built-in default; raise it for heavy generative tasks like mmlu_pro |
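Since the walltime budget is defined per (model, task, n_shot) eval, it helps to estimate how many evals a sweep produces before submitting. The sketch below assumes the job table is the full Cartesian product of the three comma-separated lists, which is a plausible reading of the flags rather than confirmed behavior; `job_table` and the model names are hypothetical. Running the real command with --dry_run shows the actual table.

```python
from itertools import product


def job_table(models, tasks, n_shots="0"):
    """Expand comma-separated strings into one (model, task, n_shot) row per eval.

    Assumes a full Cartesian product and plain model IDs (no bracketed inline
    args, whose commas would be split incorrectly here).
    """
    return [
        (model, task, int(n_shot))
        for model, task, n_shot in product(
            models.split(","), tasks.split(","), n_shots.split(",")
        )
    ]


# Hypothetical model IDs, for illustration only.
jobs = job_table("org/model-a,org/model-b", "mmlu,mmlu_pro", n_shots="0,5")
print(len(jobs))  # 2 models x 2 tasks x 2 n-shot values = 8
for row in jobs[:3]:
    print(row)
```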

mill collect

Display aggregated performance and check for missing jobs. Results are read from the long-format aggregate.csv (one row per model, task, n_shot, metric with performance and stderr columns), so scores can be compared across any benchmark.
mill [GLOBAL] collect [--models M,...] [--tasks T,...] [--metric METRIC] [--n_shots N,...]
| Flag | Default | Description |
|---|---|---|
| --models | all | Filter to specific model(s) |
| --tasks | all | Filter to specific task(s) |
| --metric | all | Restrict the table to one metric name (e.g. acc, exact_match); default shows every metric’s performance |
| --check | true | Print missing (model, task, n_shot) combinations |
| --n_shots | 0 | n-shot values to check for completeness |
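The long format and the completeness check can be illustrated with a short, standard-library sketch. The header names below are assumed from the column description above and may differ from the real aggregate.csv; the rows are made-up sample data, and `load_rows` and `missing_combinations` are hypothetical helpers, not mill functions.

```python
import csv
import io
from itertools import product

# Hypothetical aggregate.csv contents; header names and values are assumptions.
SAMPLE = """model,task,n_shot,metric,performance,stderr
org/model-a,mmlu,0,acc,0.61,0.004
org/model-a,mmlu_pro,0,exact_match,0.33,0.005
org/model-b,mmlu,0,acc,0.58,0.004
"""


def load_rows(text, metric=None):
    """Read long-format rows, optionally keeping a single metric (like --metric)."""
    rows = list(csv.DictReader(io.StringIO(text)))
    return [r for r in rows if metric is None or r["metric"] == metric]


def missing_combinations(rows, models, tasks, n_shots):
    """Return expected (model, task, n_shot) triples that have no result row."""
    done = {(r["model"], r["task"], int(r["n_shot"])) for r in rows}
    return [combo for combo in product(models, tasks, n_shots) if combo not in done]


rows = load_rows(SAMPLE)
missing = missing_combinations(
    rows, ["org/model-a", "org/model-b"], ["mmlu", "mmlu_pro"], [0]
)
print(missing)  # [('org/model-b', 'mmlu_pro', 0)]
```

Because every metric is its own row, filtering by metric is a simple row filter and no benchmark needs a dedicated column.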

mill ls

Open the interactive TUI browser for benchmarks and tasks.
mill ls
| Key | Action |
|---|---|
| ↑ / ↓ | Navigate list |
| Tab / Shift+Tab / ← / → | Switch tabs |
| Shift+↑ / Shift+↓ | Scroll detail panel |
| Type | Filter (fuzzy search) |
| Enter | Copy selected name to clipboard and exit |
| Escape / Ctrl+C | Exit |