How it works
mill schedule generates a job matrix (models × tasks × n_shots), filters out already-completed jobs from the output cache, then submits a SLURM array job. Each array worker evaluates one (model, task, n_shot) combination and writes results to the shared output_dir.
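The matrix-then-filter step described above can be sketched in a few lines of Python. This is only an illustration, not Mill's actual implementation: the `result_path` layout (one JSON file per combination), the function names, and the model and task names are all assumptions made for the example.

```python
from itertools import product
from pathlib import Path
import tempfile


def result_path(output_dir: Path, model: str, task: str, n_shot: int) -> Path:
    # Hypothetical cache layout: one JSON file per (model, task, n_shot).
    # Mill's real output layout may differ.
    return output_dir / model.replace("/", "__") / f"{task}_{n_shot}shot.json"


def pending_jobs(models, tasks, n_shots, output_dir: Path):
    """Build the full job matrix, then drop jobs whose result file exists."""
    matrix = product(models, tasks, n_shots)
    return [job for job in matrix if not result_path(output_dir, *job).exists()]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp)
        # Pretend one combination has already been evaluated.
        done = result_path(out, "org/model-a", "arc_easy", 0)
        done.parent.mkdir(parents=True)
        done.write_text("{}")

        jobs = pending_jobs(["org/model-a"], ["arc_easy", "hellaswag"], [0, 5], out)
        # Each remaining job would map to one SLURM array index.
        for array_index, job in enumerate(jobs):
            print(array_index, job)
```

Because the filter only looks at what is already in the output directory, rerunning the same sweep after a partial failure would resubmit just the missing combinations.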
Cluster configuration
On first run, Mill copies its bundled clusters.yaml to ~/.cache/mill/clusters.yaml. Edit it to match your cluster:
Adjust --minutes_per_eval as needed (raise it for heavy generative tasks like mmlu_pro):
Use --cache_dir to point Mill at a different location:
Basic usage
The basic form is mill schedule <models> <tasks>. Passing the mmlu benchmark expands to its 57 subject tasks, so with a single model this sweep is 57 tasks × 2 n-shot values = 114 jobs.
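The job count can be checked with a small sketch of benchmark expansion. The `BENCHMARKS` registry, the placeholder subject names, and the n-shot values 0 and 5 are assumptions for illustration; Mill's real task lookup and the actual mmlu subject names are not shown here.

```python
# Hypothetical benchmark registry with placeholder subtask names.
# Mill's real task lookup may work differently.
BENCHMARKS = {"mmlu": [f"mmlu_subject_{i:02d}" for i in range(57)]}


def expand_tasks(names):
    """Replace each benchmark name with its subtasks; keep plain tasks as-is."""
    expanded = []
    for name in names:
        expanded.extend(BENCHMARKS.get(name, [name]))
    return expanded


def count_jobs(models, task_names, n_shots):
    """Size of the models × tasks × n_shots matrix after expansion."""
    return len(models) * len(expand_tasks(task_names)) * len(n_shots)


# One model, the mmlu benchmark, two n-shot values.
print(count_jobs(["model-a"], ["mmlu"], [0, 5]))  # 114
```

Adding a second model to the same sweep would double the matrix to 228 jobs, which is worth keeping in mind when sizing the array.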
Selecting the cluster
Dry run — preview without submitting
Local sequential run
Skip SLURM and run all jobs in the current process, which is useful for debugging:
Virtual environments in SLURM jobs
mill eval.
Custom task paths in SLURM workers
If your tasks live outside the Mill package, pass extra directories so the worker can discover them:
Checking completion
After the job array finishes:
The --check flag (default on) lists any missing (model, task, n_shot) combinations so you can resubmit stragglers.