Local HuggingFace model
The default backend loads models via transformers.AutoModelForCausalLM.
Model args are passed in brackets (key=value, comma-separated); quote the spec so your shell doesn’t expand the brackets.
Model args reference
| Key | Type | Default | Description |
|---|---|---|---|
| dtype | str | bfloat16 | bfloat16 / float16 / float32 |
| device_map | str | auto | auto / cuda / cpu |
| batch_size | int | auto | Samples per forward pass (auto-estimated if unset) |
| max_context_length | int | 4096 | Token budget |
| attn_implementation | str | — | flash_attention_2 / sdpa |
| trust_remote_code | bool | true | Allow custom model code |
| use_chat_template | bool | false | Wrap prompts with the tokenizer chat template |
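To make the key=value format and the defaults in the table concrete, here is a minimal Python sketch of how such a spec could be parsed and type-coerced. It is an illustration only and not Mill's actual parser: the function name `parse_model_args` and the error behaviour are assumptions, while the keys, types, and defaults are copied from the table above.

```python
# Illustrative sketch only; this is NOT Mill's real parser.
# Keys, types, and defaults mirror the reference table above.
DEFAULTS = {
    "dtype": "bfloat16",
    "device_map": "auto",
    "batch_size": "auto",          # int when set explicitly
    "max_context_length": 4096,
    "attn_implementation": None,   # no default listed in the table
    "trust_remote_code": True,
    "use_chat_template": False,
}
INT_KEYS = {"max_context_length"}
BOOL_KEYS = {"trust_remote_code", "use_chat_template"}


def parse_model_args(spec: str) -> dict:
    """Parse 'key=value,key=value' into a dict, starting from the defaults."""
    args = dict(DEFAULTS)
    for pair in (p.strip() for p in spec.split(",")):
        if not pair:
            continue  # tolerate empty specs and trailing commas
        key, sep, raw = pair.partition("=")
        key, raw = key.strip(), raw.strip()
        if not sep or key not in DEFAULTS:
            raise ValueError(f"unrecognised model arg: {pair!r}")
        if key in BOOL_KEYS:
            if raw.lower() not in ("true", "false"):
                raise ValueError(f"{key} expects true/false, got {raw!r}")
            args[key] = raw.lower() == "true"
        elif key in INT_KEYS or (key == "batch_size" and raw != "auto"):
            args[key] = int(raw)
        else:
            args[key] = raw
    return args


parsed = parse_model_args("dtype=float16,max_context_length=2048")
print(parsed["dtype"], parsed["max_context_length"])
```

Keys that are not supplied keep their table defaults, so an empty spec yields the default configuration.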
vLLM backend
For faster throughput on large models, use the vLLM backend (requires pip install -e ".[vllm]").
API backend (LiteLLM)
Run any model accessible via an OpenAI-compatible API. API models support generative tasks only, so use mmlu_pro (chain-of-thought), not the log-prob mmlu.
Set OPENAI_API_KEY / ANTHROPIC_API_KEY in your environment before running.
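A quick pre-flight check can catch a missing key before a long run starts. The Python sketch below is a hypothetical helper, not part of Mill; which variable you actually need depends on the provider you are calling.

```python
import os


def missing_api_keys(required, env=None):
    """Return the names in `required` that are unset or empty in the environment."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Example: check only the key for the provider you intend to use.
missing = missing_api_keys(["OPENAI_API_KEY"])
if missing:
    print("Set these variables before running:", ", ".join(missing))
```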
Few-shot evaluation
mill eval runs each task at its built-in default n_shots value (MMLU defaults to 5, MMLU-Pro to 0). Few-shot examples are pulled from the task’s designated few-shot split (e.g. MMLU’s dev split).
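As a rough picture of what n-shot prompting means, the Python sketch below prepends the first n examples from a few-shot split to the test question. The prompt template, the field names (`question`, `answer`), and the function name are assumptions made for illustration; Mill's real task templates may differ.

```python
def build_few_shot_prompt(dev_examples, question, n_shots):
    """Prepend the first `n_shots` solved examples to an unanswered question."""
    shots = dev_examples[:n_shots]
    blocks = [
        f"Question: {ex['question']}\nAnswer: {ex['answer']}" for ex in shots
    ]
    # The final block is left open for the model to complete.
    blocks.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(blocks)


# Hypothetical few-shot split with two solved examples.
dev = [
    {"question": "2 + 2 = ?", "answer": "4"},
    {"question": "Capital of France?", "answer": "Paris"},
]
print(build_few_shot_prompt(dev, "3 + 5 = ?", n_shots=2))
```

With `n_shots=0` the prompt contains only the test question, which corresponds to the zero-shot setting.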
mill eval has no n-shot flag; it always uses the task’s configured default. To sweep multiple n-shot values, use mill schedule with --n_shots 0,5, which launches one job per value.
Limiting samples (smoke tests)
Pass --limit to cap the number of samples evaluated per task, which is useful for quick iteration.
Caching
Mill writes results to Apache Feather files in output_dir. On a re-run, completed (model, task, n_shot) jobs are automatically skipped, so you only pay for new work.
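The skip logic can be pictured as a set lookup keyed on (model, task, n_shot). The Python sketch below is a simplified model of that idea, not Mill's implementation: it assumes the completed keys have already been collected from the files in output_dir, and the function name and example model name are hypothetical.

```python
from itertools import product


def pending_jobs(models, tasks, n_shots, completed):
    """Expand every (model, task, n_shot) combination and drop finished ones."""
    return [
        job for job in product(models, tasks, n_shots) if job not in completed
    ]


# Suppose a previous run already finished the 5-shot MMLU job.
completed = {("my-model", "mmlu", 5)}
todo = pending_jobs(["my-model"], ["mmlu"], [0, 5], completed)
print(todo)  # [('my-model', 'mmlu', 0)]
```

Because the key includes the n-shot value, sweeping an additional n-shot setting later only schedules the combinations that have no results yet.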