
Local HuggingFace model

The default backend loads models via transformers.AutoModelForCausalLM.
mill eval "meta-llama/Meta-Llama-3-8B-Instruct[dtype=bfloat16,batch_size=8]" mmlu,mmlu_pro \
  --output_dir ./results
Model args go inline in brackets (key=value, comma-separated); quote the spec so your shell doesn’t expand the brackets.
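
When specs are assembled in a script, it can help to build the bracketed string in one place. The sketch below is a hypothetical helper, not part of mill: `model_spec` simply joins a model name and any key=value pairs into the format described above.

```shell
# Hypothetical helper (not part of mill): join a model name and key=value
# arguments into the bracketed spec format, e.g. name[key=value,key=value].
# The function body runs in a subshell so the IFS change does not leak.
model_spec() (
  model="$1"
  shift
  if [ "$#" -eq 0 ]; then
    # No arguments: the bare model name is already a valid spec.
    printf '%s\n' "$model"
  else
    IFS=,
    # "$*" joins the remaining arguments with the first character of IFS.
    printf '%s[%s]\n' "$model" "$*"
  fi
)

spec="$(model_spec meta-llama/Meta-Llama-3-8B-Instruct dtype=bfloat16 batch_size=8)"
echo "$spec"

# Keep the variable double-quoted when passing it on, so the shell
# does not treat [...] as a glob pattern:
#   mill eval "$spec" mmlu,mmlu_pro --output_dir ./results
```

Quoting `"$spec"` at the call site has the same effect as quoting the literal spec in the command above.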

Model args reference

| Key | Type | Default | Description |
|---|---|---|---|
| dtype | str | bfloat16 | bfloat16 / float16 / float32 |
| device_map | str | auto | auto / cuda / cpu |
| batch_size | int | auto | Samples per forward pass (auto-estimated if unset) |
| max_context_length | int | 4096 | Token budget |
| attn_implementation | str | | flash_attention_2 / sdpa |
| trust_remote_code | bool | true | Allow custom model code |
| use_chat_template | bool | false | Wrap prompts with the tokenizer chat template |

vLLM backend

For faster throughput on large models, use the vLLM backend (requires pip install -e ".[vllm]"):
mill --output_dir ./results eval \
  "vllm[path=meta-llama/Meta-Llama-3-8B-Instruct,dtype=bfloat16]" mmlu

API backend (LiteLLM)

Run any model accessible via an OpenAI-compatible API through LiteLLM. API models support generative tasks only, so use mmlu_pro (chain-of-thought) rather than mmlu, which is scored from log-probabilities:
# OpenAI
mill eval "litellm[model=gpt-4o]" mmlu_pro \
  --output_dir ./results

# Anthropic
mill eval "litellm[model=claude-3-5-sonnet-20241022]" mmlu_pro \
  --output_dir ./results
Set OPENAI_API_KEY / ANTHROPIC_API_KEY in your environment before running.
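
A missing key typically surfaces only once the first request fails, so a pre-flight check can save a wasted launch. The following is a minimal sketch of such a guard; `require_env` is a hypothetical helper, not something mill provides.

```shell
# Hypothetical pre-flight check (not part of mill): return non-zero when
# any of the named environment variables is unset or empty.
# The function body runs in a subshell so its variables do not leak.
require_env() (
  for name in "$@"; do
    # Read the variable whose name is stored in $name (POSIX-compatible).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "error: $name is not set" >&2
      return 1
    fi
  done
  return 0
)

# Usage: guard the run on the key the chosen provider needs.
#   require_env OPENAI_API_KEY &&
#     mill eval "litellm[model=gpt-4o]" mmlu_pro --output_dir ./results
```

Because the helper uses `eval`, it should only be called with fixed variable names, never with untrusted input.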

Few-shot evaluation

mill eval runs each task at its built-in default n_shots value (MMLU defaults to 5, MMLU-Pro to 0). Few-shot examples are pulled from the task’s designated few-shot split (e.g. MMLU’s dev split):
mill --output_dir ./results eval "meta-llama/Meta-Llama-3-8B-Instruct[batch_size=4]" mmlu
mill eval has no n-shot flag; it always uses the task’s configured default. To sweep multiple n-shot values, use mill schedule with --n_shots 0,5, which launches one job per value.
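
To make "one job per value" concrete, the sketch below enumerates the jobs a sweep would expand to. It is an illustration only: `list_jobs` is a hypothetical function that does not call mill, and it assumes the sweep is the full cross product of tasks and n-shot values for a single model.

```shell
# Illustration only (does not call mill): print one line per
# (model, task, n_shot) combination, assuming a full cross product.
# The function body runs in a subshell so IFS and set -f do not leak.
list_jobs() (
  model="$1"
  tasks="$2"   # comma-separated, e.g. mmlu,mmlu_pro
  shots="$3"   # comma-separated, e.g. 0,5
  set -f       # disable globbing while splitting the unquoted lists
  IFS=,
  for task in $tasks; do
    for n in $shots; do
      printf '%s\t%s\t%s\n' "$model" "$task" "$n"
    done
  done
)

list_jobs meta-llama/Meta-Llama-3-8B-Instruct mmlu,mmlu_pro 0,5
```

With two tasks and two n-shot values this prints four tab-separated lines, one per job.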

Limiting samples (smoke tests)

Pass --limit to cap the number of samples evaluated per task, which is useful for quick iteration:
mill --limit 50 eval meta-llama/Meta-Llama-3-8B-Instruct mmlu \
  --output_dir ./results

Caching

Mill writes results to Apache Arrow Feather files in output_dir. On a re-run, completed (model, task, n_shot) jobs are skipped automatically, so you only pay for new work.

Viewing results

# All results
mill --output_dir ./results collect

# Filter to a specific metric
mill --output_dir ./results collect --metric acc

# Check for missing jobs in a sweep
mill --output_dir ./results collect \
  --models meta-llama/Meta-Llama-3-8B-Instruct \
  --tasks mmlu,mmlu_pro \
  --n_shots 0,5