Text Evaluation

Local HuggingFace model

The default backend loads models via transformers.AutoModelForCausalLM.

mill eval "meta-llama/Meta-Llama-3-8B-Instruct[dtype=bfloat16,batch_size=8]" mmlu,mmlu_pro \
  --output_dir ./results

Model args go inline in brackets (key=value, comma-separated); quote the spec so your shell doesn’t expand the brackets.
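The quoting rule can be seen without Mill at all. The sketch below uses a scratch directory and a made-up file name (both purely illustrative) to show how an unquoted bracket expression is treated as a glob character class. In zsh the unquoted form is stricter still: a bracket pattern that matches nothing aborts the command with "no matches found".

```shell
# Scratch directory containing one file whose name matches the pattern model[ab].
demo_dir=$(mktemp -d)
touch "$demo_dir/modela"

# Unquoted: the shell reads [ab] as a character class and substitutes the match.
unquoted=$(cd "$demo_dir" && echo model[ab])

# Quoted: the brackets reach the program untouched.
quoted=$(cd "$demo_dir" && echo "model[ab]")

echo "unquoted: $unquoted"   # unquoted: modela
echo "quoted:   $quoted"     # quoted:   model[ab]

rm -rf "$demo_dir"
```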

Model args reference

Key	Type	Default	Description
`dtype`	str	`bfloat16`	`bfloat16` / `float16` / `float32`
`device_map`	str	`auto`	`auto` / `cuda` / `cpu`
`batch_size`	int	auto	Samples per forward pass (auto-estimated if unset)
`max_context_length`	int	4096	Token budget
`attn_implementation`	str	—	`flash_attention_2` / `sdpa`
`trust_remote_code`	bool	`true`	Allow custom model code
`use_chat_template`	bool	`false`	Wrap prompts with the tokenizer chat template
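When a spec carries several keys, it can be easier to assemble it in a script than to type it inline. The helper below is a hypothetical convenience, not part of Mill: `make_spec` joins its key=value arguments with commas and wraps them in brackets. The values chosen are only examples built from keys in the table above.

```shell
# Hypothetical helper (not a Mill command): build "model[key=value,...]".
# The body runs in a subshell, so IFS and the positional shift do not leak.
make_spec() (
  model=$1
  shift
  IFS=,
  # "$*" joins the remaining arguments with the first character of IFS.
  printf '%s[%s]\n' "$model" "$*"
)

spec=$(make_spec meta-llama/Meta-Llama-3-8B-Instruct \
  dtype=float16 attn_implementation=sdpa max_context_length=8192)

echo "$spec"

# The variable must still be quoted when passed on, for example:
#   mill eval "$spec" mmlu --output_dir ./results
```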

vLLM backend

For faster throughput on large models, use the vLLM backend (requires pip install -e ".[vllm]"):

mill --output_dir ./results eval \
  "vllm[path=meta-llama/Meta-Llama-3-8B-Instruct,dtype=bfloat16]" mmlu

API backend (LiteLLM)

Run any hosted model that LiteLLM can reach, including OpenAI and Anthropic. API models support generative tasks only, so use mmlu_pro (chain-of-thought generation) rather than mmlu, which is scored from log-probabilities:

# OpenAI
mill eval "litellm[model=gpt-4o]" mmlu_pro \
  --output_dir ./results

# Anthropic
mill eval "litellm[model=claude-3-5-sonnet-20241022]" mmlu_pro \
  --output_dir ./results

Set OPENAI_API_KEY / ANTHROPIC_API_KEY in your environment before running.
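A missing key usually surfaces only once the first request fails, so a small pre-flight check can save a wasted launch. The function below is a hypothetical helper, not something Mill provides: it prints the name of every listed variable that is unset or empty. Check only the variables for the provider you actually use.

```shell
# Hypothetical pre-flight helper: print each named variable that is unset or empty.
# The body runs in a subshell so the scratch variable "value" does not leak.
missing_keys() (
  for name in "$@"; do
    # Indirect lookup of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    [ -n "$value" ] || printf '%s\n' "$name"
  done
)

missing=$(missing_keys OPENAI_API_KEY ANTHROPIC_API_KEY)
if [ -n "$missing" ]; then
  # A real launch script would likely stop here instead of continuing.
  echo "not set: $(echo $missing)" >&2
fi
```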

Few-shot evaluation

mill eval runs each task at its built-in default n_shots value (MMLU defaults to 5, MMLU-Pro to 0). Few-shot examples are pulled from the task’s designated few-shot split (e.g. MMLU’s dev split):

mill --output_dir ./results eval "meta-llama/Meta-Llama-3-8B-Instruct[batch_size=4]" mmlu

mill eval has no n-shot flag. To sweep several n-shot values, use mill schedule with --n_shots 0,5, which launches one job per value.

Limiting samples (smoke tests)

Pass --limit to cap the number of samples evaluated per task — useful for quick iteration:

mill --limit 50 eval meta-llama/Meta-Llama-3-8B-Instruct mmlu \
  --output_dir ./results

Caching

Mill writes results to Feather files (the Apache Arrow file format) in output_dir. On a re-run, every (model, task, n_shot) job that already completed is skipped, so only new work is computed.
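The skip-on-re-run behaviour can be pictured with a toy version of the idea. Everything in this sketch is an assumption made for illustration: the marker-file naming, the `job_path` and `run_job` helpers, and the `org/model` name are invented, and Mill's actual file layout may differ. The point is only that a job keyed by (model, task, n_shot) runs once and is skipped afterwards.

```shell
# Toy results directory; Mill's real output_dir layout is not modelled here.
results_dir=$(mktemp -d)

# Map a (model, task, n_shot) key to a flat file name (slashes replaced).
job_path() {
  printf '%s/%s__%s__%s.done\n' "$results_dir" \
    "$(printf '%s' "$1" | tr '/' '_')" "$2" "$3"
}

# Run a job only if no result exists for its key yet.
run_job() (
  path=$(job_path "$1" "$2" "$3")
  if [ -e "$path" ]; then
    printf 'skip %s n_shot=%s\n' "$2" "$3"
  else
    printf 'run %s n_shot=%s\n' "$2" "$3"
    : > "$path"   # stand-in for writing the real results file
  fi
)

first=$(run_job org/model mmlu 5)       # first attempt does the work
second=$(run_job org/model mmlu 5)      # identical key is skipped
third=$(run_job org/model mmlu_pro 0)   # a different key still runs

printf '%s\n' "$first" "$second" "$third"
rm -rf "$results_dir"
```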

Viewing results

# All results
mill --output_dir ./results collect

# Filter to a specific metric
mill --output_dir ./results collect --metric acc

# Check for missing jobs in a sweep
mill --output_dir ./results collect \
  --models meta-llama/Meta-Llama-3-8B-Instruct \
  --tasks mmlu,mmlu_pro \
  --n_shots 0,5
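To sanity-check the size of a sweep before looking for gaps, it can help to enumerate the grid yourself. The sketch below assumes that a sweep covers every combination of the comma-separated models, tasks, and n-shot values; `expected_jobs` is a hypothetical helper, not a Mill command.

```shell
models="meta-llama/Meta-Llama-3-8B-Instruct"
tasks="mmlu,mmlu_pro"
n_shots="0,5"

# Print one tab-separated line per (model, task, n_shot) combination.
# Runs in a subshell so the IFS and globbing changes stay local.
expected_jobs() (
  IFS=,
  set -f   # keep list items from being expanded as glob patterns
  for model in $1; do
    for task in $2; do
      for shots in $3; do
        printf '%s\t%s\t%s\n' "$model" "$task" "$shots"
      done
    done
  done
)

grid=$(expected_jobs "$models" "$tasks" "$n_shots")
count=$(printf '%s\n' "$grid" | wc -l | tr -d ' ')

printf '%s\n' "$grid"
echo "expected jobs: $count"   # 1 model x 2 tasks x 2 n-shot values = 4
```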
