# Text Evaluation

> Run text benchmarks with local HF models, vLLM, or API backends.

## Local HuggingFace model

The default backend loads models via `transformers.AutoModelForCausalLM`.

```bash theme={null}
mill eval "meta-llama/Meta-Llama-3-8B-Instruct[dtype=bfloat16,batch_size=8]" mmlu,mmlu_pro \
  --output_dir ./results
```

Model args go inline in brackets (`key=value`, comma-separated); quote the spec so your shell doesn't expand the brackets.
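To see why the quotes matter, here is a small shell demonstration that involves no `mill` commands. It assumes a POSIX-style shell with default globbing; the file name `modela` and the pattern `model[ab]` are made up for illustration.

```shell
# Work in a throwaway directory so the demo cannot match real files.
demo_dir="$(mktemp -d)"

# Create a file whose name matches the glob pattern model[ab].
touch "$demo_dir/modela"

# Unquoted: the shell treats [ab] as a character class and substitutes the match.
unquoted="$(cd "$demo_dir" && echo model[ab])"

# Quoted: the brackets reach the program unchanged.
quoted="$(cd "$demo_dir" && echo "model[ab]")"

echo "unquoted: $unquoted"   # unquoted: modela
echo "quoted:   $quoted"     # quoted:   model[ab]

rm -r "$demo_dir"
```

In zsh, an unquoted pattern with no matching file aborts the command with a `no matches found` error by default, so quoting is worth doing even when no such file exists.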

### Model args reference

| Key                   | Type | Default    | Description                                        |
| --------------------- | ---- | ---------- | -------------------------------------------------- |
| `dtype`               | str  | `bfloat16` | `bfloat16` / `float16` / `float32`                 |
| `device_map`          | str  | `auto`     | `auto` / `cuda` / `cpu`                            |
| `batch_size`          | int  | auto       | Samples per forward pass (auto-estimated if unset) |
| `max_context_length`  | int  | 4096       | Maximum context length in tokens                   |
| `attn_implementation` | str  | —          | `flash_attention_2` / `sdpa`                       |
| `trust_remote_code`   | bool | `true`     | Allow custom model code                            |
| `use_chat_template`   | bool | `false`    | Wrap prompts with the tokenizer chat template      |
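
Several keys from the table can be combined in one spec. The sketch below uses a hypothetical helper, `build_spec` (not part of mill), to assemble the `model[key=value,...]` string from separate arguments, which can be handy in scripts that vary one setting at a time.

```shell
# Hypothetical helper, not part of mill: join key=value pairs into a model spec.
# The body runs in a subshell (parentheses) so changing IFS does not leak out.
build_spec() (
  model="$1"
  shift
  if [ "$#" -eq 0 ]; then
    # No model args: the bare model name is the whole spec.
    printf '%s\n' "$model"
  else
    IFS=,
    # "$*" joins the remaining arguments with the first character of IFS.
    printf '%s[%s]\n' "$model" "$*"
  fi
)

spec="$(build_spec meta-llama/Meta-Llama-3-8B-Instruct dtype=float16 batch_size=4 use_chat_template=true)"
echo "$spec"
# meta-llama/Meta-Llama-3-8B-Instruct[dtype=float16,batch_size=4,use_chat_template=true]
```

Pass the result quoted, for example `mill eval "$spec" mmlu --output_dir ./results`.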

## vLLM backend

For higher throughput on large models, use the vLLM backend (requires `pip install -e ".[vllm]"`):

```bash theme={null}
mill --output_dir ./results eval \
  "vllm[path=meta-llama/Meta-Llama-3-8B-Instruct,dtype=bfloat16]" mmlu
```

## API backend (LiteLLM)

Run any hosted model that LiteLLM can reach, such as OpenAI or Anthropic models. API models support **generative tasks only**, so use `mmlu_pro` (chain-of-thought) rather than `mmlu`, which is scored from log-probabilities:

```bash theme={null}
# OpenAI
mill eval "litellm[model=gpt-4o]" mmlu_pro \
  --output_dir ./results

# Anthropic
mill eval "litellm[model=claude-3-5-sonnet-20241022]" mmlu_pro \
  --output_dir ./results
```

Set `OPENAI_API_KEY` / `ANTHROPIC_API_KEY` in your environment before running.
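
A missing key is easier to diagnose before a run starts than partway through it. The sketch below defines a hypothetical helper, `require_env` (not part of mill), that fails early when a variable is empty or unset; the commented lines show one way it might precede the OpenAI example above.

```shell
# Hypothetical helper, not part of mill: succeed only if the named variable is non-empty.
require_env() {
  # eval performs the indirect lookup portably (bash users could use ${!1} instead).
  eval "value=\${$1:-}"
  if [ -z "$value" ]; then
    echo "error: $1 is not set" >&2
    return 1
  fi
}

# Example use before an OpenAI-backed run (uncomment in a real script):
# require_env OPENAI_API_KEY || exit 1
# mill eval "litellm[model=gpt-4o]" mmlu_pro --output_dir ./results
```

Only pass trusted, literal variable names to the helper, since `eval` interprets its argument as shell code.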

## Few-shot evaluation

`mill eval` runs each task at its built-in default `n_shots` value (MMLU defaults to 5, MMLU-Pro to 0). Few-shot examples are pulled from the task's designated few-shot split (e.g. MMLU's `dev` split):

```bash theme={null}
mill --output_dir ./results eval "meta-llama/Meta-Llama-3-8B-Instruct[batch_size=4]" mmlu
```

<Note>
  `mill eval` has no n-shot flag — it always uses the task's configured default. To sweep multiple n-shot values, use [`mill schedule`](/docs/guides/distributed) with `--n_shots 0,5`, which launches one job per value.
</Note>

## Limiting samples (smoke tests)

Pass `--limit` to cap the number of samples evaluated per task — useful for quick iteration:

```bash theme={null}
mill --limit 50 eval meta-llama/Meta-Llama-3-8B-Instruct mmlu \
  --output_dir ./results
```
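
One way to use `--limit` in practice is to gate a full run on a quick smoke test. The sketch below wraps the two command forms shown on this page in a hypothetical function, `smoke_then_full` (not part of mill); the directory names are placeholders. Writing the smoke results to a separate directory is a cautious assumption, since this page does not say whether cached results from a limited run would cause the full run to be skipped.

```shell
# Hypothetical wrapper, not part of mill: run a 50-sample smoke test first,
# and start the full evaluation only if the smoke test exits successfully.
smoke_then_full() {
  model="$1"
  tasks="$2"

  # Smoke test: separate output directory (an assumption, see the note above).
  mill --limit 50 eval "$model" "$tasks" --output_dir ./results-smoke || return 1

  # Full run: same model and tasks, no sample cap.
  mill eval "$model" "$tasks" --output_dir ./results
}

# Example (uncomment to run):
# smoke_then_full "meta-llama/Meta-Llama-3-8B-Instruct[batch_size=8]" mmlu
```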

## Caching

Mill writes results to [Apache Feather](https://arrow.apache.org/docs/python/feather.html) files in `output_dir`. On a re-run, completed `(model, task, n_shot)` jobs are automatically skipped — you only pay for new work.

## Viewing results

```bash theme={null}
# All results
mill --output_dir ./results collect

# Filter to a specific metric
mill --output_dir ./results collect --metric acc

# Check for missing jobs in a sweep
mill --output_dir ./results collect \
  --models meta-llama/Meta-Llama-3-8B-Instruct \
  --tasks mmlu,mmlu_pro \
  --n_shots 0,5
```