Built-in benchmarks
Mill is in alpha. Six benchmarks ship today, spanning text, vision, and audio. Add your own by defining custom tasks or by following the Contributing guide.
| Benchmark | Task name(s) | Modality | Task type | Default n-shot | Metric |
|---|---|---|---|---|---|
| mmlu | mmlu_<subject> (57 subjects) | Text | MULTIPLE_CHOICE (log-prob) | 5 | acc |
| mmlu_pro | mmlu_pro | Text | MULTIPLE_CHOICE (generative CoT) | 0 | mmlu_pro_acc |
| cifar10 | cifar10, cifar10_mcq | Image | zero-shot / generative MCQ | 0 | acc / cifar10_mcq_acc |
| imagenet | imagenet, imagenet_mcq | Image | zero-shot / generative MCQ | 0 | acc / imagenet_mcq_acc |
| mmmu_pro | mmmu_pro, mmmu_pro_clip | Image + Text | generative CoT / zero-shot | 0 | mmmu_pro_acc / acc |
| clotho_aqa | clotho_aqa | Audio + Text | GENERATIVE_QA (single word) | 0 | clotho_aqa_exact_match |
Pass either the benchmark name (mmlu) or an individual task (mmlu_abstract_algebra) to mill eval. Browse everything interactively with mill ls.
The vision benchmarks ship in two renderings of the same data. The benchmark sets pick_variant_by_model=True, so Mill automatically runs the rendering your model supports: the zero-shot task for CLIP-style encoders (image↔text similarity) and the generative multiple-choice task for vision-language models (which answer with a letter). You pass the benchmark name; Mill picks the variant.
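The selection rule described above can be sketched in a few lines. This is a minimal illustration, not Mill's implementation: TaskVariant, pick_variant, and the task-type strings assigned to each CIFAR-10 variant are assumptions made for the example.

```python
from dataclasses import dataclass


# Hypothetical stand-in; Mill's real task objects and selection logic may differ.
@dataclass(frozen=True)
class TaskVariant:
    name: str
    task_type: str


def pick_variant(variants, supported_task_types):
    """Return the first variant whose task_type the model supports."""
    for variant in variants:
        if variant.task_type in supported_task_types:
            return variant
    raise ValueError("model supports none of the benchmark's variants")


# Assumed mapping of the two cifar10 renderings to task types.
CIFAR10_VARIANTS = [
    TaskVariant("cifar10", "ZERO_SHOT_CLASSIFICATION"),
    TaskVariant("cifar10_mcq", "MULTIPLE_CHOICE"),
]

clip_choice = pick_variant(CIFAR10_VARIANTS, {"ZERO_SHOT_CLASSIFICATION"})
vlm_choice = pick_variant(CIFAR10_VARIANTS, {"MULTIPLE_CHOICE", "GENERATIVE_QA"})
print(clip_choice.name)  # cifar10
print(vlm_choice.name)   # cifar10_mcq
```

Because exactly one variant runs per model, the benchmark reports a single score per model rather than an aggregate over both renderings.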
Task types
task_type is the primary axis of a task — it declares what the task asks and decides which model interface serves it. output_type is a secondary scoring detail of the generative family only.
| task_type | Served by | Notes |
|---|---|---|
| MULTIPLE_CHOICE | LLM / VLM | Pick one option. Scored by generated answer letter (output_type=GENERATIVE) or per-choice log-probability (output_type=LOGPROBS). |
| GENERATIVE_QA | LLM / VLM | Free-text generation; a scorer extracts/grades the answer. |
| PERPLEXITY | LLM | Rolling log-likelihood over a sequence. |
| ZERO_SHOT_CLASSIFICATION | CLIP-style encoder | Score an input against candidate text labels by embedding similarity. |
| SUPERVISED_CLASSIFICATION | timm / fixed-head | Predict over a fixed pretrained label set (e.g. ImageNet-1k). |
The matching output types for the generative family:
| output_type | When to use |
|---|---|
| GENERATIVE | Free-text generation: the model writes an answer and you extract it |
| LOGPROBS | Multiple-choice: rank answer options by log-probability |
| PERPLEXITY | Measure rolling log-probability of a document |
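To make the three output types concrete, the sketch below shows one plausible way each could be turned into a number. The function names, the answer-extraction regex, and the use of natural-log token probabilities are assumptions for illustration; Mill's actual scorers may differ.

```python
import math
import re


def score_logprobs(choice_logprobs, target_index):
    """LOGPROBS: predict the choice with the highest log-probability."""
    predicted = max(range(len(choice_logprobs)), key=lambda i: choice_logprobs[i])
    return int(predicted == target_index)


def score_generative(generation, gold_letter):
    """GENERATIVE: extract an answer letter from free text, then compare."""
    # Assumed answer format; a real scorer would likely handle more phrasings.
    match = re.search(r"answer is \(?([A-D])\)?", generation)
    return int(match is not None and match.group(1) == gold_letter)


def perplexity(token_logprobs):
    """PERPLEXITY: exponentiated negative mean token log-probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


print(score_logprobs([-2.3, -0.4, -3.1, -1.9], target_index=1))        # 1
print(score_generative("Step by step... so the answer is (C).", "C"))  # 1
```

The log-probability path never generates text, which is why it suits few-shot base models, while the generative path depends on the model following the expected answer format.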
MillTaskConfig fields
Define a task by creating a MillTaskConfig and exporting it in a TASKS_TABLE list:
    from mill.api.instance import OutputType
    from mill.api.metrics import get_metric
    from mill.api.task import Doc, MillTaskConfig
    from mill.api.taxonomy import TaskType


    def my_prompt(row: dict) -> Doc:
        return Doc(
            query=f"Answer the following: {row['question']}",
            choices=["A", "B", "C", "D"],
            target_index=row["answer"],
        )


    my_task = MillTaskConfig(
        name="my_task",
        version=1,
        hf_repo="my-org/my-dataset",
        hf_subset="default",
        hf_avail_splits=["train", "test"],
        evaluation_splits=["test"],
        few_shots_split="train",
        prompt_function=my_prompt,
        task_type=TaskType.MULTIPLE_CHOICE,
        output_type=OutputType.LOGPROBS,
        n_shots=5,
        metrics=[get_metric("acc")],
        description="My custom task description.",
        categories=["reasoning"],
        capabilities=["domain knowledge"],
        paper_url="https://arxiv.org/abs/0000.00000",
        approx_num_samples={"test": 1000},
    )

    TASKS_TABLE = [my_task]
Key fields
| Field | Type | Description |
|---|---|---|
| name | str | Unique task name used in CLI args |
| hf_repo | str | HuggingFace dataset repo |
| hf_subset | str | Dataset configuration / subset |
| hf_builder / hf_data_files | str / dict | For packaged builders (e.g. WebDataset exports): hf_builder="webdataset", hf_data_files={split: "hf://.../*.tar"} |
| evaluation_splits | list[str] | Splits used for scoring |
| few_shots_split | str | Split from which few-shot examples are drawn |
| prompt_function | callable | (row: dict) -> Doc |
| task_type | TaskType | Primary axis (see table above). Inferred from output_type if omitted |
| output_type | OutputType | GENERATIVE, LOGPROBS, or PERPLEXITY (generative family) |
| input_modalities | list[str] | Modalities the task feeds the model, e.g. ["image", "text"]. Models that can’t ingest them are rejected |
| zeroshot_templates | list[str] | Zero-shot classification: prompt templates with a {c} classname slot, ensembled per class |
| generation_size | int | Max new tokens (generative only) |
| stop_sequences | list[str] | Stop strings for generation |
| n_shots | int | Default few-shot count |
| metrics | list[Metric] | Scoring functions |
| description, categories, capabilities, paper_url, approx_num_samples | various | Documentation shown in mill ls and the docs |
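The zeroshot_templates row is easier to picture with an example. The sketch below fills the {c} slot for each class and averages per-prompt vectors into one vector per class. Both helper names are hypothetical, and averaging embeddings is only one common way to ensemble; the source does not say how Mill combines the templates.

```python
def expand_templates(templates, classnames):
    """Fill the {c} slot of every template for every class name."""
    return {name: [t.format(c=name) for t in templates] for name in classnames}


def mean_vector(vectors):
    """Average equal-length vectors element-wise (one possible ensembling step)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


templates = ["a photo of a {c}.", "a blurry photo of a {c}."]
prompts = expand_templates(templates, ["cat", "dog"])
print(prompts["cat"])  # ['a photo of a cat.', 'a blurry photo of a cat.']

# Toy 2-d "embeddings" standing in for a text encoder's output.
print(mean_vector([[1.0, 0.0], [0.0, 1.0]]))  # [0.5, 0.5]
```

In a CLIP-style pipeline the averaged class vector would then be compared with the image embedding, and the highest-similarity class becomes the prediction.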
The Doc dataclass
prompt_function must return a Doc:
    @dataclass
    class Doc:
        query: str                                  # assembled text prompt
        choices: list[str] | None                   # answer options (MCQ / zero-shot labels)
        target_index: int | list[int] | str | None  # gold index or gold letter/string
        visuals: list | None                        # PIL Images, paths, or URLs
        audios: list | None                         # audio paths or bytes
        videos: list | None                         # video paths
        instruction: str | None                     # optional system prompt
        metadata: dict                              # task-specific data (ids, options, splits...)
        task_name: str                              # owning task name
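As a rough illustration of the row-to-Doc mapping, the sketch below uses a cut-down stand-in for Doc so it runs without Mill installed. The stand-in class, its default values, and the dataset column names (question, options, answer, id) are assumptions, not Mill's definitions.

```python
from dataclasses import dataclass, field
from typing import Optional, Union


# Hypothetical stand-in mirroring a subset of the fields above; defaults are assumed.
@dataclass
class Doc:
    query: str
    choices: Optional[list] = None
    target_index: Union[int, list, str, None] = None
    metadata: dict = field(default_factory=dict)


def mcq_prompt(row: dict) -> Doc:
    """Render one dataset row as a lettered multiple-choice prompt."""
    letters = ["A", "B", "C", "D"]
    options = "\n".join(f"{l}. {o}" for l, o in zip(letters, row["options"]))
    return Doc(
        query=f"{row['question']}\n{options}\nAnswer:",
        choices=letters,
        target_index=row["answer"],   # integer index of the gold option
        metadata={"id": row["id"]},
    )


row = {"id": "q1", "question": "2 + 2 = ?", "options": ["3", "4", "5", "6"], "answer": 1}
doc = mcq_prompt(row)
print(doc.choices[doc.target_index])  # B
```

Keeping the gold answer as an index into choices lets the same Doc serve both log-probability ranking and letter-based generative scoring.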
Multimodal tasks
Populate Doc.visuals, Doc.audios, or Doc.videos alongside query, and declare input_modalities so Mill only runs models that can ingest them:
    def vqa_prompt(row: dict) -> Doc:
        return Doc(
            query=row["question"],
            visuals=[row["image"]],  # PIL.Image, file path, or URL
            target_index=row["answer"],
        )
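The modality gate mentioned above amounts to a subset check. The following is a minimal sketch under that assumption; supports, runnable_models, and the model names are invented for the example and are not part of Mill's API.

```python
def supports(task_modalities, model_modalities):
    """A model can serve a task only if it ingests every modality the task feeds."""
    return set(task_modalities).issubset(model_modalities)


def runnable_models(task_modalities, models):
    """Keep the models whose declared modalities cover the task's input_modalities."""
    return [name for name, mods in models.items() if supports(task_modalities, mods)]


# Hypothetical model registry: name -> modalities the model accepts.
models = {
    "text_llm": ["text"],
    "vlm": ["image", "text"],
    "audio_lm": ["audio", "text"],
}

print(runnable_models(["image", "text"], models))  # ['vlm']
print(runnable_models(["text"], models))           # ['text_llm', 'vlm', 'audio_lm']
```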
Registering custom tasks
Point Mill at a directory containing your task file(s):
    mill eval meta-llama/Meta-Llama-3-8B-Instruct my_task \
      --task_paths /home/user/my-tasks \
      --output_dir ./results
Mill auto-discovers any file that exports a TASKS_TABLE list (task files inside mill/tasks/ are discovered automatically).
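Auto-discovery of this kind is typically built on importlib. The sketch below shows one way it could work, assuming every .py file under the directory is imported and its TASKS_TABLE list, if present, is collected. discover_tasks is a hypothetical helper, not Mill's loader.

```python
import importlib.util
import tempfile
from pathlib import Path


def discover_tasks(task_dir):
    """Collect TASKS_TABLE entries from every .py file under task_dir."""
    tasks = []
    for path in sorted(Path(task_dir).rglob("*.py")):
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)  # executes the file, so only load trusted code
        table = getattr(module, "TASKS_TABLE", None)
        if isinstance(table, list):
            tasks.extend(table)
    return tasks


# Demo: one file exports TASKS_TABLE, the other does not and is ignored.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "my_task.py").write_text('TASKS_TABLE = ["my_task"]\n')
    Path(tmp, "helpers.py").write_text("CONSTANT = 1\n")
    found = discover_tasks(tmp)

print(found)  # ['my_task']
```

Under this scheme, helper modules can live next to task files without being registered, as long as they do not define TASKS_TABLE.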
Defining a benchmark
Group tasks under a benchmark name for cleaner CLI usage:
    from mill.api.task import MillBenchmarkConfig

    my_benchmark = MillBenchmarkConfig(
        name="my_benchmark",
        task_names=["my_task_a", "my_task_b"],
        metric_names=["acc"],
        weighted_aggregate=False,     # unweighted mean across tasks
        pick_variant_by_model=False,  # True = task_names are alternative renderings, not subtasks
    )

    BENCHMARKS_TABLE = [my_benchmark]
Set pick_variant_by_model=True when task_names are mutually-exclusive renderings of the same benchmark (e.g. a CLIP zero-shot task and a VLM generative-MCQ task): Mill runs the single variant whose task_type the model supports, instead of aggregating them. See mill/tasks/cifar10/task.py for a complete example.
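The weighted_aggregate flag changes how per-task scores roll up into the benchmark score. The sketch below assumes that "weighted" means weighting each task by its sample count, which the source does not state; aggregate and the numbers are illustrative only.

```python
def aggregate(task_scores, task_sizes, weighted):
    """Combine per-task scores into a single benchmark score."""
    if weighted:
        # Assumed weighting: each task counts in proportion to its sample count.
        total = sum(task_sizes[name] for name in task_scores)
        return sum(task_scores[name] * task_sizes[name] for name in task_scores) / total
    # Unweighted: every task counts equally, regardless of size.
    return sum(task_scores.values()) / len(task_scores)


scores = {"my_task_a": 0.80, "my_task_b": 0.60}
sizes = {"my_task_a": 100, "my_task_b": 300}

unweighted = aggregate(scores, sizes, weighted=False)  # (0.80 + 0.60) / 2
weighted = aggregate(scores, sizes, weighted=True)     # (80 + 180) / 400
print(round(unweighted, 2), round(weighted, 2))        # 0.7 0.65
```

The unweighted mean keeps small subtasks from being drowned out by large ones, while a sample-weighted mean matches the accuracy computed over the pooled examples.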
Adding a benchmark is a guided, end-to-end process: locating the source benchmark, mirroring its scoring, validating against the published number, and documenting it. See the Contributing guide, which is backed by the adding-a-benchmark skill in the repo.