Tasks - Mill

Built-in benchmarks

Mill is in alpha. Six benchmarks ship today, spanning text, vision, and audio. To add your own, define a custom task or follow the Contributing guide.

Benchmark	Task name(s)	Modality	Task type	Default n-shot	Metric
`mmlu`	`mmlu_<subject>` (57 subjects)	Text	`MULTIPLE_CHOICE` (log-prob)	5	`acc`
`mmlu_pro`	`mmlu_pro`	Text	`MULTIPLE_CHOICE` (generative CoT)	0	`mmlu_pro_acc`
`cifar10`	`cifar10`, `cifar10_mcq`	Image	zero-shot / generative MCQ	0	`acc` / `cifar10_mcq_acc`
`imagenet`	`imagenet`, `imagenet_mcq`	Image	zero-shot / generative MCQ	0	`acc` / `imagenet_mcq_acc`
`mmmu_pro`	`mmmu_pro`, `mmmu_pro_clip`	Image + Text	generative CoT / zero-shot	0	`mmmu_pro_acc` / `acc`
`clotho_aqa`	`clotho_aqa`	Audio + Text	`GENERATIVE_QA` (single word)	0	`clotho_aqa_exact_match`

Pass either the benchmark name (mmlu) or an individual task (mmlu_abstract_algebra) to mill eval. Browse everything interactively with mill ls.

The vision benchmarks ship in two renderings of the same data. The benchmark sets pick_variant_by_model=True, so Mill automatically runs the rendering your model supports: the zero-shot task for CLIP-style encoders (image↔text similarity) and the generative multiple-choice task for vision-language models (which answer with a letter). You pass the benchmark name; Mill picks the variant.

Task types

task_type is the primary axis of a task: it declares what the task asks and decides which model interface serves it. output_type is a secondary scoring detail that applies only to the generative family (the task types served by an LLM or VLM).

`task_type`	Served by	Notes
`MULTIPLE_CHOICE`	LLM / VLM	Pick one option. Scored by generated answer letter (`output_type=GENERATIVE`) or per-choice log-probability (`output_type=LOGPROBS`).
`GENERATIVE_QA`	LLM / VLM	Free-text generation; a scorer extracts/grades the answer.
`PERPLEXITY`	LLM	Rolling log-likelihood over a sequence.
`ZERO_SHOT_CLASSIFICATION`	CLIP-style encoder	Score an input against candidate text labels by embedding similarity.
`SUPERVISED_CLASSIFICATION`	timm / fixed-head	Predict over a fixed pretrained label set (e.g. ImageNet-1k).

The matching output types for the generative family:

`output_type`	When to use
`GENERATIVE`	Free-text generation — model writes an answer and you extract it
`LOGPROBS`	Multiple-choice — rank answer options by log-probability
`PERPLEXITY`	Measure rolling log-probability of a document
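
To make the LOGPROBS row concrete, here is a minimal sketch of ranking answer options by log-probability and scoring the result as accuracy. It is not Mill's implementation: the function names and the hard-coded scores are illustrative assumptions, and a real run would obtain the per-choice log-probabilities from the model.

```python
def pick_choice(choice_logprobs: list[float]) -> int:
    """Return the index of the highest-scoring answer option."""
    return max(range(len(choice_logprobs)), key=lambda i: choice_logprobs[i])


def accuracy(predictions: list[int], targets: list[int]) -> float:
    """Fraction of examples whose predicted index equals the gold index."""
    if not predictions:
        return 0.0
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(predictions)


# Hypothetical per-choice log-probabilities for three 4-option questions.
scores = [
    [-2.3, -0.4, -3.1, -2.9],  # option B (index 1) scores highest
    [-0.2, -1.9, -2.5, -3.0],  # option A (index 0) scores highest
    [-1.5, -1.4, -0.9, -2.2],  # option C (index 2) scores highest
]
gold = [1, 0, 3]  # gold indices; the third prediction is wrong on purpose

preds = [pick_choice(s) for s in scores]
print(preds)                  # [1, 0, 2]
print(accuracy(preds, gold))  # 2 of 3 correct
```

With output_type=GENERATIVE the comparison would instead be between an answer letter extracted from the generated text and the gold letter.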

MillTaskConfig fields

Define a task by creating a MillTaskConfig and exporting it in a TASKS_TABLE list:

from mill.api.instance import OutputType
from mill.api.metrics import get_metric
from mill.api.task import Doc, MillTaskConfig
from mill.api.taxonomy import TaskType

def my_prompt(row: dict) -> Doc:
    return Doc(
        query=f"Answer the following: {row['question']}",
        choices=["A", "B", "C", "D"],
        target_index=row["answer"],
    )

my_task = MillTaskConfig(
    name="my_task",
    version=1,
    hf_repo="my-org/my-dataset",
    hf_subset="default",
    hf_avail_splits=["train", "test"],
    evaluation_splits=["test"],
    few_shots_split="train",
    prompt_function=my_prompt,
    task_type=TaskType.MULTIPLE_CHOICE,
    output_type=OutputType.LOGPROBS,
    n_shots=5,
    metrics=[get_metric("acc")],
    description="My custom task description.",
    categories=["reasoning"],
    capabilities=["domain knowledge"],
    paper_url="https://arxiv.org/abs/0000.00000",
    approx_num_samples={"test": 1000},
)

TASKS_TABLE = [my_task]

Key fields

Field	Type	Description
`name`	str	Unique task name used in CLI args
`hf_repo`	str	HuggingFace dataset repo
`hf_subset`	str	Dataset configuration / subset
`hf_builder` / `hf_data_files`	str / dict	For packaged builders (e.g. WebDataset exports): `hf_builder="webdataset"`, `hf_data_files={split: "hf://.../*.tar"}`
`evaluation_splits`	list[str]	Splits used for scoring
`few_shots_split`	str	Split from which few-shot examples are drawn
`prompt_function`	callable	`(row: dict) -> Doc`
`task_type`	TaskType	Primary axis (see table above). Inferred from `output_type` if omitted
`output_type`	OutputType	`GENERATIVE`, `LOGPROBS`, or `PERPLEXITY` (generative family)
`input_modalities`	list[str]	Modalities the task feeds the model, e.g. `["image", "text"]`. Models that can’t ingest them are rejected
`zeroshot_templates`	list[str]	Zero-shot classification: prompt templates with a `{c}` classname slot, ensembled per class
`generation_size`	int	Max new tokens (generative only)
`stop_sequences`	list[str]	Stop strings for generation
`n_shots`	int	Default few-shot count
`metrics`	list[Metric]	Scoring functions
`description`, `categories`, `capabilities`, `paper_url`, `approx_num_samples`	—	Documentation shown in `mill ls` and the docs
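
As an illustration of the zeroshot_templates row, the sketch below fills the `{c}` slot for one class name and averages the resulting text embeddings into a single class vector. The helper names are hypothetical, `toy_embed` is a stand-in rather than a real encoder, and using a plain mean for the ensemble is an assumption about how the per-class ensembling might work.

```python
def expand_templates(templates: list[str], classname: str) -> list[str]:
    """Fill the {c} slot of every template with one class name."""
    return [t.format(c=classname) for t in templates]


def mean_vector(vectors: list[list[float]]) -> list[float]:
    """Element-wise mean of equal-length vectors (the per-class ensemble)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def toy_embed(text: str) -> list[float]:
    """Stand-in for a text encoder: NOT a real embedding, just two text statistics."""
    return [float(len(text)), float(text.count(" "))]


templates = ["a photo of a {c}.", "a blurry photo of a {c}."]
prompts = expand_templates(templates, "cat")
print(prompts)  # ['a photo of a cat.', 'a blurry photo of a cat.']

# One vector per class, obtained by averaging over all templates.
class_embedding = mean_vector([toy_embed(p) for p in prompts])
print(class_embedding)
```

A CLIP-style model would then compare each input embedding against one such vector per candidate label.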

The Doc dataclass

prompt_function must return a Doc:

from dataclasses import dataclass

@dataclass
class Doc:
    query: str                       # assembled text prompt
    choices: list[str] | None        # answer options (MCQ / zero-shot labels)
    target_index: int | list[int] | str | None  # gold index or gold letter/string
    visuals: list | None             # PIL Images, paths, or URLs
    audios: list | None              # audio paths or bytes
    videos: list | None              # video paths
    instruction: str | None          # optional system prompt
    metadata: dict                   # task-specific data (ids, options, splits...)
    task_name: str                   # owning task name

Multimodal tasks

In the prompt function, populate Doc.visuals, Doc.audios, or Doc.videos alongside query. On the task config, declare input_modalities so Mill only runs models that can ingest them:

from mill.api.task import Doc

def vqa_prompt(row: dict) -> Doc:
    return Doc(
        query=row["question"],
        visuals=[row["image"]],      # PIL.Image, file path, or URL
        target_index=row["answer"],  # gold answer taken from the dataset row
    )
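
Models that cannot ingest a task's declared input_modalities are rejected. One way to picture that gate is a subset check, sketched below. The function name and the way a model advertises its modalities are assumptions for illustration, not Mill's actual API.

```python
def can_run(task_modalities: list[str], model_modalities: set[str]) -> bool:
    """True when the model can ingest every modality the task feeds it."""
    return set(task_modalities) <= model_modalities


vqa_task_modalities = ["image", "text"]     # what a VQA task would declare
text_only_model = {"text"}                  # hypothetical LLM
vision_language_model = {"image", "text"}   # hypothetical VLM

print(can_run(vqa_task_modalities, text_only_model))        # False
print(can_run(vqa_task_modalities, vision_language_model))  # True
```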

Registering custom tasks

Point Mill at a directory containing your task file(s):

mill eval meta-llama/Meta-Llama-3-8B-Instruct my_task \
  --task_paths /home/user/my-tasks \
  --output_dir ./results

Mill auto-discovers any file in that directory that exports a TASKS_TABLE list. Task files inside mill/tasks/ are discovered automatically, with no extra path needed.
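
For intuition, discovery of this kind can be sketched with the standard library: import each Python file and collect its TASKS_TABLE if one is present. This is a guess at the general technique, not Mill's loader. The function name is hypothetical, and whether Mill searches subdirectories recursively is an assumption made here.

```python
import importlib.util
import tempfile
from pathlib import Path


def discover_tasks(task_dir: str) -> list:
    """Import every .py file under task_dir and gather its TASKS_TABLE entries."""
    found = []
    for path in sorted(Path(task_dir).rglob("*.py")):
        spec = importlib.util.spec_from_file_location(path.stem, path)
        if spec is None or spec.loader is None:
            continue
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)  # runs the file's top-level code
        table = getattr(module, "TASKS_TABLE", None)
        if isinstance(table, list):
            found.extend(table)
    return found


# Demo with plain strings standing in for MillTaskConfig objects.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "my_tasks.py").write_text('TASKS_TABLE = ["my_task"]\n')
    Path(tmp, "helpers.py").write_text("X = 1\n")  # no TASKS_TABLE: ignored
    discovered = discover_tasks(tmp)

print(discovered)  # ['my_task']
```

Because importing a file executes its top-level code, task files are best kept free of side effects.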

Defining a benchmark

Group tasks under a benchmark name for cleaner CLI usage:

from mill.api.task import MillBenchmarkConfig

my_benchmark = MillBenchmarkConfig(
    name="my_benchmark",
    task_names=["my_task_a", "my_task_b"],
    metric_names=["acc"],
    weighted_aggregate=False,      # unweighted mean across tasks
    pick_variant_by_model=False,   # True = task_names are alternative renderings, not subtasks
)

BENCHMARKS_TABLE = [my_benchmark]

Set pick_variant_by_model=True when task_names are mutually exclusive renderings of the same benchmark (e.g. a CLIP zero-shot task and a VLM generative-MCQ task): Mill runs the single variant whose task_type the model supports, instead of aggregating them. See mill/tasks/cifar10/task.py for a complete example.
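
The two benchmark modes can be contrasted in a short sketch: variant benchmarks resolve to one runnable task, while ordinary benchmarks average their subtask scores (the weighted_aggregate=False case). The function names, the string task-type labels, and the error raised when zero or several variants match are assumptions for illustration, not Mill's actual behaviour.

```python
def resolve_variant(task_types: dict[str, str], supported: set[str]) -> str:
    """Return the one task whose task_type the model supports."""
    matches = [name for name, ttype in task_types.items() if ttype in supported]
    if len(matches) != 1:
        raise ValueError(f"expected exactly one runnable variant, got {matches}")
    return matches[0]


def unweighted_mean(task_scores: dict[str, float]) -> float:
    """Aggregate subtask scores with equal weight per task."""
    return sum(task_scores.values()) / len(task_scores)


# Variant benchmark: two renderings of the same data, only one runs.
cifar10_variants = {
    "cifar10": "ZERO_SHOT_CLASSIFICATION",
    "cifar10_mcq": "MULTIPLE_CHOICE",
}
clip_supports = {"ZERO_SHOT_CLASSIFICATION"}
vlm_supports = {"MULTIPLE_CHOICE", "GENERATIVE_QA"}
print(resolve_variant(cifar10_variants, clip_supports))  # cifar10
print(resolve_variant(cifar10_variants, vlm_supports))   # cifar10_mcq

# Subtask benchmark: every task runs and the scores are averaged.
print(unweighted_mean({"my_task_a": 0.5, "my_task_b": 0.75}))  # 0.625
```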

Adding a benchmark is a guided, end-to-end process: locating the source benchmark, mirroring how it scores, validating against the published number, and documenting it. See the Contributing guide, which is backed by the adding-a-benchmark skill in the repo.
