Built-in benchmarks

Mill is in alpha. Six benchmarks ship today, spanning text, vision, and audio. Add your own with custom tasks or the Contributing guide.
| Benchmark | Task name(s) | Modality | Task type | Default n-shot | Metric |
| --- | --- | --- | --- | --- | --- |
| mmlu | mmlu_<subject> (57 subjects) | Text | MULTIPLE_CHOICE (log-prob) | 5 | acc |
| mmlu_pro | mmlu_pro | Text | MULTIPLE_CHOICE (generative CoT) | 0 | mmlu_pro_acc |
| cifar10 | cifar10, cifar10_mcq | Image | zero-shot / generative MCQ | 0 | acc / cifar10_mcq_acc |
| imagenet | imagenet, imagenet_mcq | Image | zero-shot / generative MCQ | 0 | acc / imagenet_mcq_acc |
| mmmu_pro | mmmu_pro, mmmu_pro_clip | Image + Text | generative CoT / zero-shot | 0 | mmmu_pro_acc / acc |
| clotho_aqa | clotho_aqa | Audio + Text | GENERATIVE_QA (single word) | 0 | clotho_aqa_exact_match |
Pass either the benchmark name (mmlu) or an individual task (mmlu_abstract_algebra) to mill eval. Browse everything interactively with mill ls.
The vision benchmarks ship in two renderings of the same data. The benchmark sets pick_variant_by_model=True, so Mill automatically runs the rendering your model supports: the zero-shot task for CLIP-style encoders (image↔text similarity) and the generative multiple-choice task for vision-language models (which answer with a letter). You pass the benchmark name; Mill picks the variant.
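The selection described above can be pictured with a small sketch. It is not Mill's implementation: `VARIANT_TASK_TYPES` and `pick_variant` are hypothetical names, and the mapping of task names to task types is read off the benchmark table.

```python
from typing import Optional

# Hypothetical lookup (not Mill's API): the task_type each rendering declares.
VARIANT_TASK_TYPES = {
    "cifar10": "ZERO_SHOT_CLASSIFICATION",
    "cifar10_mcq": "MULTIPLE_CHOICE",
}


def pick_variant(task_names: list[str], supported: set[str]) -> Optional[str]:
    """Return the first rendering whose task_type the model can serve."""
    for name in task_names:
        if VARIANT_TASK_TYPES[name] in supported:
            return name
    return None  # no rendering fits this model


variants = ["cifar10", "cifar10_mcq"]
# A CLIP-style encoder only serves zero-shot classification.
clip_choice = pick_variant(variants, {"ZERO_SHOT_CLASSIFICATION"})
# A vision-language model serves the generative families.
vlm_choice = pick_variant(variants, {"MULTIPLE_CHOICE", "GENERATIVE_QA"})
```

Under these assumptions the encoder is routed to `cifar10` and the vision-language model to `cifar10_mcq`.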

Task types

task_type is the primary axis of a task — it declares what the task asks and decides which model interface serves it. output_type is a secondary scoring detail of the generative family only.
| task_type | Served by | Notes |
| --- | --- | --- |
| MULTIPLE_CHOICE | LLM / VLM | Pick one option. Scored by generated answer letter (output_type=GENERATIVE) or per-choice log-probability (output_type=LOGPROBS). |
| GENERATIVE_QA | LLM / VLM | Free-text generation; a scorer extracts/grades the answer. |
| PERPLEXITY | LLM | Rolling log-likelihood over a sequence. |
| ZERO_SHOT_CLASSIFICATION | CLIP-style encoder | Score an input against candidate text labels by embedding similarity. |
| SUPERVISED_CLASSIFICATION | timm / fixed-head | Predict over a fixed pretrained label set (e.g. ImageNet-1k). |
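The two ways of scoring MULTIPLE_CHOICE can be contrasted in a few lines. This is a minimal sketch, not Mill's scorer: `score_logprobs` and `score_generative` are hypothetical helpers, and the "answer is (X)" pattern is an assumed extraction rule.

```python
import re

LETTERS = "ABCD"


def score_logprobs(choice_logprobs: list[float], gold_index: int) -> int:
    """LOGPROBS style: the prediction is the option with the highest log-probability."""
    pred = max(range(len(choice_logprobs)), key=choice_logprobs.__getitem__)
    return int(pred == gold_index)


def score_generative(generation: str, gold_index: int) -> int:
    """GENERATIVE style: pull an answer letter out of free text, then compare."""
    # Assumed pattern: "answer is B" or "answer is (B)", capital letter only.
    match = re.search(r"(?i:answer is)\s*\(?([A-D])\b", generation)
    if match is None:
        return 0  # nothing extractable counts as wrong
    return int(LETTERS.index(match.group(1)) == gold_index)


logprob_hit = score_logprobs([-2.3, -0.4, -3.1, -1.9], gold_index=1)
generative_hit = score_generative("Let's think step by step. The answer is (B).", gold_index=1)
generative_miss = score_generative("I am not sure.", gold_index=1)
```

The log-prob route never depends on output formatting, while the generative route can fail on extraction alone, which is one reason the two renderings may report different metrics.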
The matching output types for the generative family:
| output_type | When to use |
| --- | --- |
| GENERATIVE | Free-text generation: the model writes an answer and you extract it |
| LOGPROBS | Multiple-choice: rank answer options by log-probability |
| PERPLEXITY | Measure rolling log-probability of a document |
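For the PERPLEXITY row, the standard definition is the exponential of the negative mean token log-probability. The sketch below assumes natural-log, per-token normalization; whether Mill normalizes per token, word, or byte is not stated here, and `perplexity` is a hypothetical helper.

```python
import math


def perplexity(token_logprobs: list[float]) -> float:
    """exp(-mean log-prob) over all scored tokens (natural log assumed)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# A model that gives every token probability 0.25 has perplexity 4.
uniform_ppl = perplexity([math.log(0.25)] * 4)
```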

MillTaskConfig fields

Define a task by creating a MillTaskConfig and exporting it in a TASKS_TABLE list:
from mill.api.instance import OutputType
from mill.api.metrics import get_metric
from mill.api.task import Doc, MillTaskConfig
from mill.api.taxonomy import TaskType

def my_prompt(row: dict) -> Doc:
    return Doc(
        query=f"Answer the following: {row['question']}",
        choices=["A", "B", "C", "D"],
        target_index=row["answer"],
    )

my_task = MillTaskConfig(
    name="my_task",
    version=1,
    hf_repo="my-org/my-dataset",
    hf_subset="default",
    hf_avail_splits=["train", "test"],
    evaluation_splits=["test"],
    few_shots_split="train",
    prompt_function=my_prompt,
    task_type=TaskType.MULTIPLE_CHOICE,
    output_type=OutputType.LOGPROBS,
    n_shots=5,
    metrics=[get_metric("acc")],
    description="My custom task description.",
    categories=["reasoning"],
    capabilities=["domain knowledge"],
    paper_url="https://arxiv.org/abs/0000.00000",
    approx_num_samples={"test": 1000},
)

TASKS_TABLE = [my_task]
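To make n_shots and few_shots_split concrete, here is one way a few-shot prompt could be assembled from rows of the few-shot split. The "Question:/Answer:" layout and the `build_few_shot_prompt` helper are assumptions for illustration; Mill's actual formatting may differ.

```python
def build_few_shot_prompt(few_shot_rows: list[dict], test_row: dict, n_shots: int) -> str:
    """Prepend n_shots solved examples, then the unsolved test question."""
    blocks = [
        f"Question: {row['question']}\nAnswer: {row['answer']}"
        for row in few_shot_rows[:n_shots]
    ]
    # The test question ends with an open "Answer:" for the model to complete.
    blocks.append(f"Question: {test_row['question']}\nAnswer:")
    return "\n\n".join(blocks)


train_rows = [
    {"question": "2+2?", "answer": "4"},
    {"question": "5+1?", "answer": "6"},
]
prompt = build_few_shot_prompt(train_rows, {"question": "3+3?"}, n_shots=1)
```

With `n_shots=0` the same function returns only the test question, which matches the zero-shot defaults listed for most built-in benchmarks.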

Key fields

| Field | Type | Description |
| --- | --- | --- |
| name | str | Unique task name used in CLI args |
| hf_repo | str | HuggingFace dataset repo |
| hf_subset | str | Dataset configuration / subset |
| hf_builder / hf_data_files | str / dict | For packaged builders (e.g. WebDataset exports): hf_builder="webdataset", hf_data_files={split: "hf://.../*.tar"} |
| evaluation_splits | list[str] | Splits used for scoring |
| few_shots_split | str | Split from which few-shot examples are drawn |
| prompt_function | callable | (row: dict) -> Doc |
| task_type | TaskType | Primary axis (see table above). Inferred from output_type if omitted |
| output_type | OutputType | GENERATIVE, LOGPROBS, or PERPLEXITY (generative family) |
| input_modalities | list[str] | Modalities the task feeds the model, e.g. ["image", "text"]. Models that can’t ingest them are rejected |
| zeroshot_templates | list[str] | Zero-shot classification: prompt templates with a {c} classname slot, ensembled per class |
| generation_size | int | Max new tokens (generative only) |
| stop_sequences | list[str] | Stop strings for generation |
| n_shots | int | Default few-shot count |
| metrics | list[Metric] | Scoring functions |
| description, categories, capabilities, paper_url, approx_num_samples | | Documentation shown in mill ls and the docs |
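The zeroshot_templates field mentions per-class ensembling. A common way to do this with CLIP-style encoders is to average the normalized text embeddings of every filled template and classify by cosine similarity. The sketch below shows that technique with a toy embedder; `class_embedding`, `classify`, and `toy_embed` are hypothetical, and a real run would call the encoder's text tower instead.

```python
import math
from typing import Callable

Vector = list[float]


def normalize(v: Vector) -> Vector:
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def class_embedding(classname: str, templates: list[str],
                    embed_text: Callable[[str], Vector]) -> Vector:
    """Average the normalized embeddings of every filled template."""
    vecs = [normalize(embed_text(t.format(c=classname))) for t in templates]
    mean = [sum(col) / len(vecs) for col in zip(*vecs)]
    return normalize(mean)


def classify(image_vec: Vector, classnames: list[str], templates: list[str],
             embed_text: Callable[[str], Vector]) -> int:
    """Return the index of the class whose ensembled embedding is most similar."""
    image_vec = normalize(image_vec)
    scores = [
        sum(a * b for a, b in zip(image_vec, class_embedding(c, templates, embed_text)))
        for c in classnames
    ]
    return max(range(len(classnames)), key=scores.__getitem__)


def toy_embed(text: str) -> Vector:
    # Toy stand-in for a text encoder: one axis per keyword.
    return [float("cat" in text) + 0.1, float("dog" in text) + 0.1]


templates = ["a photo of a {c}.", "a blurry photo of a {c}."]
pred_cat = classify([0.9, 0.2], ["cat", "dog"], templates, toy_embed)
pred_dog = classify([0.1, 0.8], ["cat", "dog"], templates, toy_embed)
```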

The Doc dataclass

prompt_function must return a Doc:
@dataclass
class Doc:
    query: str                       # assembled text prompt
    choices: list[str] | None        # answer options (MCQ / zero-shot labels)
    target_index: int | list[int] | str | None  # gold index or gold letter/string
    visuals: list | None             # PIL Images, paths, or URLs
    audios: list | None              # audio paths or bytes
    videos: list | None              # video paths
    instruction: str | None          # optional system prompt
    metadata: dict                   # task-specific data (ids, options, splits...)
    task_name: str                   # owning task name
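The listing above shows field types only. The following self-contained stand-in illustrates how a prompt function might fill those fields for a multiple-choice row. `DocSketch`, its default values, and the row layout are assumptions for illustration; import the real class from mill.api.task in actual task files.

```python
from dataclasses import dataclass, field
from typing import Optional, Union


@dataclass
class DocSketch:
    """Stand-in for Doc with the same fields; the defaults are assumed."""
    query: str
    choices: Optional[list[str]] = None
    target_index: Union[int, list[int], str, None] = None
    visuals: Optional[list] = None
    audios: Optional[list] = None
    videos: Optional[list] = None
    instruction: Optional[str] = None
    metadata: dict = field(default_factory=dict)
    task_name: str = ""


def mcq_prompt(row: dict) -> DocSketch:
    letters = ["A", "B", "C", "D"]
    # Render the options as "A. ...", one per line, under the question.
    options = "\n".join(f"{letter}. {text}" for letter, text in zip(letters, row["options"]))
    return DocSketch(
        query=f"{row['question']}\n{options}\nAnswer:",
        choices=letters,
        target_index=row["answer"],   # gold index into choices
        metadata={"id": row["id"]},   # keep the source id for debugging
    )


row = {
    "id": "q1",
    "question": "Which planet is largest?",
    "options": ["Mars", "Venus", "Jupiter", "Mercury"],
    "answer": 2,
}
doc = mcq_prompt(row)
```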

Multimodal tasks

Populate Doc.visuals, Doc.audios, or Doc.videos alongside query, and declare input_modalities so Mill only runs models that can ingest them:
def vqa_prompt(row: dict) -> Doc:
    return Doc(
        query=row["question"],
        visuals=[row["image"]],      # PIL.Image, file path, or URL
        target_index=row["answer"],
    )
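The modality gate can be thought of as a subset check between what the task feeds and what the model ingests. This is a sketch of that idea only; `missing_modalities` is a hypothetical helper, and how Mill records a model's supported modalities is not shown here.

```python
def missing_modalities(task_modalities: list[str], model_modalities: set[str]) -> list[str]:
    """Modalities the task needs that the model cannot ingest (empty means runnable)."""
    return sorted(set(task_modalities) - model_modalities)


vqa_modalities = ["image", "text"]           # what a VQA task would declare
text_only_gap = missing_modalities(vqa_modalities, {"text"})
vlm_gap = missing_modalities(vqa_modalities, {"image", "text"})
```

A non-empty result corresponds to the model being rejected for the task.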

Registering custom tasks

Point Mill at a directory containing your task file(s):
mill eval meta-llama/Meta-Llama-3-8B-Instruct my_task \
  --task_paths /home/user/my-tasks \
  --output_dir ./results
Mill auto-discovers any file that exports a TASKS_TABLE list (task files inside mill/tasks/ are discovered automatically).
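Discovery of this kind is typically built on importlib: load each Python file in the directory and collect its TASKS_TABLE if one is exported. The sketch below demonstrates the pattern against a temporary directory; `discover_tasks` is a hypothetical function, not Mill's loader, and plain strings stand in for MillTaskConfig objects.

```python
import importlib.util
import tempfile
from pathlib import Path


def discover_tasks(task_dir: str) -> list:
    """Import every .py file under task_dir and gather exported TASKS_TABLE entries."""
    tasks = []
    for path in sorted(Path(task_dir).rglob("*.py")):
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        # Files without a TASKS_TABLE (helpers, utilities) contribute nothing.
        tasks.extend(getattr(module, "TASKS_TABLE", []))
    return tasks


with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "my_task.py").write_text("TASKS_TABLE = ['my_task']\n")
    Path(tmp, "helpers.py").write_text("SCALE = 1\n")
    found = discover_tasks(tmp)
```

Because each file is executed on import, task files should avoid side effects at module level.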

Defining a benchmark

Group tasks under a benchmark name for cleaner CLI usage:
from mill.api.task import MillBenchmarkConfig

my_benchmark = MillBenchmarkConfig(
    name="my_benchmark",
    task_names=["my_task_a", "my_task_b"],
    metric_names=["acc"],
    weighted_aggregate=False,      # unweighted mean across tasks
    pick_variant_by_model=False,   # True = task_names are alternative renderings, not subtasks
)

BENCHMARKS_TABLE = [my_benchmark]
Set pick_variant_by_model=True when task_names are mutually-exclusive renderings of the same benchmark (e.g. a CLIP zero-shot task and a VLM generative-MCQ task): Mill runs the single variant whose task_type the model supports, instead of aggregating them. See mill/tasks/cifar10/task.py for a complete example.
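The weighted_aggregate flag chooses between two means. The sketch below assumes that the weighted form weights each task by its sample count, which is a guess about the flag's meaning; `aggregate` and the scores shown are made up for illustration.

```python
def aggregate(scores: dict[str, float], num_samples: dict[str, int], weighted: bool) -> float:
    """Combine per-task scores into one benchmark score."""
    if weighted:
        # Assumed behaviour: weight each task by how many samples it scored.
        total = sum(num_samples[task] for task in scores)
        return sum(scores[task] * num_samples[task] for task in scores) / total
    # Unweighted: every task counts equally regardless of size.
    return sum(scores.values()) / len(scores)


scores = {"my_task_a": 0.8, "my_task_b": 0.5}
num_samples = {"my_task_a": 100, "my_task_b": 300}
unweighted = aggregate(scores, num_samples, weighted=False)
weighted = aggregate(scores, num_samples, weighted=True)
```

With these illustrative numbers the larger task pulls the weighted score below the unweighted one, which is the practical difference to keep in mind when subtasks vary widely in size.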
Adding a benchmark is a guided, end-to-end process — locating the source benchmark, mirroring how it scores, validating against the published number, and documenting it. See the Contributing guide, which is backed by the adding-a-benchmark skill in the repo.