Models - Mill

Backend overview

Backend	Registry name(s)	Install	Use case
HuggingFace Transformers	`hf`, `huggingface`, `transformers`	(core)	Local LLMs/VLMs, text + multimodal
vLLM	`vllm`	`pip install -e ".[vllm]"`	High-throughput local generation
LiteLLM	`litellm`, `openai`, `api`	`pip install -e ".[litellm]"`	OpenAI, Anthropic, and 100+ API providers
open_clip (CLIP)	`clip`, `open_clip`, `openclip`	`pip install -e ".[clip]"`	Zero-shot image classification / retrieval
timm	`timm`, `pytorch-image-models`	`pip install -e ".[timm]"`	Supervised vision classification (fixed head)

Which backend serves which tasks

A model can only run tasks whose task_type its interface supports — Mill rejects mismatches up front with a clear error rather than producing wrong numbers.

Backend	Task types it serves	When to use it
HF / vLLM	`GENERATIVE_QA`, `MULTIPLE_CHOICE`, `PERPLEXITY`	Any LLM or VLM, text or multimodal. vLLM when you want throughput; HF for the widest model coverage and multimodal processors.
LiteLLM	`GENERATIVE_QA`, `MULTIPLE_CHOICE` (generative only)	Hosted API models. Generative tasks only — no log-prob/perplexity over an API.
open_clip	`ZERO_SHOT_CLASSIFICATION`	CLIP-family image–text models for zero-shot classification (e.g. CIFAR-10, ImageNet, the `*_clip` task variants).
timm	`SUPERVISED_CLASSIFICATION`	Vision models with a fixed pretrained head (e.g. ResNet on ImageNet-1k).

LiteLLM (API) models support generative tasks only — log-probability and perplexity scoring aren’t available over an API. Use them with generative benchmarks like mmlu_pro, not log-prob ones like mmlu.
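The compatibility rule above can be pictured as a lookup that runs before any inference. The following is a minimal sketch built from the table, not Mill's internal registry; `SUPPORTED_TASK_TYPES` and `check_compatible` are hypothetical names introduced only for illustration.

```python
# Hypothetical lookup derived from the table above; not Mill's actual code.
# Note: for LiteLLM, MULTIPLE_CHOICE applies to generative benchmarks only,
# a nuance this simple set-based sketch does not model.
SUPPORTED_TASK_TYPES = {
    "hf": {"GENERATIVE_QA", "MULTIPLE_CHOICE", "PERPLEXITY"},
    "vllm": {"GENERATIVE_QA", "MULTIPLE_CHOICE", "PERPLEXITY"},
    "litellm": {"GENERATIVE_QA", "MULTIPLE_CHOICE"},
    "clip": {"ZERO_SHOT_CLASSIFICATION"},
    "timm": {"SUPERVISED_CLASSIFICATION"},
}


def check_compatible(backend: str, task_type: str) -> None:
    """Raise ValueError up front if the backend cannot serve the task type."""
    supported = SUPPORTED_TASK_TYPES.get(backend)
    if supported is None:
        raise ValueError(f"unknown backend: {backend!r}")
    if task_type not in supported:
        raise ValueError(
            f"backend {backend!r} does not support task type {task_type!r}; "
            f"supported: {sorted(supported)}"
        )


check_compatible("hf", "PERPLEXITY")  # passes silently
```

Failing early like this is what keeps a mismatched pairing from silently producing meaningless scores.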

Specifying a model

By HF model ID (shorthand)

mill eval meta-llama/Meta-Llama-3-8B-Instruct mmlu

Mill infers type=hf when the argument is a HuggingFace model path.

By backend name with inline args

Pass model arguments inline in brackets as comma-separated key=value pairs. Quote the whole spec so your shell doesn’t interpret the brackets:

mill eval "litellm[model=gpt-4o]" mmlu_pro
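To make the bracket syntax concrete, here is a rough sketch of how such a spec could be split into a config dict. It is an assumption about the general shape, not Mill's parser: `parse_model_spec` and `_coerce` are hypothetical helpers, and the fallback branch ignores config-file paths and bare backend names.

```python
import re

# backend name, then everything between the outer brackets
_SPEC_RE = re.compile(r"^(?P<backend>[\w-]+)\[(?P<args>.*)\]$")


def _coerce(value: str):
    """Turn 'true'/'false' and numeric strings into Python values."""
    if value.lower() in ("true", "false"):
        return value.lower() == "true"
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            continue
    return value  # anything else stays a string


def parse_model_spec(spec: str) -> dict:
    """Split 'backend[key=value,...]' into a config dict (illustrative only)."""
    match = _SPEC_RE.match(spec)
    if match is None:
        # Simplification: treat a bracket-less string as an HF model ID or path.
        return {"type": "hf", "path": spec}
    config = {"type": match.group("backend")}
    for pair in filter(None, match.group("args").split(",")):
        key, sep, value = pair.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {pair!r}")
        config[key.strip()] = _coerce(value.strip())
    return config


print(parse_model_spec("litellm[model=gpt-4o]"))
```

Because pairs are split on commas, a value that itself contains commas (such as a list) cannot be expressed in this form, which is consistent with list-valued settings being reserved for config files.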

By Python config file

mill eval mill/models/configs/qwen/qwen2_5_vl_7b.py mmlu

Mill calls load_model_from_file() on the path and uses the returned dict.

HuggingFace Transformers

Supports text-only and multimodal models via AutoModelForCausalLM + AutoProcessor.

Inline args

Key	Type	Default	Description
`path`	str	required	HF model ID or local path
`modalities`	list[str]	`["text"]`	Modalities handled, e.g. `["text", "image", "video", "audio"]`. Set via a config file — a list can’t be passed inline in brackets
`dtype`	str	`bfloat16`	`bfloat16` / `float16` / `float32`
`device_map`	str	`auto`	`auto` / `cuda` / `cpu`
`max_context_length`	int	`4096`	Token budget
`batch_size`	int	auto	Samples per forward pass (auto-estimated from GPU memory if unset)
`attn_implementation`	str	—	`flash_attention_2` / `sdpa`
`trust_remote_code`	bool	`true`	Allow custom model code from HF
`use_chat_template`	bool	`false`	Wrap prompts with the tokenizer chat template

Example

mill eval "hf[path=meta-llama/Meta-Llama-3-8B-Instruct,dtype=bfloat16,batch_size=4]" mmlu \
  --output_dir ./results

vLLM

High-throughput inference backend. Requires pip install -e ".[vllm]".

mill eval "vllm[path=meta-llama/Meta-Llama-3-8B-Instruct,dtype=bfloat16]" mmlu \
  --output_dir ./results

vLLM-specific args: gpu_memory_utilization (default 0.9), tensor_parallel_size (default 1), and max_model_len (override the model’s max sequence length).
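For repeated runs it may be easier to keep these settings in a config file. The sketch below assumes that a config file accepts the same keys as the inline args and uses `"type": "vllm"` to select the backend; the file name, the chosen values, and the `to_inline_spec` helper are hypothetical and only show how the two forms correspond.

```python
# vllm_llama3.py: hypothetical config file, assuming config keys match the inline args.
model = {
    "type": "vllm",
    "path": "meta-llama/Meta-Llama-3-8B-Instruct",
    "dtype": "bfloat16",
    "gpu_memory_utilization": 0.9,  # the documented default, written out explicitly
    "tensor_parallel_size": 2,      # example value: shard across two GPUs
    "max_model_len": 8192,          # example value: cap the sequence length
}


def to_inline_spec(config: dict) -> str:
    """Render a config dict as the equivalent 'backend[key=value,...]' string."""
    args = ",".join(f"{key}={value}" for key, value in config.items() if key != "type")
    return f"{config['type']}[{args}]"


print(to_inline_spec(model))
```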

LiteLLM

Calls hosted API models (OpenAI, Anthropic, and the other providers LiteLLM supports) through LiteLLM’s OpenAI-style interface. Requires pip install -e ".[litellm]".

# OpenAI
OPENAI_API_KEY=sk-... mill eval "litellm[model=gpt-4o]" mmlu_pro \
  --output_dir ./results

# Anthropic
ANTHROPIC_API_KEY=sk-ant-... mill eval "litellm[model=claude-3-5-sonnet-20241022]" mmlu_pro \
  --output_dir ./results

Pass any LiteLLM completion parameter as an inline arg. API models run generative tasks only (e.g. mmlu_pro) — log-prob benchmarks like mmlu aren’t supported over an API.
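The distinction between log-prob and generative benchmarks can be illustrated with two toy scorers. This is a general sketch of the two scoring styles, not Mill's scoring code; both function names and the default option letters are assumptions made for the example.

```python
import re


def pick_by_loglikelihood(choice_logprobs: dict[str, float]) -> str:
    """Log-prob style: the model scores every choice and the highest score wins.

    This needs per-choice log-likelihoods, which a chat-style API call does not return.
    """
    return max(choice_logprobs, key=choice_logprobs.get)


def pick_from_generation(text: str, letters: str = "ABCD"):
    """Generative style: parse the first standalone option letter from generated text."""
    match = re.search(rf"\b([{letters}])\b", text)
    return match.group(1) if match else None


print(pick_by_loglikelihood({"A": -4.2, "B": -1.3, "C": -3.0, "D": -5.1}))
print(pick_from_generation("The answer is C."))
```

Only the second style works when all the backend can return is generated text, which is why API models are paired with generative benchmarks.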

open_clip (CLIP)

CLIP-style zero-shot image classification via open_clip. Each request carries an image and candidate text labels; the model returns the best-matching label by image–text cosine similarity, ensembling the task’s prompt templates per class.

Requirements: pip install -e ".[clip]" (or .[vision] for CLIP + timm).

When to use: zero-shot image benchmarks (cifar10, imagenet) and the CLIP renderings of multimodal MCQ benchmarks (mmmu_pro_clip). Use a vision-language model through the HF/vLLM backends instead if you want generated, instruction-style answers.

mill eval "clip[path=ViT-B-32,pretrained=laion2b_s34b_b79k]" cifar10 \
  --output_dir ./results
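The scoring described above (cosine similarity with per-class prompt ensembling) can be sketched in plain Python. This follows the common CLIP recipe of averaging normalized prompt embeddings per class and is only an assumption about how the backend combines templates; the function names and the toy three-dimensional embeddings are made up for illustration.

```python
import math


def render_prompts(label: str, templates: list[str]) -> list[str]:
    """Fill the {c} placeholder of each template with the class name."""
    return [template.format(c=label) for template in templates]


def normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def class_embedding(template_embeddings: list[list[float]]) -> list[float]:
    """Ensemble one class: average its normalized prompt embeddings, then renormalize."""
    unit = [normalize(emb) for emb in template_embeddings]
    mean = [sum(column) / len(unit) for column in zip(*unit)]
    return normalize(mean)


def zero_shot_predict(image_embedding, text_embeddings_by_label) -> str:
    """Return the label whose ensembled text embedding is closest to the image."""
    image = normalize(image_embedding)
    scores = {
        label: sum(a * b for a, b in zip(image, class_embedding(embeddings)))
        for label, embeddings in text_embeddings_by_label.items()
    }
    return max(scores, key=scores.get)


# Toy embeddings standing in for real encoder outputs (two templates per class).
text_embeddings = {
    "cat": [[1.0, 0.1, 0.0], [0.9, 0.2, 0.0]],
    "dog": [[0.1, 1.0, 0.0], [0.0, 0.9, 0.1]],
}
print(render_prompts("cat", ["a photo of a {c}."]))
print(zero_shot_predict([0.8, 0.3, 0.1], text_embeddings))  # prints "cat"
```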

Inline args

Key	Type	Default	Description
`path`	str	required	open_clip architecture name, e.g. `ViT-B-32`
`pretrained`	str	—	open_clip weights tag, e.g. `laion2b_s34b_b79k`
`batch_size`	int	`64`	Images per forward pass
`prompt_template`	str	`"a photo of a {c}."`	Fallback template when a task sets none
`max_context_length`	int	`77`	CLIP text context length

path + pretrained together form the model identity used for output caching, so two weight sets of the same architecture stay distinct in your results.

timm

Vision-only supervised classification via timm. The model predicts over its fixed pretrained head, so the task’s labels must use the same class space (e.g. ImageNet-1k).

Requirements: pip install -e ".[timm]" (or .[vision] for CLIP + timm).

When to use: classic supervised vision baselines (e.g. a ResNet on ImageNet). Unlike CLIP, it does not score against arbitrary text labels; predictions are an argmax over the model’s built-in classes.

mill eval "timm[path=resnet50.a1_in1k]" imagenet \
  --output_dir ./results
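The fixed-head constraint can be made concrete with a small sketch: the prediction is an index into the head's own class list, so a task whose labels fall outside that list cannot be scored. Both helpers below are hypothetical and use plain lists in place of real model outputs.

```python
def check_label_space(task_labels: list[str], head_classes: list[str]) -> None:
    """Reject a task whose labels are not all classes of the model's head."""
    head = set(head_classes)
    missing = [label for label in task_labels if label not in head]
    if missing:
        raise ValueError(f"labels not in the model's class space: {missing}")


def predict_class(logits: list[float], head_classes: list[str]) -> str:
    """Return the class name at the argmax of the head's logits."""
    if len(logits) != len(head_classes):
        raise ValueError("one logit per head class is required")
    best = max(range(len(logits)), key=logits.__getitem__)
    return head_classes[best]


head_classes = ["tench", "goldfish", "great white shark"]  # toy 3-class head
check_label_space(["goldfish", "tench"], head_classes)
print(predict_class([0.1, 2.3, -1.0], head_classes))  # prints "goldfish"
```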

Inline args

Key	Type	Default	Description
`path`	str	required	Any timm model name, e.g. `resnet50.a1_in1k`
`pretrained`	bool	`true`	Load pretrained weights
`batch_size`	int	`64`	Images per forward pass
`num_classes`	int	model default	Override the classifier head size

Python config files

Config files let you version-control exact model settings and share them across runs. Place them anywhere and pass the path to mill eval.

# my_model.py: the `model` dict mirrors TransformersModel.__init__ kwargs
model = {
    "type": "hf",
    "path": "Qwen/Qwen2.5-VL-7B-Instruct",
    "modalities": ["text", "image"],
    "dtype": "bfloat16",
    "device_map": "auto",
    "max_context_length": 8192,
    "batch_size": 4,
    "use_chat_template": True,
}
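One way such a file could be turned into a dict is to execute it and read the module-level `model` variable. The sketch below is an assumption about the mechanism, not the implementation of load_model_from_file(); `load_model_config` is a hypothetical stand-in that uses only the standard library.

```python
import runpy
from pathlib import Path


def load_model_config(path: str) -> dict:
    """Execute a Python config file and return its `model` dict (illustrative only)."""
    config_path = Path(path)
    if config_path.suffix != ".py":
        raise ValueError(f"expected a .py config file, got {path!r}")
    # run_path executes the file and returns its global namespace as a dict.
    namespace = runpy.run_path(str(config_path))
    model = namespace.get("model")
    if not isinstance(model, dict):
        raise ValueError(f"{path!r} must define a dict named `model`")
    if "type" not in model:
        raise ValueError(f"{path!r}: `model` needs a 'type' key naming the backend")
    return model
```

Because the config is ordinary Python, it can compute values (for example, derive batch_size from an environment variable) before the dict is read.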

Built-in configs live under mill/models/configs/:

Family	Path
Qwen2.5-VL	`mill/models/configs/qwen/`
InternVL	`mill/models/configs/internvl/`
Llama	`mill/models/configs/llama/`

Writing a custom backend

Subclass MillModel and register it with register_model. Implement three hooks (_generate_batch, _loglikelihood_batch, and _loglikelihood_rolling_single) plus the model_name property. The base class wraps them with batching, progress bars, and automatic OOM retry, and exposes the public generate_until, loglikelihood, and loglikelihood_rolling methods that the evaluator calls:

from mill.api.model import MillModel, ModelCapabilities
from mill.api.registry import register_model

@register_model("my-backend")
class MyModel(MillModel):
    def __init__(self, path: str, **kwargs):
        self._path = path
        self.capabilities = ModelCapabilities(
            modalities={"text"},
            max_context_length=4096,
            supports_logprobs=True,
            supports_chat_template=False,
        )
        # load your model here

    @property
    def model_name(self) -> str:
        return self._path

    def _generate_batch(self, batch, gen_kwargs) -> list[str]:
        ...

    def _loglikelihood_batch(self, batch) -> list[tuple[float, bool]]:
        ...

    def _loglikelihood_rolling_single(self, request) -> float:
        ...

Once registered, use my-backend as the model name in mill eval.
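To show why a backend only needs to supply the hooks, here is a self-contained sketch of the wrapping pattern: a public method that slices requests into batches and retries with a smaller batch after an out-of-memory error. `BatchingBase`, `OutOfMemory`, and `EchoModel` are stand-ins invented for this example and do not reproduce MillModel's actual code.

```python
class OutOfMemory(RuntimeError):
    """Stand-in for a backend OOM error (e.g. torch.cuda.OutOfMemoryError)."""


class BatchingBase:
    """Plays the role of the base class: public method wraps the batch hook."""

    batch_size = 8

    def generate_until(self, requests, gen_kwargs=None):
        gen_kwargs = gen_kwargs or {}
        outputs = []
        start = 0
        size = self.batch_size
        while start < len(requests):
            batch = requests[start:start + size]
            try:
                outputs.extend(self._generate_batch(batch, gen_kwargs))
            except OutOfMemory:
                if size == 1:
                    raise  # a single request does not fit; nothing left to shrink
                size = max(1, size // 2)  # retry the same slice with half the batch
                continue
            start += len(batch)
        return outputs

    def _generate_batch(self, batch, gen_kwargs):
        raise NotImplementedError


class EchoModel(BatchingBase):
    """Toy backend that 'runs out of memory' above two requests per batch."""

    def _generate_batch(self, batch, gen_kwargs):
        if len(batch) > 2:
            raise OutOfMemory
        return [text.upper() for text in batch]


print(EchoModel().generate_until(["a", "b", "c", "d", "e"]))
# prints ['A', 'B', 'C', 'D', 'E']
```

The subclass never deals with slicing or retries, which mirrors the division of labor described above.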
