
Backend overview

| Backend | Registry name(s) | Install | Use case |
| --- | --- | --- | --- |
| HuggingFace Transformers | hf, huggingface, transformers | (core) | Local LLMs/VLMs, text + multimodal |
| vLLM | vllm | pip install -e ".[vllm]" | High-throughput local generation |
| LiteLLM | litellm, openai, api | pip install -e ".[litellm]" | OpenAI, Anthropic, and 100+ API providers |
| open_clip (CLIP) | clip, open_clip, openclip | pip install -e ".[clip]" | Zero-shot image classification / retrieval |
| timm | timm, pytorch-image-models | pip install -e ".[timm]" | Supervised vision classification (fixed head) |

Which backend serves which tasks

A model can only run tasks whose task_type its interface supports — Mill rejects mismatches up front with a clear error rather than producing wrong numbers.
| Backend | Task types it serves | When to use it |
| --- | --- | --- |
| HF / vLLM | GENERATIVE_QA, MULTIPLE_CHOICE, PERPLEXITY | Any LLM or VLM, text or multimodal. vLLM when you want throughput; HF for the widest model coverage and multimodal processors. |
| LiteLLM | GENERATIVE_QA, MULTIPLE_CHOICE (generative only) | Hosted API models. Generative tasks only: no log-prob/perplexity over an API. |
| open_clip | ZERO_SHOT_CLASSIFICATION | CLIP-family image–text models for zero-shot classification (e.g. CIFAR-10, ImageNet, the *_clip task variants). |
| timm | SUPERVISED_CLASSIFICATION | Vision models with a fixed pretrained head (e.g. ResNet on ImageNet-1k). |
LiteLLM (API) models support generative tasks only — log-probability and perplexity scoring aren’t available over an API. Use them with generative benchmarks like mmlu_pro, not log-prob ones like mmlu.
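
The up-front rejection described above can be pictured as a lookup against the table. The following is a minimal sketch, not Mill's actual implementation: the `TaskType` enum, the `SUPPORTED_TASKS` mapping, and `check_compatible` are hypothetical names, with the mapping transcribed from the table above.

```python
from enum import Enum, auto


class TaskType(Enum):
    """Hypothetical enum mirroring the task types named in the table."""
    GENERATIVE_QA = auto()
    MULTIPLE_CHOICE = auto()
    PERPLEXITY = auto()
    ZERO_SHOT_CLASSIFICATION = auto()
    SUPERVISED_CLASSIFICATION = auto()


_LLM_TASKS = {TaskType.GENERATIVE_QA, TaskType.MULTIPLE_CHOICE, TaskType.PERPLEXITY}

# Hypothetical mapping transcribed from the table; Mill's real check may
# be organized differently (e.g. per model interface, not per backend name).
SUPPORTED_TASKS = {
    "hf": _LLM_TASKS,
    "vllm": _LLM_TASKS,
    # API models: generative variants only, so no PERPLEXITY.
    "litellm": {TaskType.GENERATIVE_QA, TaskType.MULTIPLE_CHOICE},
    "clip": {TaskType.ZERO_SHOT_CLASSIFICATION},
    "timm": {TaskType.SUPERVISED_CLASSIFICATION},
}


def check_compatible(backend: str, task_type: TaskType) -> None:
    """Raise a descriptive error before any evaluation work starts."""
    supported = SUPPORTED_TASKS[backend]
    if task_type not in supported:
        names = ", ".join(sorted(t.name for t in supported))
        raise ValueError(
            f"Backend {backend!r} cannot run {task_type.name} tasks; "
            f"it supports: {names}"
        )
```

Failing before the first forward pass is the point of the design: a mismatch surfaces as an error message instead of a misleading score.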

Specifying a model

By HF model ID (shorthand)

mill eval meta-llama/Meta-Llama-3-8B-Instruct mmlu
Mill infers type=hf when the argument is a HuggingFace model path.

By backend name with inline args

Pass model arguments inline in brackets, key=value separated by commas. Quote the spec so your shell doesn’t interpret the brackets:
mill eval "litellm[model=gpt-4o]" mmlu_pro
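
To make the bracket syntax concrete, here is a minimal sketch of how a spec string could be split into a config dict. `parse_model_spec` and `_coerce` are hypothetical helpers written for illustration; Mill's real parser and its type-coercion rules may differ, and this sketch ignores config-file paths entirely.

```python
import re

# backend name, then everything between the outermost brackets
_SPEC_RE = re.compile(r"^(?P<backend>[^\[\]]+)\[(?P<args>.*)\]$")


def _coerce(value: str):
    """Turn 'true'/'4'/'0.9' into bool/int/float; leave other text as str."""
    lowered = value.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            continue
    return value


def parse_model_spec(spec: str) -> dict:
    """Parse 'backend[key=value,...]' into {'type': backend, key: value, ...}."""
    spec = spec.strip()
    match = _SPEC_RE.match(spec)
    if match is None:
        # Assumption: a bare argument is the HF model-ID shorthand.
        return {"type": "hf", "path": spec}
    config = {"type": match.group("backend")}
    for pair in filter(None, match.group("args").split(",")):
        key, sep, value = pair.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {pair!r}")
        config[key.strip()] = _coerce(value.strip())
    return config
```

Because a parser like this splits on commas, a list-valued argument has no unambiguous inline form, which is one plausible reason to keep such values in a config file.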

By Python config file

mill eval mill/models/configs/qwen/qwen2_5_vl_7b.py mmlu
Mill calls load_model_from_file() on the path and uses the returned dict.
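
For readers curious what loading a dict from a Python file involves, the sketch below uses the standard library's `importlib` machinery. It is not the source of `load_model_from_file()`; the function name `load_model_dict` and the assumption that the file exposes a module-level variable called `model` are illustrative.

```python
import importlib.util
from pathlib import Path


def load_model_dict(config_path: str, attr: str = "model") -> dict:
    """Execute a Python config file and return the dict it defines.

    Assumes the file defines a module-level dict named `attr`
    (hypothetical convention chosen for this sketch).
    """
    path = Path(config_path)
    spec = importlib.util.spec_from_file_location(path.stem, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot import config file: {path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs the file's top-level code
    config = getattr(module, attr, None)
    if not isinstance(config, dict):
        raise TypeError(f"{path} must define a dict named {attr!r}")
    return dict(config)  # copy so callers cannot mutate module state
```

Since the file is executed as ordinary Python, a config can compute values (for example, reading a path from an environment variable), and it should be treated as trusted code.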

HuggingFace Transformers

Supports text-only and multimodal models via AutoModelForCausalLM + AutoProcessor.

Inline args

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| path | str | required | HF model ID or local path |
| modalities | list[str] | ["text"] | Modalities handled, e.g. ["text", "image", "video", "audio"]. Set via a config file; a list can't be passed inline in brackets |
| dtype | str | bfloat16 | bfloat16 / float16 / float32 |
| device_map | str | auto | auto / cuda / cpu |
| max_context_length | int | 4096 | Token budget |
| batch_size | int | auto | Samples per forward pass (auto-estimated from GPU memory if unset) |
| attn_implementation | str |  | flash_attention_2 / sdpa |
| trust_remote_code | bool | true | Allow custom model code from HF |
| use_chat_template | bool | false | Wrap prompts with the tokenizer chat template |

Example

mill eval "hf[path=meta-llama/Meta-Llama-3-8B-Instruct,dtype=bfloat16,batch_size=4]" mmlu \
  --output_dir ./results

vLLM

High-throughput inference backend. Requires pip install -e ".[vllm]".
mill eval "vllm[path=meta-llama/Meta-Llama-3-8B-Instruct,dtype=bfloat16]" mmlu \
  --output_dir ./results
vLLM-specific args: gpu_memory_utilization (default 0.9), tensor_parallel_size (default 1), and max_model_len (override the model’s max sequence length).
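
These arguments could also be pinned in a Python config file. The fragment below is a sketch under two assumptions: that "type": "vllm" selects this backend in a config dict the same way the registry name does inline, and that the vLLM-specific keys are accepted there unchanged. The max_model_len value is purely illustrative.

```python
# vllm_model.py: hypothetical config file for the vLLM backend.
model = {
    "type": "vllm",                   # assumed to select the vLLM backend
    "path": "meta-llama/Meta-Llama-3-8B-Instruct",
    "dtype": "bfloat16",
    "gpu_memory_utilization": 0.9,    # documented default
    "tensor_parallel_size": 1,        # documented default
    "max_model_len": 8192,            # illustrative override, not a default
}
```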

LiteLLM

Wraps hosted API providers (OpenAI, Anthropic, and 100+ others) behind LiteLLM's OpenAI-style completion interface. Requires pip install -e ".[litellm]".
# OpenAI
OPENAI_API_KEY=sk-... mill eval "litellm[model=gpt-4o]" mmlu_pro \
  --output_dir ./results

# Anthropic
ANTHROPIC_API_KEY=sk-ant-... mill eval "litellm[model=claude-3-5-sonnet-20241022]" mmlu_pro \
  --output_dir ./results
Pass any LiteLLM completion parameter as an inline arg. API models run generative tasks only (e.g. mmlu_pro) — log-prob benchmarks like mmlu aren’t supported over an API.

open_clip (CLIP)

CLIP-style zero-shot image classification via open_clip. Each request carries an image and candidate text labels; the model returns the best-matching label by image–text cosine similarity, ensembling the task's prompt templates per class.

Requirements: pip install -e ".[clip]" (or .[vision] for CLIP + timm).

When to use: zero-shot image benchmarks (cifar10, imagenet) and the CLIP renderings of multimodal MCQ benchmarks (mmmu_pro_clip). Use a vision-language model through the HF/vLLM backends instead if you want generated, instruction-style answers.
mill eval "clip[path=ViT-B-32,pretrained=laion2b_s34b_b79k]" cifar10 \
  --output_dir ./results
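
The scoring step described above (cosine similarity with per-class prompt ensembling) can be sketched in plain Python on toy vectors. This follows the commonly used CLIP recipe of normalizing each template embedding, averaging per class, and renormalizing; whether Mill ensembles in exactly this way is an assumption, and the two-dimensional embeddings are made up for illustration.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def class_embedding(template_embeddings):
    """Ensemble one class: normalize each template, average, renormalize."""
    unit = [normalize(v) for v in template_embeddings]
    mean = [sum(column) / len(unit) for column in zip(*unit)]
    return normalize(mean)


def classify(image_embedding, text_embeddings_by_label):
    """Return (best_label, scores) using image-text cosine similarity."""
    image = normalize(image_embedding)
    scores = {
        label: sum(a * b for a, b in zip(image, class_embedding(embs)))
        for label, embs in text_embeddings_by_label.items()
    }
    return max(scores, key=scores.get), scores


# Toy data: two prompt templates per class, 2-D embeddings (made up).
toy_text = {
    "cat": [[1.0, 0.1], [0.9, 0.0]],
    "dog": [[0.0, 1.0], [0.2, 0.9]],
}
label, scores = classify([0.8, 0.3], toy_text)
```

In a real run the embeddings would come from the model's image and text encoders; only the similarity and ensembling arithmetic is shown here.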

Inline args

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| path | str | required | open_clip architecture name, e.g. ViT-B-32 |
| pretrained | str |  | open_clip weights tag, e.g. laion2b_s34b_b79k |
| batch_size | int | 64 | Images per forward pass |
| prompt_template | str | "a photo of a {c}." | Fallback template when a task sets none |
| max_context_length | int | 77 | CLIP text context length |
path + pretrained together form the model identity used for output caching, so two weight sets of the same architecture stay distinct in your results.

timm

Vision-only supervised classification via timm. The model predicts over its fixed pretrained head, so the task's labels must use the same class space (e.g. ImageNet-1k).

Requirements: pip install -e ".[timm]" (or .[vision] for CLIP + timm).

When to use: classic supervised vision baselines (e.g. a ResNet on ImageNet). Unlike CLIP, it does not score against arbitrary text labels; predictions are an argmax over the model's built-in classes.
mill eval "timm[path=resnet50.a1_in1k]" imagenet \
  --output_dir ./results
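
The reason the label space must match the head is visible in a small sketch of top-1 scoring: a prediction is just the index of the largest logit, so a target index is only meaningful if it refers to the same class list. `top1_accuracy` is a hypothetical helper on toy logits, not part of Mill or timm.

```python
def top1_accuracy(batch_logits: list[list[float]], targets: list[int]) -> float:
    """Fraction of samples whose argmax over the head equals the target index."""
    if not batch_logits or len(batch_logits) != len(targets):
        raise ValueError("need one target per non-empty row of logits")
    num_classes = len(batch_logits[0])
    correct = 0
    for logits, target in zip(batch_logits, targets):
        if not 0 <= target < num_classes:
            # A label outside the head's class space cannot be scored.
            raise ValueError(
                f"target {target} is outside the model's {num_classes} classes"
            )
        prediction = max(range(num_classes), key=logits.__getitem__)
        correct += int(prediction == target)
    return correct / len(targets)
```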

Inline args

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| path | str | required | Any timm model name, e.g. resnet50.a1_in1k |
| pretrained | bool | true | Load pretrained weights |
| batch_size | int | 64 | Images per forward pass |
| num_classes | int | model default | Override the classifier head size |

Python config files

Config files let you version-control exact model settings and share them across runs. Place them anywhere and pass the path to mill eval.
# my_model.py — returned dict mirrors TransformersModel.__init__ kwargs
model = {
    "type": "hf",
    "path": "Qwen/Qwen2.5-VL-7B-Instruct",
    "modalities": ["text", "image"],
    "dtype": "bfloat16",
    "device_map": "auto",
    "max_context_length": 8192,
    "batch_size": 4,
    "use_chat_template": True,
}
Built-in configs live under mill/models/configs/:
| Family | Path |
| --- | --- |
| Qwen2.5-VL | mill/models/configs/qwen/ |
| InternVL | mill/models/configs/internvl/ |
| Llama | mill/models/configs/llama/ |

Writing a custom backend

Subclass MillModel and register it. Implement the three hooks (_generate_batch, _loglikelihood_batch, and _loglikelihood_rolling_single) plus the model_name property. The base class wraps them with batching, progress bars, and automatic OOM retry, and exposes the public generate_until, loglikelihood, and loglikelihood_rolling methods that the evaluator calls:
from mill.api.model import MillModel, ModelCapabilities
from mill.api.registry import register_model

@register_model("my-backend")
class MyModel(MillModel):
    def __init__(self, path: str, **kwargs):
        self._path = path
        self.capabilities = ModelCapabilities(
            modalities={"text"},
            max_context_length=4096,
            supports_logprobs=True,
            supports_chat_template=False,
        )
        # load your model here

    @property
    def model_name(self) -> str:
        return self._path

    def _generate_batch(self, batch, gen_kwargs) -> list[str]:
        ...

    def _loglikelihood_batch(self, batch) -> list[tuple[float, bool]]:
        ...

    def _loglikelihood_rolling_single(self, request) -> float:
        ...
Once registered, use my-backend as the model name in mill eval.
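
To illustrate the kind of wrapping the base class provides, the sketch below shows one way a public generate_until could batch requests and retry with a smaller batch after an out-of-memory error. `SketchBase` and `EchoModel` are hypothetical stand-ins, not MillModel's code; Python's built-in `MemoryError` substitutes for a GPU error such as `torch.cuda.OutOfMemoryError`, and halving the batch is an assumed retry policy.

```python
class SketchBase:
    """Hypothetical stand-in for the batching + OOM-retry wrapper."""

    def __init__(self, batch_size: int = 8):
        self.batch_size = batch_size

    def _generate_batch(self, batch, gen_kwargs) -> list[str]:
        raise NotImplementedError  # subclasses supply the model call

    def generate_until(self, requests, gen_kwargs=None) -> list[str]:
        gen_kwargs = gen_kwargs or {}
        outputs: list[str] = []
        start, size = 0, self.batch_size
        while start < len(requests):
            batch = requests[start:start + size]
            try:
                outputs.extend(self._generate_batch(batch, gen_kwargs))
            except MemoryError:
                if size == 1:
                    raise  # cannot shrink further; surface the error
                size = max(1, size // 2)  # assumed policy: halve and retry
                continue
            start += len(batch)
        return outputs


class EchoModel(SketchBase):
    """Toy backend that simulates OOM on batches larger than two."""

    def __init__(self):
        super().__init__(batch_size=8)
        self.batch_sizes_seen: list[int] = []

    def _generate_batch(self, batch, gen_kwargs) -> list[str]:
        self.batch_sizes_seen.append(len(batch))
        if len(batch) > 2:
            raise MemoryError("simulated OOM")
        return [prompt.upper() for prompt in batch]


toy_model = EchoModel()
toy_outputs = toy_model.generate_until(["a", "b", "c", "d", "e"])
```

Keeping the retry loop in the base class means a backend author only writes the per-batch model call, and output order still matches request order because a failed batch is retried before moving on.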