ImageNet - Mill

ImageNet-1k is the standard 1000-class image classification benchmark (50K validation images). Mill registers it as one benchmark with two renderings, picked by model capability. CLIP-style models run zero-shot classification: each image is scored against the 1000 class names by image-text similarity, ensembling 80 prompt templates per class. Vision-language models run a generative multiple-choice variant: the true class plus 9 random distractors are shown as lettered options (A–J), and the answer letter is parsed from the generation.
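The zero-shot rendering can be illustrated with a small NumPy sketch. It assumes the common CLIP prompt-ensembling recipe (normalize each prompt embedding, average over the templates, renormalize); the source only states that the 80 templates are ensembled per class, so the exact steps inside Mill may differ. Random vectors stand in for real model embeddings, and the function names are hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Scale vectors along `axis` to unit length."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def build_class_embeddings(text_embeddings):
    """Ensemble prompt embeddings per class.

    text_embeddings: (num_classes, num_templates, dim)
    returns:         (num_classes, dim), unit-norm rows
    Assumed recipe: normalize, mean over templates, renormalize.
    """
    per_prompt = l2_normalize(text_embeddings)
    return l2_normalize(per_prompt.mean(axis=1))

def zero_shot_predict(image_embeddings, class_embeddings):
    """Return the index of the most similar class (cosine) per image."""
    sims = l2_normalize(image_embeddings) @ class_embeddings.T
    return sims.argmax(axis=1)

# Toy stand-in for a real encoder: 5 classes, 80 templates, 16-dim embeddings.
rng = np.random.default_rng(0)
num_classes, num_templates, dim = 5, 80, 16
centers = rng.normal(size=(num_classes, dim))
text = centers[:, None, :] + 0.1 * rng.normal(size=(num_classes, num_templates, dim))

labels = np.array([0, 3, 4, 1])
images = centers[labels] + 0.1 * rng.normal(size=(len(labels), dim))

class_emb = build_class_embeddings(text)
preds = zero_shot_predict(images, class_emb)
acc = float((preds == labels).mean())  # top-1 accuracy on the toy batch
print(preds, acc)
```

With a real model, `text` would hold the text-encoder output for every (class name, template) pair and `images` the image-encoder output for a batch.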

Evaluation configuration

Hyperparameter	Value
Benchmark	`imagenet` (`haideraltahan/wds_imagenet1k`, `test` split, 50K images)
Variants	`imagenet` (zero-shot, CLIP) · `imagenet_mcq` (generative MCQ, VLM)
n-shots	`0`
Output type	Zero-shot classification (image-text similarity) · `GENERATIVE` for the MCQ variant
Metric	`acc` (zero-shot) · `imagenet_mcq_acc` (MCQ)
Prompt templates	80 OpenAI CLIP templates, ensembled per class
Seed	`42` (fixes the MCQ distractor draw and option shuffle so runs are reproducible)
Precision (dtype)	`bfloat16`
Backend	open_clip (`clip`) for zero-shot · `hf` / `vllm` for the MCQ variant
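The MCQ variant described by this configuration (true class plus 9 distractors, shuffled, lettered A–J, answer letter parsed from the generation) might look roughly like the sketch below. Everything here is hypothetical: the function names, the prompt layout, the way the seed is combined with an example index, and the parsing rules are assumptions for illustration, not Mill's actual implementation.

```python
import random
import re
import string

LETTERS = string.ascii_uppercase[:10]  # "ABCDEFGHIJ"

def build_mcq(true_idx, class_names, example_idx, seed=42, num_options=10):
    """Return (options_text, answer_letter) for one image.

    Assumption: a per-example RNG derived from the global seed keeps the
    distractor draw and the option order reproducible across runs.
    """
    rng = random.Random(seed * 1_000_003 + example_idx)
    candidates = [i for i in range(len(class_names)) if i != true_idx]
    option_ids = rng.sample(candidates, num_options - 1) + [true_idx]
    rng.shuffle(option_ids)
    letters = LETTERS[:num_options]
    lines = [f"{letter}. {class_names[i]}" for letter, i in zip(letters, option_ids)]
    answer = letters[option_ids.index(true_idx)]
    return "\n".join(lines), answer

# A leading letter such as "C", "C.", "(C)" or "C) ...".
_LEADING = re.compile(r"^\s*\(?([A-J])\)?(?:[.:)]|\s*$)")
# A phrase such as "The answer is C" or "Answer: (C)".
_PHRASE = re.compile(r"(?i:answer)\s*(?:(?i:is)|:)?\s*\(?([A-J])\b")

def parse_answer(generation):
    """Extract an uppercase option letter from the model output, else None."""
    match = _LEADING.search(generation) or _PHRASE.search(generation)
    return match.group(1) if match else None

# Placeholder class names; real runs would use the 1000 Mill labels.
names = [f"class_{i}" for i in range(1000)]
options, answer = build_mcq(true_idx=657, class_names=names, example_idx=0)
print(options)
print("correct letter:", answer)
print("parsed:", parse_answer("The answer is C."))
```

Accuracy under this sketch is the fraction of images for which `parse_answer(generation) == answer`; an unparseable generation counts as wrong.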

Class labels

Mill scores classification against the class names, so each of the 1000 classes must have a distinct, single label. The names are OpenAI’s curated CLIP labels (for example, the WordNet “crane” is already split into crane bird and construction crane). Two pairs of those curated names still collided: classes 657 and 744 were both `missile`, and classes 836 and 837 were both `sunglasses`. Identical strings get identical image-text similarity, so the correct class could not reliably win. Mill disambiguates each pair by giving every class its own first WordNet synonym:

Class	Dataset label	Mill label
657	`missile`	`missile`
744	`missile`	`projectile`
836	`sunglasses`	`sunglass`
837	`sunglasses`	`sunglasses`

Only two entries change (classes 744 and 836); the other 998 are identical to the clip_benchmark export, so zero-shot scores stay directly comparable. A regression test asserts that all 1000 class names are unique, single labels.
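A uniqueness check of the kind described could be sketched as follows. The override table mirrors the two changed entries listed above; the helper names, the case-insensitive comparison, and the placeholder class list are assumptions for illustration, not Mill's actual test.

```python
from collections import Counter

# The two relabeled classes from the table above (index -> Mill label).
MILL_LABEL_OVERRIDES = {744: "projectile", 836: "sunglass"}

def apply_overrides(class_names, overrides=MILL_LABEL_OVERRIDES):
    """Return a copy of class_names with the overridden entries replaced."""
    names = list(class_names)
    for idx, label in overrides.items():
        names[idx] = label
    return names

def find_duplicates(class_names):
    """Return the sorted labels that occur more than once.

    Comparison is case-insensitive, a conservative choice made here.
    """
    counts = Counter(name.strip().lower() for name in class_names)
    return sorted(name for name, n in counts.items() if n > 1)

def check_class_names(class_names, expected=1000):
    """Raise AssertionError unless there are `expected` unique, non-empty names."""
    assert len(class_names) == expected, len(class_names)
    assert all(name.strip() for name in class_names), "empty class name"
    dupes = find_duplicates(class_names)
    assert not dupes, f"duplicate class names: {dupes}"

# Placeholder list that reproduces only the two collisions.
dataset_names = [f"class_{i}" for i in range(1000)]
dataset_names[657] = dataset_names[744] = "missile"
dataset_names[836] = dataset_names[837] = "sunglasses"

print(find_duplicates(dataset_names))       # the two colliding labels
mill_names = apply_overrides(dataset_names)
check_class_names(mill_names)               # passes once the overrides are applied
```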

Reproduce

# Zero-shot classification (CLIP)
mill --output_dir ./results eval \
  "clip[path=ViT-B-32,pretrained=laion2b_s34b_b79k]" imagenet

# Generative multiple-choice (vision-language model)
mill --output_dir ./results eval \
  "Qwen/Qwen3-VL-2B-Instruct[dtype=bfloat16]" imagenet --seed 42

# Collect the acc metric from the runs in ./results
mill --output_dir ./results collect --metric acc

Results

Results are pending a full evaluation run. After running the commands above, fill in the table below from the imagenet rollup row of your aggregate.csv, using the open_clip results as the reported baseline for CLIP zero-shot top-1.

Model	Mill (`acc`)	Reported	Source	Δ
`ViT-B-32 / laion2b_s34b_b79k`	—	—	open_clip results	—
