ImageNet-1k is the standard 1000-class image classification benchmark (50K validation images). Mill registers it as one benchmark with two renderings, picked by model capability: CLIP-style models run zero-shot classification — each image is scored against the 1000 class names by image-text similarity, ensembling 80 prompt templates per class — and vision-language models run a generative multiple-choice variant, where the true class plus 9 random distractors are shown as lettered options (A–J) and the answer letter is parsed from the generation.
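The zero-shot rendering can be sketched in a few lines of NumPy. This is a minimal illustration, not Mill's implementation: it assumes the common CLIP recipe (normalize each template embedding, average per class, re-normalize, then take the arg-max of cosine similarity), and it uses random toy vectors in place of a real image and text encoder. The function names and the toy sizes are hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Scale vectors to unit length along `axis`."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def build_class_embeddings(text_features):
    """Ensemble prompt templates into one embedding per class.

    text_features: (num_classes, num_templates, dim) raw text embeddings,
    one per (class name, template) pair.
    Returns (num_classes, dim) unit-length class embeddings.
    """
    normalized = l2_normalize(text_features)       # normalize every template embedding
    return l2_normalize(normalized.mean(axis=1))   # average per class, then re-normalize

def zero_shot_predict(image_features, class_embeddings):
    """Pick, for each image, the class with the highest cosine similarity."""
    sims = l2_normalize(image_features) @ class_embeddings.T
    return sims.argmax(axis=1)

# Toy stand-ins for encoder outputs (assumption: a real run would use 1000 classes
# and embeddings from a CLIP model; here 5 classes, 80 templates, 16 dimensions).
rng = np.random.default_rng(0)
num_classes, num_templates, dim = 5, 80, 16
prototypes = rng.normal(size=(num_classes, dim))
text_features = prototypes[:, None, :] + 0.1 * rng.normal(
    size=(num_classes, num_templates, dim)
)
labels = np.array([0, 3, 4, 1])
image_features = prototypes[labels] + 0.1 * rng.normal(size=(len(labels), dim))

class_embeddings = build_class_embeddings(text_features)
predictions = zero_shot_predict(image_features, class_embeddings)
accuracy = float((predictions == labels).mean())
```

Averaging the templates once per class means each image needs a single matrix product against the class embeddings at evaluation time, regardless of how many templates were ensembled.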

Evaluation configuration

| Hyperparameter | Value |
| --- | --- |
| Benchmark | imagenet (haideraltahan/wds_imagenet1k, test split, 50K images) |
| Variants | imagenet (zero-shot, CLIP) · imagenet_mcq (generative MCQ, VLM) |
| n-shots | 0 |
| Output type | Zero-shot classification (image-text similarity) · GENERATIVE for the MCQ variant |
| Metric | acc (zero-shot) · imagenet_mcq_acc (MCQ) |
| Prompt templates | 80 OpenAI CLIP templates, ensembled per class |
| Seed | 42 (fixes the MCQ distractor draw and option shuffle so runs are reproducible) |
| Precision (dtype) | bfloat16 |
| Backend | open_clip (clip) for zero-shot · hf / vllm for the MCQ variant |
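To make the role of the seed concrete, here is a hedged sketch of how a seeded MCQ item could be built and scored. It is not Mill's code: the per-sample seeding scheme, the prompt layout, the function names `build_mcq` and `parse_answer`, and the strict answer-parsing rule are all assumptions chosen for illustration. Only the shape of the task (true class plus 9 random distractors, lettered A–J, answer letter parsed from the generation) comes from the description above.

```python
import random
import re
import string

def build_mcq(sample_id, true_idx, class_names, seed=42, num_distractors=9):
    """Return (options_text, answer_letter) for one image.

    Assumption: the RNG is derived from the global seed and a per-sample id,
    so the same sample always gets the same distractors and option order.
    """
    rng = random.Random(seed * 1_000_003 + sample_id)
    candidates = [i for i in range(len(class_names)) if i != true_idx]
    options = rng.sample(candidates, num_distractors) + [true_idx]
    rng.shuffle(options)  # seeded shuffle fixes where the true class lands
    letters = string.ascii_uppercase[: len(options)]
    answer = letters[options.index(true_idx)]
    lines = [f"{letter}. {class_names[i]}" for letter, i in zip(letters, options)]
    return "\n".join(lines), answer

def parse_answer(generation, num_options=10):
    """Extract a leading option letter such as 'C', 'C.', or '(C)'.

    Hypothetical strict rule: the letter must open the generation and be
    followed by punctuation or the end of the text; anything else is None.
    """
    valid = string.ascii_uppercase[:num_options]
    match = re.match(rf"\s*\(?([{valid}])\)?(?:[.:)]|\s*$)", generation)
    return match.group(1) if match else None

# Placeholder class names; a real run would use the 1000 curated labels.
class_names = [f"class_{i}" for i in range(1000)]
options_text, answer = build_mcq(sample_id=7, true_idx=123, class_names=class_names)
repeat_text, repeat_answer = build_mcq(sample_id=7, true_idx=123, class_names=class_names)
is_correct = parse_answer(f"{answer}. class_123") == answer
```

The parser is deliberately strict because the option range A–J includes the letter I, so a looser rule would misread a generation such as "I think B" as choosing option I.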

Class labels

Mill scores classification against the class names, so each of the 1000 classes must be a distinct, single label. The names are OpenAI’s curated CLIP labels (for example, the WordNet “crane” is already split into crane bird and construction crane). Two of those curated names were still each shared by two classes. Identical strings get identical image-text similarity, so the correct class could not reliably win. Mill relabels each affected class with the first synonym of its own WordNet synset:

| Class | Dataset label | Mill label |
| --- | --- | --- |
| 657 | missile | missile |
| 744 | missile | projectile |
| 836 | sunglasses | sunglass |
| 837 | sunglasses | sunglasses |
Only these two entries change; the other 998 are identical to the clip_benchmark export, so zero-shot scores stay directly comparable. A regression test asserts all 1000 class names are unique, single labels.
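A uniqueness check of this kind might look like the sketch below. It is illustrative only: the helper name `find_duplicate_labels` is hypothetical, the case-insensitive comparison is an assumption, and only the four affected classes are included rather than the full list of 1000 names.

```python
from collections import Counter

def find_duplicate_labels(class_names):
    """Return the sorted labels that occur more than once.

    Assumption: labels are compared after stripping whitespace and lowercasing.
    """
    counts = Counter(name.strip().lower() for name in class_names)
    return sorted(name for name, n in counts.items() if n > 1)

# The four affected classes, before and after disambiguation.
dataset_labels = {657: "missile", 744: "missile", 836: "sunglasses", 837: "sunglasses"}
mill_labels = {657: "missile", 744: "projectile", 836: "sunglass", 837: "sunglasses"}

duplicates_before = find_duplicate_labels(dataset_labels.values())
duplicates_after = find_duplicate_labels(mill_labels.values())
changed = sorted(i for i in dataset_labels if dataset_labels[i] != mill_labels[i])
```

Here `duplicates_before` lists the two colliding names, `duplicates_after` is empty, and `changed` contains the two class indices whose labels were rewritten.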

Reproduce

# Zero-shot classification (CLIP)
mill --output_dir ./results eval \
  "clip[path=ViT-B-32,pretrained=laion2b_s34b_b79k]" imagenet

# Generative multiple-choice (vision-language model)
mill --output_dir ./results eval \
  "Qwen/Qwen3-VL-2B-Instruct[dtype=bfloat16]" imagenet --seed 42

mill --output_dir ./results collect --metric acc

Results

Results are pending a full evaluation run. After running the commands above, fill in the table below from the imagenet rollup row of your aggregate.csv, using the open_clip results as the reported baseline for CLIP zero-shot top-1.
| Model | Mill (acc) | Reported | Source | Δ |
| --- | --- | --- | --- | --- |
| ViT-B-32 / laion2b_s34b_b79k | | | open_clip results | |