- cifar10: CLIP-style zero-shot classification. The image is scored against the 10 class names by image–text similarity, ensembling 18 prompt templates per class (class names and templates copied verbatim from clip_benchmark).
- cifar10_mcq: generative multiple-choice for vision-language models. The model sees the image and the 10 classes as lettered options (shuffled per image) and answers with a letter, which is parsed and graded.
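To make the prompt-ensembling step concrete, here is a minimal sketch of how a zero-shot classifier can be built from per-class templates. It is not Mill's or clip_benchmark's implementation: `toy_text_encoder` is a hypothetical stand-in for a real CLIP text tower, the two templates are placeholders for the 18 real ones, and the class list is the standard CIFAR-10 label set written out for illustration.

```python
import zlib

import numpy as np

# Standard CIFAR-10 labels, written out here for illustration only.
CLASSNAMES = ["airplane", "automobile", "bird", "cat", "deer",
              "dog", "frog", "horse", "ship", "truck"]
# Placeholder templates; the real task ensembles 18 of them.
TEMPLATES = ["a photo of a {c}.", "a blurry photo of a {c}."]


def toy_text_encoder(text: str, dim: int = 64) -> np.ndarray:
    """Hypothetical encoder: a deterministic random vector per string."""
    rng = np.random.default_rng(zlib.crc32(text.encode("utf-8")))
    return rng.standard_normal(dim)


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale vectors to unit length along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def build_classifier(classnames, templates, encode_text) -> np.ndarray:
    """Average the normalized template embeddings of each class."""
    weights = []
    for name in classnames:
        embs = np.stack([encode_text(t.format(c=name)) for t in templates])
        mean = l2_normalize(embs).mean(axis=0)
        weights.append(l2_normalize(mean))  # renormalize the ensemble
    return np.stack(weights)  # shape: (num_classes, dim)


def classify(image_embedding: np.ndarray, classifier: np.ndarray) -> int:
    """Return the index of the class with the highest cosine similarity."""
    scores = classifier @ l2_normalize(image_embedding)
    return int(np.argmax(scores))


classifier = build_classifier(CLASSNAMES, TEMPLATES, toy_text_encoder)
# Fake "image" embedding that coincides with the class weight for "cat".
pred = classify(classifier[3], classifier)
print(CLASSNAMES[pred])
```

With a real model, the image embedding would come from the image tower and the text embeddings from the matching text tower; only the averaging and argmax logic above is meant to carry over.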
Evaluation configuration
| Hyperparameter | Value |
|---|---|
| Benchmark | cifar10 (auto-picks cifar10 for CLIP, cifar10_mcq for VLMs) |
| Dataset | haideraltahan/wds_cifar10 (test, 10,000 images) |
| n-shots | 0 |
| Task type | ZERO_SHOT_CLASSIFICATION (CLIP) / MULTIPLE_CHOICE generative (VLM) |
| Metric | acc (top-1) / cifar10_mcq_acc |
| Backend | open_clip (clip) / HuggingFace VLM (hf) |
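The generative MCQ rendering named in the Task type row can be sketched as follows. This is an illustrative outline under stated assumptions, not Mill's code: the prompt wording, the helper names (`build_mcq`, `parse_letter`, `grade`), and the parsing rule are all hypothetical, and the class list is the standard CIFAR-10 label set.

```python
import random
import re
import string

CLASSNAMES = ["airplane", "automobile", "bird", "cat", "deer",
              "dog", "frog", "horse", "ship", "truck"]


def build_mcq(classnames, rng: random.Random):
    """Shuffle the classes and render them as lettered options."""
    options = list(classnames)
    rng.shuffle(options)  # a fresh order per image
    letters = string.ascii_uppercase[:len(options)]
    lines = [f"{letter}. {option}" for letter, option in zip(letters, options)]
    prompt = (
        "What object is shown in the image?\n"  # hypothetical wording
        + "\n".join(lines)
        + "\nAnswer with a single letter."
    )
    return prompt, dict(zip(letters, options))


def parse_letter(response: str, letters: str):
    """Extract the first standalone option letter, or None if absent.

    Simplification: a free-text answer starting with the pronoun "I"
    would collide with option I; a production parser needs stricter rules.
    """
    text = response.strip()
    if len(text) == 1 and text.upper() in letters:
        return text.upper()
    match = re.search(rf"\b([{letters}])\b", text)
    return match.group(1) if match else None


def grade(response: str, letter_to_class: dict, gold_class: str) -> bool:
    """Score one response as correct or incorrect."""
    letter = parse_letter(response, "".join(letter_to_class))
    return letter is not None and letter_to_class[letter] == gold_class


prompt, mapping = build_mcq(CLASSNAMES, random.Random(0))
gold_letter = next(k for k, v in mapping.items() if v == "cat")
print(prompt)
print(grade(f"The answer is ({gold_letter}).", mapping, "cat"))
```

Shuffling per image keeps a model from scoring well by always favoring one letter position, which is presumably why the options are reordered for each example.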
Reproduce
Results
CLIP — ViT-B-32 (laion2b)
93.56% ± 0.25
zero-shot, top-1
Qwen3-VL-2B-Instruct
95.87% ± 0.20
generative MCQ
Mill copies clip_benchmark’s CIFAR-10 class names and 18 zero-shot templates verbatim, so the cifar10 task reproduces the clip_benchmark zero-shot protocol. This CLIP checkpoint’s headline published figure is 66.6% zero-shot top-1 on ImageNet-1k (see the ImageNet page and the model card); a per-dataset CIFAR-10 figure should be cross-checked against clip_benchmark when adding new baselines.
Per-model results
| Model | Rendering | Mill (top-1 acc) | Source |
|---|---|---|---|
| ViT-B-32/laion2b_s34b_b79k | CLIP zero-shot | 93.56% ± 0.25 | open_clip / clip_benchmark protocol |
| Qwen/Qwen3-VL-2B-Instruct | Generative MCQ | 95.87% ± 0.20 | Mill measurement (initial baseline) |
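
The page does not state how the ± values are computed. One reading that matches the reported numbers is a binomial standard error over the 10,000 test images, sqrt(p(1 - p) / n), expressed in percentage points. The sketch below checks that arithmetic; treating the interval as a binomial standard error is an assumption, not something the source confirms.

```python
import math


def binomial_standard_error(accuracy: float, n: int) -> float:
    """Standard error of a proportion estimated from n independent trials."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


N_TEST = 10_000  # CIFAR-10 test set size from the configuration table
RESULTS = {
    "ViT-B-32/laion2b_s34b_b79k": 0.9356,
    "Qwen/Qwen3-VL-2B-Instruct": 0.9587,
}

# Assumption: the reported interval is one standard error, in points.
errors = {
    name: 100.0 * binomial_standard_error(acc, N_TEST)
    for name, acc in RESULTS.items()
}

for name, acc in RESULTS.items():
    print(f"{name}: {100 * acc:.2f}% ± {errors[name]:.2f}")
```

Under this reading, the gap between the two models (about 2.3 points) is several standard errors wide, though the two rows use different renderings of the task and are not a like-for-like comparison.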