CIFAR-10 - Mill

CIFAR-10 is a 10-class image classification benchmark; Mill evaluates on its 10,000-image test split. Mill ships it in two renderings of the same data and runs whichever one your model supports:

cifar10 — CLIP-style zero-shot classification: the image is scored against the 10 class names by image–text similarity, ensembling 18 prompt templates per class (classnames and templates copied verbatim from clip_benchmark).
cifar10_mcq — generative multiple-choice for vision-language models: the model sees the image and the 10 classes as lettered options (shuffled per image) and answers with a letter, which is parsed and graded.
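
The generative rendering can be pictured with a small sketch. This is not Mill's implementation: the prompt wording, the per-image seeding scheme, the helper names (`build_mcq`, `parse_letter`, `grade`), and the deliberately naive parsing rules are all assumptions made for illustration. Only the overall shape (10 lettered options, shuffled per image, answer parsed as a letter and graded) comes from the description above.

```python
import random
import re
import string

# Standard CIFAR-10 labels; Mill's exact class-name strings may differ.
CLASSES = ["airplane", "automobile", "bird", "cat", "deer",
           "dog", "frog", "horse", "ship", "truck"]


def build_mcq(classes, image_index, seed=42):
    """Return (prompt, letter -> class mapping) with options shuffled per image."""
    # Hypothetical seeding scheme: one deterministic RNG per (seed, image).
    rng = random.Random(seed * 1_000_003 + image_index)
    options = list(classes)
    rng.shuffle(options)
    letters = string.ascii_uppercase[:len(options)]
    mapping = dict(zip(letters, options))
    lines = ["Which class best describes the image?"]  # assumed wording
    lines += [f"{letter}. {name}" for letter, name in mapping.items()]
    lines.append("Answer with a single letter.")
    return "\n".join(lines), mapping


def parse_letter(response, valid_letters):
    """Naively extract an option letter from a model reply, or return None."""
    text = response.strip()
    # Case 1: the whole reply is just a letter, e.g. "c", "C." or "(C)".
    match = re.fullmatch(r"\(?([A-Za-z])\)?[.:]?", text)
    if match is None:
        # Case 2: first standalone capital letter, e.g. "The answer is C."
        match = re.search(r"\b([A-Z])\b", text)
    if match is None:
        return None
    letter = match.group(1).upper()
    return letter if letter in valid_letters else None


def grade(response, mapping, true_class):
    """True when the parsed letter maps to the ground-truth class."""
    letter = parse_letter(response, mapping.keys())
    return letter is not None and mapping[letter] == true_class


prompt, mapping = build_mcq(CLASSES, image_index=0)
print(prompt)
```

Shuffling per image keeps a model from scoring well through a fixed letter preference, and seeding the shuffle keeps runs repeatable. A production parser would need to handle more reply styles than the two cases shown here.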

Evaluation configuration

| Hyperparameter | Value |
| --- | --- |
| Benchmark | `cifar10` (auto-picks `cifar10` for CLIP, `cifar10_mcq` for VLMs) |
| Dataset | `haideraltahan/wds_cifar10` (`test`, 10,000 images) |
| n-shots | `0` |
| Task type | `ZERO_SHOT_CLASSIFICATION` (CLIP) / `MULTIPLE_CHOICE` generative (VLM) |
| Metric | `acc` (top-1) / `cifar10_mcq_acc` |
| Backend | open_clip (`clip`) / HuggingFace VLM (`hf`) |

Reproduce

```shell
# CLIP-style zero-shot
mill --output_dir ./results eval \
  "clip[path=ViT-B-32,pretrained=laion2b_s34b_b79k]" cifar10

# Vision-language model (generative MCQ)
mill --output_dir ./results eval \
  "Qwen/Qwen3-VL-2B-Instruct[dtype=bfloat16]" cifar10 --seed 42

# Collect the results written by the runs above
mill --output_dir ./results collect --metric acc
```
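
The model arguments in the commands above follow a `name[key=value,...]` shape. The sketch below shows one way to read such a string; `parse_model_spec` is a hypothetical helper written for this page, not Mill's parser, and the grammar it accepts is inferred only from the two examples shown.

```python
import re

# Assumed grammar: a name, optionally followed by [key=value,key=value].
SPEC_RE = re.compile(r"^(?P<name>[^\[\]]+)(?:\[(?P<options>[^\[\]]*)\])?$")


def parse_model_spec(spec):
    """Split 'name[k1=v1,k2=v2]' into (name, {k1: v1, k2: v2})."""
    match = SPEC_RE.match(spec)
    if match is None:
        raise ValueError(f"unrecognised model spec: {spec!r}")
    options = {}
    if match.group("options"):
        for item in match.group("options").split(","):
            key, sep, value = item.partition("=")
            if not sep:
                raise ValueError(f"expected key=value, got {item!r}")
            options[key.strip()] = value.strip()
    return match.group("name"), options


print(parse_model_spec("clip[path=ViT-B-32,pretrained=laion2b_s34b_b79k]"))
print(parse_model_spec("Qwen/Qwen3-VL-2B-Instruct[dtype=bfloat16]"))
```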

Results

CLIP — ViT-B-32 (laion2b)

93.56% ± 0.25
zero-shot, top-1

Qwen3-VL-2B-Instruct

95.87% ± 0.20
generative MCQ

Mill copies clip_benchmark's CIFAR-10 class names and 18 zero-shot templates verbatim, so the cifar10 task reproduces the clip_benchmark zero-shot protocol. The headline published figure for this CLIP checkpoint is 66.6% zero-shot top-1 on ImageNet-1k (see the ImageNet page and the model card), which is not a CIFAR-10 number. When adding new baselines, cross-check the per-dataset CIFAR-10 figure against clip_benchmark.
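
As a rough illustration of template ensembling in CLIP-style zero-shot classification: each class name is filled into every template, the resulting text embeddings are unit-normalized and averaged into one class vector, and the image is assigned to the class with the highest cosine similarity. The sketch below uses plain Python lists and toy 2-D vectors in place of real encoder outputs. The two templates are placeholders rather than the 18 actual ones, and the function names are assumptions, not Mill or clip_benchmark APIs.

```python
import math

# Placeholder templates for illustration only; the real task uses 18.
TEMPLATES = ["a photo of a {c}.", "a blurry photo of a {c}."]


def render_prompts(classname, templates):
    """Fill one class name into every template."""
    return [template.format(c=classname) for template in templates]


def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def class_embedding(template_embeddings):
    """Average the unit-normalized prompt embeddings of one class, then renormalize."""
    unit = [l2_normalize(e) for e in template_embeddings]
    mean = [sum(column) / len(unit) for column in zip(*unit)]
    return l2_normalize(mean)


def zero_shot_predict(image_embedding, class_embeddings):
    """Index of the class whose ensembled text embedding is most similar to the image."""
    image = l2_normalize(image_embedding)
    scores = [sum(a * b for a, b in zip(image, c)) for c in class_embeddings]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy stand-ins for text-encoder outputs: two classes, two prompts each.
cat_prompt_embeddings = [[1.0, 0.1], [0.9, 0.2]]
dog_prompt_embeddings = [[0.1, 1.0], [0.2, 0.8]]
class_embeddings = [class_embedding(cat_prompt_embeddings),
                    class_embedding(dog_prompt_embeddings)]

print(render_prompts("cat", TEMPLATES))
print(zero_shot_predict([0.8, 0.3], class_embeddings))
```

With a real model, the toy vectors would be replaced by the text encoder's outputs for the rendered prompts and the image encoder's output for each test image.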

Per-model results

| Model | Rendering | Mill (top-1 `acc`) | Source |
| --- | --- | --- | --- |
| `ViT-B-32/laion2b_s34b_b79k` | CLIP zero-shot | 93.56% ± 0.25 | open_clip / clip_benchmark protocol |
| `Qwen/Qwen3-VL-2B-Instruct` | Generative MCQ | 95.87% ± 0.20 | Mill measurement (initial baseline) |
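
The page does not state how the ± values are computed. They are consistent with a binomial standard error over the 10,000 test images, in percentage points: SE = 100 · sqrt(p(1 − p) / n). The check below assumes that interpretation; `binomial_stderr_pct` is an illustrative helper, not part of Mill.

```python
import math


def binomial_stderr_pct(accuracy_pct, n_images):
    """Binomial standard error of an accuracy, in percentage points."""
    p = accuracy_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_images)


results = [
    ("ViT-B-32/laion2b_s34b_b79k", 93.56),
    ("Qwen/Qwen3-VL-2B-Instruct", 95.87),
]
for name, acc in results:
    print(f"{name}: {acc:.2f}% ± {binomial_stderr_pct(acc, 10_000):.2f}")
```

Under this assumption the two rows come out at ± 0.25 and ± 0.20, matching the table. The two accuracies differ by roughly nine such standard errors, although the renderings (zero-shot similarity versus generative MCQ) are different tasks and not a like-for-like comparison.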
