CIFAR-10 is a 10-class image classification benchmark; Mill evaluates on its 10,000-image test split. Mill ships it in two renderings of the same data and runs whichever one your model supports:
  • cifar10: CLIP-style zero-shot classification. The image is scored against the 10 class names by image–text similarity, ensembling 18 prompt templates per class (class names and templates copied verbatim from clip_benchmark).
  • cifar10_mcq: generative multiple-choice for vision-language models. The model sees the image and the 10 classes as lettered options (shuffled per image) and answers with a letter, which is parsed and graded.
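
The multiple-choice rendering can be pictured with a minimal Python sketch. It is illustrative only and is not Mill's implementation: the prompt wording, the "A. name" option format, the helper names (build_mcq, parse_letter, grade), and the strict parsing rule are all assumptions. Only the standard CIFAR-10 class names, the per-image shuffle, and the letter-based grading come from the description above.

```python
import random
import re
import string

# Standard CIFAR-10 class names.
CLASSES = ["airplane", "automobile", "bird", "cat", "deer",
           "dog", "frog", "horse", "ship", "truck"]


def build_mcq(true_class, rng):
    """Shuffle the classes for one image; return (prompt, gold_letter)."""
    options = CLASSES[:]
    rng.shuffle(options)  # a fresh option order per image
    letters = string.ascii_uppercase[:len(options)]  # "A".."J"
    lines = [f"{letter}. {name}" for letter, name in zip(letters, options)]
    # Hypothetical prompt wording, chosen only for illustration.
    prompt = ("What object is shown in the image?\n"
              + "\n".join(lines)
              + "\nAnswer with a single letter.")
    gold = letters[options.index(true_class)]
    return prompt, gold


def parse_letter(response, n_options=10):
    """Return a leading option letter such as 'B', '(b)' or 'B. bird', else None."""
    last = string.ascii_uppercase[n_options - 1]
    match = re.match(rf"\s*\(?([A-{last}])(?:[.):]|\s*$)", response.upper())
    return match.group(1) if match else None


def grade(response, gold):
    """An answer is correct only if a letter parses and equals the gold letter."""
    return parse_letter(response) == gold
```

In this sketch a free-form reply such as "I think it is a dog" parses to None and is graded as wrong. A real grader may be more lenient, so treat the parsing rule as a placeholder.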

Evaluation configuration

| Hyperparameter | Value |
|---|---|
| Benchmark | cifar10 (auto-picks cifar10 for CLIP, cifar10_mcq for VLMs) |
| Dataset | haideraltahan/wds_cifar10 (test, 10,000 images) |
| n-shots | 0 |
| Task type | ZERO_SHOT_CLASSIFICATION (CLIP) / MULTIPLE_CHOICE generative (VLM) |
| Metric | acc (top-1) / cifar10_mcq_acc |
| Backend | open_clip (clip) / HuggingFace VLM (hf) |

Reproduce

# CLIP-style zero-shot
mill --output_dir ./results eval \
  "clip[path=ViT-B-32,pretrained=laion2b_s34b_b79k]" cifar10

# Vision-language model (generative MCQ)
mill --output_dir ./results eval \
  "Qwen/Qwen3-VL-2B-Instruct[dtype=bfloat16]" cifar10 --seed 42

mill --output_dir ./results collect --metric acc

Results

CLIP — ViT-B-32 (laion2b)

93.56% ± 0.25
zero-shot, top-1

Qwen3-VL-2B-Instruct

95.87% ± 0.20
generative MCQ
Mill copies clip_benchmark’s CIFAR-10 class names and 18 zero-shot templates verbatim, so the cifar10 task reproduces the clip_benchmark zero-shot protocol. The headline published figure for this CLIP checkpoint is 66.6% zero-shot top-1 on ImageNet-1k (see the ImageNet page and the model card), not a CIFAR-10 number. When adding new baselines, cross-check per-dataset CIFAR-10 figures against clip_benchmark.
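
As a rough illustration of the zero-shot protocol with prompt ensembling, the sketch below builds one text embedding per class by averaging the normalized embeddings of all templates, then picks the class whose embedding is most similar to the image embedding. It is a NumPy sketch under stated assumptions, not Mill's or clip_benchmark's code: encode_text is a hypothetical callable standing in for the model's text tower, the "{c}" placeholder is assumed, and the function names are made up for this example.

```python
import numpy as np


def l2_normalize(x, axis=-1):
    """Scale vectors to unit length along the given axis."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)


def build_zero_shot_classifier(classnames, templates, encode_text):
    """Return one ensembled text embedding per class, shape (n_classes, dim).

    encode_text is a hypothetical callable: list[str] -> array (n_prompts, dim).
    """
    weights = []
    for name in classnames:
        # Fill every template with the class name ("{c}" placeholder assumed).
        prompts = [template.format(c=name) for template in templates]
        embeddings = l2_normalize(np.asarray(encode_text(prompts), dtype=float))
        # Ensemble: average over templates, then re-normalize.
        weights.append(l2_normalize(embeddings.mean(axis=0)))
    return np.stack(weights)


def classify(image_embeddings, classifier):
    """Top-1 prediction: index of the class with the highest cosine similarity."""
    logits = l2_normalize(np.asarray(image_embeddings, dtype=float)) @ classifier.T
    return logits.argmax(axis=1)
```

With a real model, encode_text would wrap the tokenizer and text encoder (for example open_clip's tokenizer followed by model.encode_text), and top-1 accuracy is the fraction of test images for which classify returns the labelled class.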

Per-model results

| Model | Rendering | Mill (top-1 acc) | Source |
|---|---|---|---|
| ViT-B-32/laion2b_s34b_b79k | CLIP zero-shot | 93.56% ± 0.25 | open_clip / clip_benchmark protocol |
| Qwen/Qwen3-VL-2B-Instruct | Generative MCQ | 95.87% ± 0.20 | Mill measurement (initial baseline) |
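
This page does not say how the ± values are computed. One plausible reading, offered here as an assumption rather than a statement about Mill, is the binomial standard error of the accuracy over the 10,000 test images, sqrt(p(1 - p) / n), expressed in percentage points. The short check below shows that this reading reproduces both reported values.

```python
import math


def binomial_standard_error(accuracy, n):
    """Standard error of a proportion estimated from n independent trials."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


N_TEST = 10_000  # size of the CIFAR-10 test split

for name, acc in [("ViT-B-32 (CLIP zero-shot)", 0.9356),
                  ("Qwen3-VL-2B-Instruct (generative MCQ)", 0.9587)]:
    se = binomial_standard_error(acc, N_TEST)
    print(f"{name}: {100 * acc:.2f}% ± {100 * se:.2f}")

# ViT-B-32 (CLIP zero-shot): 93.56% ± 0.25
# Qwen3-VL-2B-Instruct (generative MCQ): 95.87% ± 0.20
```

If the ± values instead came from repeated seeds or bootstrap resampling, the numbers could coincide by chance, so confirm the method before comparing error bars across tools.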