esc50: zero-shot for CLAP-style audio-text encoders. Each clip is scored against the 50 class names by audio-text similarity, using the standard This is a sound of {c}. prompt.

esc50_generative: a generative rendering for audio-language models. The model hears the clip and the 50 candidate categories and answers with a category name, which is matched back to a class.
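The zero-shot scoring rule can be sketched in a few lines. This is a minimal illustration, not Mill's implementation: it assumes the audio and text embeddings have already been produced by some CLAP-style encoder, and the helper names (`build_prompts`, `l2_normalize`, `zero_shot_predict`) are hypothetical.

```python
import math

# Prompt template quoted from the text; {c} is replaced by the class name.
PROMPT = "This is a sound of {c}."


def build_prompts(class_names):
    """Fill the prompt template once per class name."""
    return [PROMPT.format(c=name) for name in class_names]


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def zero_shot_predict(audio_emb, text_embs):
    """Return (best_index, scores) using cosine similarity.

    audio_emb: embedding of one clip (assumed to come from the audio encoder).
    text_embs: one embedding per class prompt (assumed to come from the text encoder).
    """
    a = l2_normalize(audio_emb)
    scores = []
    for t in text_embs:
        t = l2_normalize(t)
        scores.append(sum(x * y for x, y in zip(a, t)))
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Toy 2-D embeddings, purely illustrative; real embeddings are much larger.
classes = ["dog", "rain", "clock tick"]
prompts = build_prompts(classes)
toy_text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_audio = [0.1, 2.0]
idx, scores = zero_shot_predict(toy_audio, toy_text)
print(prompts[idx])
```

With real embeddings the same argmax is taken over all 50 prompts for each clip, and accuracy is the fraction of clips whose argmax matches the label.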
Dataset: ashraq/esc50; benchmark by Piczak (2015). No train/test split is needed: the encoder isn’t trained, so all 2,000 clips are scored.
Evaluation configuration
| Hyperparameter | Value |
|---|---|
| Benchmark | esc50 (auto-picks esc50 for CLAP, esc50_generative for audio-LMs) |
| Dataset | ashraq/esc50 (train, 2,000 clips, 50 classes) |
| n-shots | 0 |
| Task type | ZERO_SHOT_CLASSIFICATION (CLAP) / GENERATIVE_QA (audio-LM) |
| Prompt | This is a sound of {c}. (CLAP) / list-of-categories instruction (audio-LM) |
| Metric | acc / esc50_gen_acc |
| Backend | CLAP (clap) / HuggingFace audio-LM (hf) |
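For the generative task, the free-text answer has to be mapped back to one of the 50 classes before accuracy can be computed. The source does not specify the matching rule, so the sketch below is an assumption: it normalizes case and punctuation, tries an exact match, then falls back to the longest class name that appears as a whole phrase in the answer. The function names and the three example class names are illustrative only.

```python
import re


def normalize(text):
    """Lowercase, turn underscores and punctuation into spaces, trim."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def match_answer(answer, class_names):
    """Map a free-text answer to a class index, or None if nothing matches."""
    ans = normalize(answer)
    normalized = [normalize(c) for c in class_names]
    # 1) Exact match after normalization.
    if ans in normalized:
        return normalized.index(ans)
    # 2) Otherwise, the longest class name found as a whole phrase in the answer.
    hits = [
        i for i, c in enumerate(normalized)
        if re.search(rf"\b{re.escape(c)}\b", ans)
    ]
    if not hits:
        return None
    return max(hits, key=lambda i: len(normalized[i]))


def generative_accuracy(answers, labels, class_names):
    """Fraction of answers whose matched class equals the label index."""
    correct = sum(match_answer(a, class_names) == y for a, y in zip(answers, labels))
    return correct / len(labels)


# Illustrative subset of class names; the full benchmark has 50.
classes = ["dog", "crying_baby", "rain"]
print(match_answer("The sound is a Crying Baby.", classes))
print(generative_accuracy(["Dog", "It is rain", "cat"], [0, 2, 1], classes))
```

An unmatched answer counts as wrong under this rule, which is the conservative choice for a 50-way task.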
Reproduce
Results
| Model | Accuracy | Method |
|---|---|---|
| CLAP (clap-htsat-unfused) | 91.55% ± 0.62 | zero-shot audio-text similarity |
| Qwen2-Audio-7B-Instruct | 46.55% ± 1.12 | generative category naming |
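The source does not say how the ± figures are computed. They are, however, numerically consistent with a binomial standard error over the 2,000 scored clips, which the following check illustrates. Treat that interpretation as an assumption; `binomial_se` is a hypothetical helper name.

```python
import math


def binomial_se(acc, n):
    """Standard error of a proportion: sqrt(p * (1 - p) / n)."""
    return math.sqrt(acc * (1.0 - acc) / n)


N_CLIPS = 2000   # all ESC-50 clips are scored
N_CLASSES = 50   # 50-way classification

for name, acc in [("CLAP", 0.9155), ("Qwen2-Audio", 0.4655)]:
    se = binomial_se(acc, N_CLIPS)
    print(f"{name}: {100 * acc:.2f}% ± {100 * se:.2f}")

# Chance level for a uniform guess over 50 classes.
print(f"chance: {100 / N_CLASSES:.0f}%")
```

Run as written, this prints 0.62 and 1.12 for the two models, matching the reported intervals, and a chance level of 2%.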
The published ESC-50 zero-shot accuracy for the laion/clap-htsat-unfused checkpoint is ≈94.76% (MAEB). This should not be confused with the original CLAP paper’s 82.6%, which comes from a different (Microsoft) model. Mill measures 91.55% while following LAION-CLAP’s official zero-shot protocol exactly: the same This is a sound of {c}. prompt, L2-normalized embeddings, and rand_trunc/repeatpad audio preprocessing. Prompt ensembling, int16 quantization, and lower precision were each tested; none closes the gap, and each slightly lowers accuracy. The residual is the documented accuracy drop of the HuggingFace-converted CLAP weights relative to the original laion_clap checkpoint (LAION-AI/CLAP #126), which is intrinsic to the checkpoint Mill evaluates.

The generative rendering has no published baseline, so Qwen2-Audio’s 46.55% is recorded as an initial baseline. Both tasks are 50-way (chance = 2%).