ESC-50 - Mill

ESC-50 is a 50-class environmental sound classification benchmark: 2,000 five-second clips spanning animals, natural soundscapes, human non-speech, interior, and exterior/urban sounds. Mill runs two renderings of the same 2,000 clips, auto-selected per model:

`esc50`: zero-shot rendering for CLAP-style audio-text encoders. Each clip is scored against the 50 class names by audio-text similarity, using the standard `This is a sound of {c}.` prompt.
`esc50_generative`: generative rendering for audio-language models. The model hears the clip, is given the 50 candidate categories, and answers with a category name, which is matched back to a class.
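The "matched back to a class" step of the generative rendering can be illustrated with a minimal sketch. Mill's actual matching rules are not described here, so the helper names (`normalize`, `match_category`) and the strategy (exact match after normalization, then the longest class name found as a whole phrase) are assumptions for illustration only.

```python
import re
from typing import List, Optional


def normalize(text: str) -> str:
    """Lowercase and turn underscores/punctuation into single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def match_category(response: str, class_names: List[str]) -> Optional[str]:
    """Map a free-text model answer to one class name, or None if no class matches.

    Hypothetical strategy: exact match first, then the longest class name
    that appears in the answer as a whole phrase.
    """
    answer = normalize(response)
    by_norm = {normalize(name): name for name in class_names}
    if answer in by_norm:
        return by_norm[answer]
    hits = [n for n in by_norm if re.search(rf"\b{re.escape(n)}\b", answer)]
    if not hits:
        return None  # counted as incorrect when computing accuracy
    return by_norm[max(hits, key=len)]


# A handful of ESC-50 category names, used only as sample input.
classes = ["dog", "rain", "train", "clock_tick", "vacuum_cleaner"]
print(match_category("The sound is a Vacuum cleaner.", classes))
print(match_category("I hear a train passing by", classes))
print(match_category("It sounds like music", classes))
```

Whole-phrase matching keeps `rain` from matching inside `train`. An answer that names several categories would still resolve to a single class under this sketch, which a stricter matcher might prefer to reject.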

Dataset: `ashraq/esc50`; benchmark by Piczak (2015). No train/test split is needed: both renderings are zero-shot, so nothing is trained on ESC-50 and all 2,000 clips are scored.

Evaluation configuration

| Hyperparameter | Value |
| --- | --- |
| Benchmark | `esc50` (auto-picks `esc50` for CLAP, `esc50_generative` for audio-LMs) |
| Dataset | `ashraq/esc50` (`train`, 2,000 clips, 50 classes) |
| n-shots | `0` |
| Task type | `ZERO_SHOT_CLASSIFICATION` (CLAP) / `GENERATIVE_QA` (audio-LM) |
| Prompt | `This is a sound of {c}.` (CLAP) / list-of-categories instruction (audio-LM) |
| Metric | `acc` / `esc50_gen_acc` |
| Backend | CLAP (`clap`) / HuggingFace audio-LM (`hf`) |

Reproduce

# CLAP zero-shot
mill --output_dir ./results eval mill/models/configs/clap/clap_htsat_unfused.py esc50
mill --output_dir ./results collect --metric acc

# Audio-language model (generative category naming)
mill --output_dir ./results eval mill/models/configs/qwen/qwen2_audio_7b.py esc50
mill --output_dir ./results collect --metric esc50_gen_acc

Results

CLAP — clap-htsat-unfused

91.55% ± 0.62
zero-shot audio-text similarity

Qwen2-Audio-7B-Instruct

46.55% ± 1.12
generative category naming

Published ESC-50 zero-shot accuracy for the `laion/clap-htsat-unfused` checkpoint is ≈94.76% (MAEB). This is not to be confused with the original CLAP paper's 82.6%, which is for a different (Microsoft) model. Mill measures 91.55% while following LAION-CLAP's official zero-shot protocol exactly: the same `This is a sound of {c}.` prompt, L2-normalized embeddings, and `rand_trunc`/`repeatpad` audio preprocessing. Prompt ensembling, int16 quantization, and lower precision were each tested; none closes the gap, and each slightly lowers accuracy. The residual is the documented accuracy drop of the HuggingFace-converted CLAP weights versus the original `laion_clap` checkpoint (LAION-AI/CLAP #126), so it is intrinsic to the checkpoint Mill evaluates.

The generative rendering has no published baseline, so Qwen2-Audio's 46.55% is recorded as an initial baseline. Both renderings are 50-way classification (chance = 2%).
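The scoring step of the zero-shot protocol (one prompt per class, L2-normalized embeddings, highest audio-text similarity wins) can be sketched in a few lines. This is a simplified illustration and not Mill's implementation: the embeddings are assumed to come from the CLAP audio and text encoders, the function names are hypothetical, and replacing underscores in class names with spaces is an assumption about how `{c}` is filled.

```python
import math

PROMPT = "This is a sound of {c}."


def build_prompts(class_names):
    """One text prompt per class; underscore-to-space is an assumption."""
    return [PROMPT.format(c=name.replace("_", " ")) for name in class_names]


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_predict(audio_emb, text_embs):
    """Return the index of the class whose prompt embedding is most similar."""
    a = l2_normalize(audio_emb)
    scores = [sum(x * y for x, y in zip(a, l2_normalize(t))) for t in text_embs]
    return max(range(len(scores)), key=scores.__getitem__)


def accuracy(audio_embs, labels, text_embs):
    """Fraction of clips whose predicted class index equals the label."""
    hits = sum(zero_shot_predict(a, text_embs) == y
               for a, y in zip(audio_embs, labels))
    return hits / len(labels)


# Toy 2-D embeddings standing in for real encoder outputs.
text_embs = [[1.0, 0.0], [10.0, 10.0]]
audio_embs = [[1.0, 0.1], [2.0, 2.1]]
print(build_prompts(["dog", "clock_tick"]))
print(accuracy(audio_embs, labels=[0, 1], text_embs=text_embs))
```

The toy text embeddings deliberately differ in magnitude. Without L2 normalization the larger vector would win the raw dot product for both clips, which is why normalization is part of the protocol.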

Per-model results

| Model | Rendering | Mill | Reference |
| --- | --- | --- | --- |
| `laion/clap-htsat-unfused` | Zero-shot (`acc`) | 91.55% ± 0.62 | ≈94.76% (MAEB); gap is the HF-weight-conversion drop (#126) |
| `Qwen/Qwen2-Audio-7B-Instruct` | Generative (`esc50_gen_acc`) | 46.55% ± 1.12 | Mill measurement (initial baseline) |
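The ± values are not defined on this page. They are numerically consistent with a binomial standard error of the accuracy over 2,000 clips, sqrt(p(1 - p) / n), expressed in percentage points. The check below rests on that assumption, and `binomial_stderr` is an illustrative helper rather than a Mill function.

```python
import math


def binomial_stderr(accuracy: float, n: int) -> float:
    """Standard error of a proportion: sqrt(p * (1 - p) / n)."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


N_CLIPS = 2000  # all ESC-50 clips are scored

for model, acc in [("laion/clap-htsat-unfused", 0.9155),
                   ("Qwen/Qwen2-Audio-7B-Instruct", 0.4655)]:
    se = binomial_stderr(acc, N_CLIPS)
    print(f"{model}: {acc * 100:.2f}% ± {se * 100:.2f}")
# laion/clap-htsat-unfused: 91.55% ± 0.62
# Qwen/Qwen2-Audio-7B-Instruct: 46.55% ± 1.12
```

Under this reading, the roughly 3-point gap between Mill's 91.55% and the published ≈94.76% is about five standard errors, so sampling noise alone is unlikely to explain it.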
