ESC-50 is a 50-class environmental sound classification benchmark: 2,000 five-second clips spanning animals, natural soundscapes, human non-speech, interior, and exterior/urban sounds. Mill runs two renderings of the same 2,000 clips, auto-selected per model:
  • esc50: a zero-shot rendering for CLAP-style audio-text encoders. Each clip is scored against the 50 class names by audio-text similarity, using the standard prompt "This is a sound of {c}."
  • esc50_generative: a generative rendering for audio-language models. The model hears the clip, is given the 50 candidate categories, and answers with a category name, which is matched back to a class.
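The zero-shot rendering can be sketched as follows. This is a minimal illustration, not Mill's implementation: it assumes the audio and text embeddings have already been produced by a CLAP-style encoder, and the function names and toy 2-D vectors are hypothetical. It shows only the scoring step: build one prompt per class, L2-normalize both sides, and pick the class with the highest cosine similarity.

```python
import math

PROMPT = "This is a sound of {c}."  # prompt template from the benchmark description


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def zero_shot_predict(audio_emb, text_embs):
    """Return the index of the class whose prompt embedding is most similar.

    audio_emb: embedding of one clip (assumed to come from the audio encoder).
    text_embs: one embedding per class prompt (assumed to come from the text encoder).
    """
    a = l2_normalize(audio_emb)
    sims = [sum(x * y for x, y in zip(a, l2_normalize(t))) for t in text_embs]
    return max(range(len(sims)), key=sims.__getitem__)


def accuracy(preds, labels):
    """Fraction of clips whose predicted class index equals the label."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


# Toy stand-ins: three classes and hand-made 2-D "embeddings" (hypothetical values).
class_names = ["dog", "rain", "siren"]
prompts = [PROMPT.format(c=c) for c in class_names]
text_embs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
audio_embs = [[2.0, 0.1], [0.2, 3.0], [-5.0, 0.5]]
labels = [0, 1, 2]

preds = [zero_shot_predict(a, text_embs) for a in audio_embs]
print(prompts[0])               # This is a sound of dog.
print(accuracy(preds, labels))  # 1.0
```

Because both sides are normalized, the dot product is a cosine similarity, so the magnitude of an embedding does not affect the ranking.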
Dataset: ashraq/esc50; benchmark by Piczak (2015). No train/test split is needed — the encoder isn’t trained, so all 2,000 clips are scored.
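For the generative rendering, the free-text answer has to be mapped back to one of the class names before it can be scored. The source does not describe Mill's matching rule, so the sketch below is only one plausible approach under stated assumptions: normalize case, underscores, and punctuation, try an exact match, then fall back to the longest class name that appears as a whole phrase in the answer. The function names and the four-class subset are hypothetical.

```python
import re


def normalize(text):
    """Lowercase, turn underscores and punctuation into spaces, collapse whitespace."""
    text = text.lower().replace("_", " ")
    text = re.sub(r"[^a-z0-9 ]+", " ", text)
    return " ".join(text.split())


def match_answer(answer, class_names):
    """Map a model answer to a class name, or None if nothing matches.

    Assumed rule (not taken from Mill): exact match first, then the longest
    class name found as a whole phrase inside the answer.
    """
    norm_answer = normalize(answer)
    by_norm = {normalize(c): c for c in class_names}
    if norm_answer in by_norm:
        return by_norm[norm_answer]
    hits = [n for n in by_norm if re.search(rf"\b{re.escape(n)}\b", norm_answer)]
    return by_norm[max(hits, key=len)] if hits else None


def generative_accuracy(answers, labels, class_names):
    """Unmatched answers (None) count as wrong."""
    correct = sum(match_answer(a, class_names) == y for a, y in zip(answers, labels))
    return correct / len(labels)


# Illustrative subset of class names; the real benchmark has 50.
classes = ["dog", "rain", "train", "sea_waves"]
answers = ["Sea waves.", "The sound is a dog barking", "A train passing", "I hear music"]
labels = ["sea_waves", "dog", "train", "rain"]

print([match_answer(a, classes) for a in answers])
# ['sea_waves', 'dog', 'train', None]
print(generative_accuracy(answers, labels, classes))  # 0.75
```

The word-boundary check matters for overlapping names: without it, an answer containing "train" would also match "rain".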

Evaluation configuration

| Hyperparameter | Value |
| --- | --- |
| Benchmark | esc50 (auto-picks esc50 for CLAP, esc50_generative for audio-LMs) |
| Dataset | ashraq/esc50 (train split, 2,000 clips, 50 classes) |
| n-shots | 0 |
| Task type | ZERO_SHOT_CLASSIFICATION (CLAP) / GENERATIVE_QA (audio-LM) |
| Prompt | This is a sound of {c}. (CLAP) / list-of-categories instruction (audio-LM) |
| Metric | acc / esc50_gen_acc |
| Backend | CLAP (clap) / HuggingFace audio-LM (hf) |

Reproduce

# CLAP zero-shot
mill --output_dir ./results eval mill/models/configs/clap/clap_htsat_unfused.py esc50
mill --output_dir ./results collect --metric acc

# Audio-language model (generative category naming)
mill --output_dir ./results eval mill/models/configs/qwen/qwen2_audio_7b.py esc50
mill --output_dir ./results collect --metric esc50_gen_acc

Results

CLAP — clap-htsat-unfused

91.55% ± 0.62
zero-shot audio-text similarity

Qwen2-Audio-7B-Instruct

46.55% ± 1.12
generative category naming
The published ESC-50 zero-shot accuracy for the laion/clap-htsat-unfused checkpoint is ≈94.76% (MAEB). This should not be confused with the 82.6% reported in the original CLAP paper, which is for a different (Microsoft) model. Mill measures 91.55% while following LAION-CLAP's official zero-shot protocol exactly: the same prompt, "This is a sound of {c}.", L2-normalized embeddings, and rand_trunc/repeatpad audio preprocessing. Prompt ensembling, int16 quantization, and lower precision were each tested; none closes the gap, and each slightly lowers accuracy. The residual is the documented accuracy drop of the HuggingFace-converted CLAP weights relative to the original laion_clap checkpoint (LAION-AI/CLAP #126), so it is intrinsic to the checkpoint Mill evaluates.

The generative rendering has no published baseline, so Qwen2-Audio's 46.55% is recorded as an initial baseline. Both renderings are 50-way classification (chance = 2%).

Per-model results

| Model | Rendering | Mill | Reference |
| --- | --- | --- | --- |
| laion/clap-htsat-unfused | Zero-shot (acc) | 91.55% ± 0.62 | ≈94.76% (MAEB); gap is the HF-weight-conversion drop (#126) |
| Qwen/Qwen2-Audio-7B-Instruct | Generative (esc50_gen_acc) | 46.55% ± 1.12 | Mill measurement (initial baseline) |
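The page does not say how the ± values are computed. As a sanity check, and only as an inference, they are consistent with a binomial standard error over the 2,000 scored clips, SE = sqrt(p(1 − p) / n). The short sketch below reproduces both figures under that assumption, along with the 50-way chance level; the function name is hypothetical.

```python
import math

N_CLIPS = 2000    # all ESC-50 clips are scored
N_CLASSES = 50


def binomial_se(acc, n):
    """Standard error of an accuracy estimated from n independent trials.

    Assumption: each clip is an independent Bernoulli trial. The source
    does not state that this is how Mill derives its ± values.
    """
    return math.sqrt(acc * (1.0 - acc) / n)


results = {
    "laion/clap-htsat-unfused": 0.9155,
    "Qwen/Qwen2-Audio-7B-Instruct": 0.4655,
}

for name, acc in results.items():
    se = binomial_se(acc, N_CLIPS)
    n_correct = round(acc * N_CLIPS)
    print(f"{name}: {acc * 100:.2f}% ± {se * 100:.2f} ({n_correct}/{N_CLIPS} correct)")
# laion/clap-htsat-unfused: 91.55% ± 0.62 (1831/2000 correct)
# Qwen/Qwen2-Audio-7B-Instruct: 46.55% ± 1.12 (931/2000 correct)

print(f"chance: {100 / N_CLASSES:.0f}%")  # chance: 2%
```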