UrbanSound8K is a 10-class urban sound classification benchmark: 8,732 short (≤4 s) field recordings spanning air conditioner, car horn, children playing, dog bark, drilling, engine idling, gun shot, jackhammer, siren, and street music. Mill runs two renderings of the same 8,732 clips, auto-selected per model:
  • urbansound8k: a zero-shot rendering for CLAP-style audio-text encoders. Each clip is scored against the 10 class names by audio-text similarity, using the standard prompt "This is a sound of {c}."
  • urbansound8k_generative: a generative rendering for audio-language models. The model hears the clip, is given the 10 candidate categories, and answers with a category name, which is matched back to a class.
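The zero-shot rendering can be illustrated with a minimal sketch. It assumes the audio and text embeddings have already been produced by a CLAP-style encoder; here they are replaced by toy vectors, and the helper names (`build_prompts`, `zero_shot_predict`) are hypothetical and not part of Mill.

```python
import math

# The 10 UrbanSound8K class names, in the order listed above.
CLASSES = [
    "air conditioner", "car horn", "children playing", "dog bark", "drilling",
    "engine idling", "gun shot", "jackhammer", "siren", "street music",
]

def build_prompts(class_names):
    """Fill the standard prompt template once per class."""
    return [f"This is a sound of {c}." for c in class_names]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def zero_shot_predict(audio_emb, text_embs):
    """Return the index of the prompt embedding most similar to the clip."""
    scores = [cosine(audio_emb, t) for t in text_embs]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy stand-ins for encoder outputs (assumption: a real run would embed the
# prompts with the text tower and the clip with the audio tower).
text_embs = [[1.0 if i == j else 0.0 for j in range(10)] for i in range(10)]
audio_emb = [0.1] * 10
audio_emb[3] = 0.9  # clip embedding closest to the "dog bark" prompt

pred = zero_shot_predict(audio_emb, text_embs)
print(build_prompts(CLASSES)[pred])  # This is a sound of dog bark.
```

Accuracy is then the fraction of clips whose predicted index equals the labelled class.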
Dataset: danavery/urbansound8K; benchmark by Salamon et al. (2014). UrbanSound8K ships as 10 folds for supervised cross-validation, but zero-shot needs no training split — the encoder isn’t trained, so all 8,732 clips are scored (matching how CLAP-family papers report US8K zero-shot).

Evaluation configuration

| Hyperparameter | Value |
| --- | --- |
| Benchmark | urbansound8k (auto-picks urbansound8k for CLAP, urbansound8k_generative for audio-LMs) |
| Dataset | danavery/urbansound8K (train, 8,732 clips, 10 classes) |
| n-shots | 0 |
| Task type | ZERO_SHOT_CLASSIFICATION (CLAP) / GENERATIVE_QA (audio-LM) |
| Prompt | This is a sound of {c}. (CLAP) / list-of-categories instruction (audio-LM) |
| Metric | acc / urbansound8k_gen_acc |
| Backend | CLAP (clap) / HuggingFace audio-LM (hf) |
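For the generative rendering, the free-text answer has to be mapped back to one of the 10 classes before accuracy can be computed. The matching rule Mill uses is not described here, so the following is only a sketch of one plausible approach: normalise the answer, accept it when exactly one class name appears in it, and count anything else as wrong. The functions `normalize`, `match_answer`, and `generative_accuracy` are hypothetical.

```python
import re

CLASSES = [
    "air conditioner", "car horn", "children playing", "dog bark", "drilling",
    "engine idling", "gun shot", "jackhammer", "siren", "street music",
]

def normalize(text):
    """Lowercase and collapse every non-alphanumeric run to a single space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()

def match_answer(answer, class_names):
    """Return the class index named in the answer, or None if zero or several match."""
    padded = f" {normalize(answer)} "
    hits = [i for i, c in enumerate(class_names) if f" {normalize(c)} " in padded]
    return hits[0] if len(hits) == 1 else None

def generative_accuracy(answers, labels, class_names):
    """Fraction of answers matched to the correct class (assumption: unmatched = wrong)."""
    correct = sum(match_answer(a, class_names) == y for a, y in zip(answers, labels))
    return correct / len(labels)

answers = ["Dog bark.", "It sounds like a car_horn", "a siren or a jackhammer"]
labels = [3, 1, 8]
acc = generative_accuracy(answers, labels, CLASSES)
print(f"{acc:.3f}")  # 0.667
```

A strict rule like this rejects ambiguous answers, and it also misses close variants such as "gunshot" for "gun shot"; a real matcher may well be more lenient.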

Reproduce

# CLAP zero-shot
mill --output_dir ./results eval mill/models/configs/clap/clap_htsat_unfused.py urbansound8k
mill --output_dir ./results collect --metric acc

# Audio-language model (generative category naming)
mill --output_dir ./results eval mill/models/configs/qwen/qwen2_audio_7b.py urbansound8k
mill --output_dir ./results collect --metric urbansound8k_gen_acc

Results

CLAP — clap-htsat-unfused

75.16% ± 0.46
zero-shot audio-text similarity
The published UrbanSound8K zero-shot accuracy for LAION-CLAP is ≈76–77% (LAION-CLAP), reported for the fused 630k-audioset checkpoint with the same prompt, "This is a sound of {c}." Mill evaluates the HuggingFace laion/clap-htsat-unfused checkpoint and measures 75.16%. As with ESC-50, the gap to the paper has two sources: the documented accuracy drop of the HuggingFace-converted CLAP weights relative to the original laion_clap checkpoint (LAION-AI/CLAP #126), and the paper’s use of the stronger fused checkpoint. All numbers are 10-way accuracy (chance = 10%).
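This section does not define the ± 0.46 term. As a sanity check, the sketch below assumes it is a binomial standard error over the 8,732 scored clips, which does reproduce the reported value; the function name is hypothetical.

```python
import math

def binomial_stderr_pct(acc_pct, n):
    """Standard error of an accuracy estimate, in percentage points."""
    p = acc_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)

N_CLIPS = 8732   # all clips are scored in the zero-shot rendering
ACC_PCT = 75.16  # accuracy reported above

se = binomial_stderr_pct(ACC_PCT, N_CLIPS)
implied_correct = round(ACC_PCT / 100.0 * N_CLIPS)

print(f"{se:.2f}")      # 0.46
print(implied_correct)  # 6563
```

Under that assumption, the low end of the published ≈76–77% range lies roughly two standard errors above Mill's measurement.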

Per-model results

| Model | Rendering | Mill | Reference |
| --- | --- | --- | --- |
| laion/clap-htsat-unfused | Zero-shot (acc) | 75.16% ± 0.46 | ≈76–77% (LAION-CLAP); gap is the fused-vs-unfused checkpoint + HF-weight-conversion drop (#126) |