Clotho-AQA is a crowdsourced audio question-answering benchmark: each 15–30 s sound clip is paired with a question whose answer is a single word (frequently yes/no). An audio-language model listens to the clip and answers in one word; the answer is graded by exact match after lowercasing and stripping punctuation. Mill ports the lmms-eval clotho_aqa_test setup: the clotho_aqa_test_filtered split, the single-word prompt, and case/punctuation-insensitive exact match. The dataset’s other rendering (clotho_asqa_test_v2) is scored by a GPT-4o judge and is intentionally not ported — exact match is reproducible without an external judge.
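
The grading rule described above (lowercase, strip punctuation, then compare for equality) can be sketched as follows. This is a minimal illustration, not Mill's actual implementation: the function names are hypothetical, and the choices to strip only ASCII punctuation and to collapse whitespace are assumptions.

```python
import string

# Translation table that deletes ASCII punctuation (assumption: the real
# metric may treat Unicode punctuation differently).
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize_answer(text: str) -> str:
    """Lowercase, remove ASCII punctuation, and collapse whitespace."""
    cleaned = text.lower().translate(_PUNCT_TABLE)
    return " ".join(cleaned.split())


def exact_match(prediction: str, reference: str) -> bool:
    """True when both strings are identical after normalization."""
    return normalize_answer(prediction) == normalize_answer(reference)


def exact_match_accuracy(predictions, references) -> float:
    """Fraction of predictions that exactly match their reference."""
    pairs = list(zip(predictions, references))
    if not pairs:
        return 0.0
    return sum(exact_match(p, r) for p, r in pairs) / len(pairs)
```

Under this sketch, "Yes." matches "yes", while a longer answer such as "yes it is" does not, which is why a single-word prompt matters for this metric.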

Evaluation configuration

| Hyperparameter | Value |
| --- | --- |
| Benchmark / task | clotho_aqa |
| Dataset | lmms-lab/ClothoAQA, config clotho_aqa (clotho_aqa_test_filtered, ~1,442 samples) |
| Prompt | <question> + "Answer the question using a single word only." |
| n-shots | 0 |
| Task type | GENERATIVE_QA |
| Metric | clotho_aqa_exact_match (case/punctuation-insensitive) |
| Max new tokens | 8 |
| Backend | HuggingFace audio-language model (hf) |
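
As a rough illustration of the prompt and generation settings listed above, the text prompt for one sample might be assembled like this. The helper name, the settings dictionary, and the single-space separator between the question and the instruction are assumptions; the page only specifies the instruction text, 0-shot, and 8 max new tokens.

```python
# Instruction appended to every question, as listed in the configuration.
SINGLE_WORD_INSTRUCTION = "Answer the question using a single word only."

# Hypothetical settings container; only these two values come from the page.
GENERATION_SETTINGS = {
    "max_new_tokens": 8,  # enough for a one-word answer
    "n_shots": 0,         # no in-context examples
}


def build_prompt(question: str) -> str:
    """Join the question and the single-word instruction.

    The separator (one space) is an assumption; the real task may use a
    newline or another delimiter.
    """
    return f"{question.strip()} {SINGLE_WORD_INSTRUCTION}"


print(build_prompt("Is a dog barking?"))
```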

Reproduce

mill --output_dir ./results eval \
  mill/models/configs/qwen/qwen2_audio_7b.py clotho_aqa

mill --output_dir ./results collect --metric clotho_aqa_exact_match
Qwen/Qwen2-Audio-7B-Instruct requires the audio extra (pip install librosa soundfile).

Results

Qwen2-Audio-7B-Instruct, 0-shot:

| Source | Score | Metric / note |
| --- | --- | --- |
| Mill | 73.44% ± 1.16 | exact match (strict) |
| lmms-eval 0.3 | 75.87% | GPT-Eval (LLM judge) |
| Difference | −2.43 pts | different metric |
lmms-eval’s 0.3 release reports 75.87% for Qwen2-Audio-Instruct on Clotho-AQA, but under a different metric. Although the release’s dataset table lists Clotho-AQA’s nominal metric as Accuracy, the 75.87% figure is published in its alignment-check table under a column headed GPT-Eval. That makes it a GPT-4o judge score, not the case/punctuation-insensitive exact-match accuracy that Mill computes on the 1,442-sample filtered test split. Exact match is generally stricter than an LLM judge, which can credit answers that differ from the reference in wording, so Mill’s 73.44% landing about 2.4 points lower is in the expected direction, and the two results are consistent with each other. Because the metrics differ, this is a cross-check rather than an identical-metric reproduction; Mill records its exact-match score as the baseline. For further context, the original Clotho-AQA paper reports ≈62.7% binary-answer accuracy for its supervised LSTM baseline on the full (unfiltered) set.
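
The page does not state how the ± 1.16 is derived. As a quick sanity check, the sketch below assumes it is a binomial standard error over the 1,442 samples (an assumption, not something the source confirms) and recomputes the gap between the two scores.

```python
import math

N_SAMPLES = 1442         # filtered test split size, from the page
MILL_ACCURACY = 73.44    # percent, exact match (Mill)
LMMS_EVAL_SCORE = 75.87  # percent, GPT-Eval (lmms-eval 0.3)


def binomial_standard_error_pct(accuracy_pct: float, n: int) -> float:
    """Standard error of a proportion, in percentage points.

    Assumes each sample is an independent pass/fail trial.
    """
    p = accuracy_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)


stderr_pct = binomial_standard_error_pct(MILL_ACCURACY, N_SAMPLES)
difference_pts = MILL_ACCURACY - LMMS_EVAL_SCORE
implied_correct = round(MILL_ACCURACY / 100.0 * N_SAMPLES)

print(f"standard error: {stderr_pct:.2f} pts")
print(f"difference: {difference_pts:+.2f} pts")
print(f"implied correct answers: {implied_correct} of {N_SAMPLES}")
```

Under that assumption the sketch gives a standard error of 1.16 points and a gap of −2.43 points, matching the reported figures, and 73.44% corresponds to 1,059 correct answers out of 1,442.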

Per-model results

| Model | Mill (clotho_aqa_exact_match) | Reported (lmms-eval 0.3, GPT-Eval) | Source |
| --- | --- | --- | --- |
| Qwen/Qwen2-Audio-7B-Instruct | 73.44% ± 1.16 | 75.87% | lmms-eval 0.3 |