Clotho-AQA - Mill

Clotho-AQA is a crowdsourced audio question-answering benchmark: each 15–30 s sound clip is paired with a question whose answer is a single word (frequently yes/no). An audio-language model listens to the clip and answers in one word; the answer is graded by exact match after lowercasing and stripping punctuation. Mill ports the lmms-eval `clotho_aqa_test` setup: the `clotho_aqa_test_filtered` split, the single-word prompt, and case/punctuation-insensitive exact match. The dataset’s other rendering (`clotho_asqa_test_v2`) is scored by a GPT-4o judge and is intentionally not ported, because exact match is reproducible without an external judge.
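The grading rule described above can be sketched in a few lines of Python. This is a minimal illustration, not Mill's actual `clotho_aqa_exact_match` implementation: the function names are hypothetical, and it assumes that "stripping punctuation" means removing ASCII punctuation and that surrounding whitespace is trimmed.

```python
import string

# Translation table that deletes every ASCII punctuation character.
# Assumption: the real metric may treat Unicode punctuation differently.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize_answer(text: str) -> str:
    """Lowercase, remove ASCII punctuation, and trim surrounding whitespace."""
    return text.lower().translate(_PUNCT_TABLE).strip()


def exact_match(prediction: str, reference: str) -> bool:
    """True when the two answers are identical after normalization."""
    return normalize_answer(prediction) == normalize_answer(reference)


def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that exactly match their reference answer."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(exact_match(p, r) for p, r in zip(predictions, references))
    return hits / len(predictions)


# Two of the three toy answers match after normalization.
accuracy = exact_match_accuracy(["Yes.", "dog", "Rain!"], ["yes", "cat", "rain"])
print(f"{accuracy:.3f}")  # 0.667
```

Under this rule an answer such as "Yes." counts as correct against "yes", while a synonym or a multi-word answer does not, which is what makes the metric reproducible without a judge model.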

Evaluation configuration

Hyperparameter	Value
Benchmark / task	`clotho_aqa`
Dataset	`lmms-lab/ClothoAQA`, config `clotho_aqa` (`clotho_aqa_test_filtered`, ~1,442 samples)
Prompt	`<question>` + `"Answer the question using a single word only."`
n-shots	`0`
Task type	`GENERATIVE_QA`
Metric	`clotho_aqa_exact_match` (case/punctuation-insensitive)
Max new tokens	`8`
Backend	HuggingFace audio-language model (`hf`)

Reproduce

mill --output_dir ./results eval \
  mill/models/configs/qwen/qwen2_audio_7b.py clotho_aqa

mill --output_dir ./results collect --metric clotho_aqa_exact_match

`Qwen/Qwen2-Audio-7B-Instruct` requires the audio extra (`pip install librosa soundfile`).

Results

Qwen2-Audio-7B-Instruct, 0-shot:

Source	Score	Metric
Mill	73.44% ± 1.16	exact match (strict)
lmms-eval 0.3	75.87%	GPT-Eval (LLM judge)
Difference	−2.43 pts	different metric

lmms-eval’s 0.3 release reports 75.87% for Qwen2-Audio-Instruct on Clotho-AQA, but on a different metric. The release’s dataset table lists Clotho-AQA’s nominal metric as Accuracy, yet the 75.87% itself is published in its alignment-check table under a column headed GPT-Eval. It is therefore a GPT-4o judge score, not the case/punctuation-insensitive exact-match accuracy that Mill computes on the 1,442-sample filtered test split. Exact match is the stricter of the two, since a judge can credit a synonym or a multi-word answer that exact match rejects, so Mill’s 73.44% landing about 2.4 points lower is the expected direction and the two scores are consistent with each other.

Because the metrics differ, this is a cross-check rather than an identical-metric reproduction; Mill records its exact-match score as the baseline. For further context, the original Clotho-AQA paper reports ≈62.7% binary-answer accuracy for its supervised LSTM baseline on the full (unfiltered) set.
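The reported figures can be sanity-checked with a short calculation. The sketch below assumes, which this page does not state, that the ± 1.16 is a binomial standard error over the 1,442 filtered samples; under that assumption the numbers reproduce. The helper name is hypothetical.

```python
import math


def binomial_standard_error(accuracy: float, n_samples: int) -> float:
    """Standard error of a proportion: sqrt(p * (1 - p) / n)."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_samples)


mill_accuracy = 0.7344    # Mill, exact match
lmms_eval_score = 0.7587  # lmms-eval 0.3, GPT-Eval (a different metric)
n_samples = 1442          # size of the filtered test split

stderr = binomial_standard_error(mill_accuracy, n_samples)
gap = mill_accuracy - lmms_eval_score

print(f"standard error: {100 * stderr:.2f} pts")  # standard error: 1.16 pts
print(f"gap: {100 * gap:+.2f} pts")               # gap: -2.43 pts
```

The gap is only descriptive here: because the two scores come from different metrics, it should not be read as a significance test.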

Per-model results

Model	Mill (`clotho_aqa_exact_match`)	Reported (lmms-eval 0.3, GPT-Eval)	Source
`Qwen/Qwen2-Audio-7B-Instruct`	73.44% ± 1.16	75.87%	lmms-eval 0.3
