clotho_aqa_test setup: the clotho_aqa_test_filtered split, the single-word prompt, and case/punctuation-insensitive exact match. The dataset’s other rendering (clotho_asqa_test_v2) is scored by a GPT-4o judge and is intentionally not ported — exact match is reproducible without an external judge.
Evaluation configuration
| Hyperparameter | Value |
|---|---|
| Benchmark / task | clotho_aqa |
| Dataset | lmms-lab/ClothoAQA, config clotho_aqa (clotho_aqa_test_filtered, ~1,442 samples) |
| Prompt | <question> + "Answer the question using a single word only." |
| n-shots | 0 |
| Task type | GENERATIVE_QA |
| Metric | clotho_aqa_exact_match (case/punctuation-insensitive) |
| Max new tokens | 8 |
| Backend | HuggingFace audio-language model (hf) |
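As an illustration of the prompt and metric rows above, here is a minimal sketch of how the single-word prompt and the case/punctuation-insensitive exact match could be implemented. The function names (`build_prompt`, `normalize`, `exact_match`, `accuracy`) are hypothetical and are not Mill's actual API. The single-space separator between question and instruction is an assumption, as is using ASCII `string.punctuation` for stripping.

```python
import string

# Instruction suffix taken from the configuration table.
PROMPT_SUFFIX = "Answer the question using a single word only."

# Translation table that deletes ASCII punctuation (assumed punctuation set).
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def build_prompt(question: str) -> str:
    """Append the single-word instruction; the space separator is an assumption."""
    return f"{question.strip()} {PROMPT_SUFFIX}"


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse runs of whitespace."""
    lowered = text.strip().lower()
    no_punct = lowered.translate(_PUNCT_TABLE)
    return " ".join(no_punct.split())


def exact_match(prediction: str, reference: str) -> bool:
    """True when the normalized prediction equals the normalized reference."""
    return normalize(prediction) == normalize(reference)


def accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of prediction/reference pairs that match exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(exact_match(p, r) for p, r in zip(predictions, references))
    return hits / len(predictions)
```

Under this sketch, `"Yes."` matches `"yes"`, while a longer answer such as `"Yes it is"` does not. That is why a low token budget and a single-word prompt matter for an exact-match metric.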
Reproduce
Qwen/Qwen2-Audio-7B-Instruct requires the audio extra (pip install librosa soundfile).
Results
Qwen2-Audio-7B-Instruct, 0-shot:
| Source | Score | Metric |
|---|---|---|
| Mill | 73.44% ± 1.16 | exact match (strict) |
| lmms-eval 0.3 | 75.87% | GPT-Eval (LLM judge) |
| Difference | −2.43 pts | different metric |
lmms-eval’s 0.3 release reports 75.87% for Qwen2-Audio-Instruct on Clotho-AQA. Note the metric: although the release’s dataset table lists Clotho-AQA’s nominal metric as Accuracy, the 75.87% figure is published in its alignment-check table under a column headed GPT-Eval. It is therefore a GPT-4o judge score, not the case/punctuation-insensitive exact-match accuracy that Mill computes on the 1,442-sample filtered test split. Exact match is a stricter criterion than an LLM judge, so Mill’s 73.44% landing about 2.4 points lower is the expected direction, and the two figures are consistent with each other. Because the metrics differ, this is a cross-check rather than an identical-metric reproduction; Mill records its exact-match score as the baseline. For further context, the original Clotho-AQA paper reports ≈62.7% binary-answer accuracy for its supervised LSTM baseline on the full (unfiltered) set.

Per-model results
| Model | Mill (clotho_aqa_exact_match) | Reported (lmms-eval 0.3, GPT-Eval) | Source |
|---|---|---|---|
| Qwen/Qwen2-Audio-7B-Instruct | 73.44% ± 1.16 | 75.87% | lmms-eval 0.3 |
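The ± 1.16 attached to Mill’s score is consistent with a binomial standard error over 1,442 samples. The sketch below checks that arithmetic and the −2.43 point gap. It assumes the uncertainty is a plain binomial standard error, sqrt(p(1 − p)/n); the source does not state how Mill derives it, and the helper name `binomial_stderr` is hypothetical.

```python
import math


def binomial_stderr(acc: float, n: int) -> float:
    """Standard error of a proportion: sqrt(p * (1 - p) / n)."""
    if n <= 0:
        raise ValueError("n must be positive")
    return math.sqrt(acc * (1.0 - acc) / n)


# Figures from the results table; the sample count is the filtered test size.
mill_pct = 73.44
lmms_eval_pct = 75.87
n_samples = 1442

stderr_pct = 100.0 * binomial_stderr(mill_pct / 100.0, n_samples)
gap_pts = mill_pct - lmms_eval_pct

print(f"standard error: {stderr_pct:.2f} pts")  # 1.16
print(f"gap: {gap_pts:.2f} pts")                # -2.43
```

Under that assumption the gap is roughly two standard errors, but because the two scores use different metrics it should not be read as a significance test.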