# Clotho-AQA

> Clotho-AQA single-word audio question answering reproduced with Mill.

Clotho-AQA is a crowdsourced **audio** question-answering benchmark: each 15–30 s sound clip is paired with a question whose answer is a single word (frequently yes/no). An audio-language model listens to the clip and answers in one word; the answer is graded by **exact match** after lowercasing and stripping punctuation.

Mill ports the lmms-eval [`clotho_aqa_test`](https://github.com/EvolvingLMMs-Lab/lmms-eval/tree/main/lmms_eval/tasks/clotho_aqa) setup: the `clotho_aqa_test_filtered` split, the single-word prompt, and case/punctuation-insensitive exact match. The dataset's other rendering (`clotho_asqa_test_v2`) is scored by a GPT-4o judge and is intentionally not ported, because exact match can be reproduced without an external judge.
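
The grading rule described above can be sketched in a few lines of Python. This is a minimal illustration, not Mill's actual implementation: the helper names `normalize` and `exact_match` are hypothetical, and it assumes "punctuation" means ASCII punctuation and that surrounding whitespace is also trimmed.

```python
import string

# Translation table that deletes every ASCII punctuation character.
# Assumption: the real metric may define "punctuation" differently.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(answer: str) -> str:
    """Lowercase, drop ASCII punctuation, and trim surrounding whitespace."""
    return answer.lower().translate(_PUNCT_TABLE).strip()


def exact_match(prediction: str, reference: str) -> bool:
    """Return True when both strings are identical after normalization."""
    return normalize(prediction) == normalize(reference)


print(exact_match("Yes.", "yes"))   # True: case and punctuation are ignored
print(exact_match("A dog", "dog"))  # False: extra words still count as a miss
```

The second call shows why this metric is strict: an answer that is semantically right but not a bare single word is scored as wrong.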

## Evaluation configuration

| Hyperparameter   | Value                                                                                   |
| ---------------- | --------------------------------------------------------------------------------------- |
| Benchmark / task | `clotho_aqa`                                                                            |
| Dataset          | `lmms-lab/ClothoAQA`, config `clotho_aqa` (`clotho_aqa_test_filtered`, 1,442 samples)   |
| Prompt           | `<question>` + `"Answer the question using a single word only."`                        |
| n-shots          | `0`                                                                                     |
| Task type        | `GENERATIVE_QA`                                                                         |
| Metric           | `clotho_aqa_exact_match` (case/punctuation-insensitive)                                 |
| Max new tokens   | `8`                                                                                     |
| Backend          | HuggingFace audio-language model (`hf`)                                                 |
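
As a rough illustration of the prompt row in the table above, the text sent alongside each clip could be assembled like this. The function name `build_prompt` is hypothetical, and the separator between the question and the instruction is an assumption, since the table only says the two parts are joined.

```python
# Instruction suffix taken verbatim from the configuration table.
SINGLE_WORD_SUFFIX = "Answer the question using a single word only."


def build_prompt(question: str, separator: str = "\n") -> str:
    """Join the dataset question and the single-word instruction.

    Assumption: the separator (a newline here) is illustrative; the real
    task template may use a space or something else.
    """
    return f"{question.strip()}{separator}{SINGLE_WORD_SUFFIX}"


print(build_prompt("Is it raining?"))
```

With zero shots there are no in-context examples, so this string (plus the audio clip) is the whole text input, and the 8-token generation cap leaves room for little more than a one-word answer.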

## Reproduce

```bash
mill --output_dir ./results eval \
  mill/models/configs/qwen/qwen2_audio_7b.py clotho_aqa

mill --output_dir ./results collect --metric clotho_aqa_exact_match
```

Running `Qwen/Qwen2-Audio-7B-Instruct` requires the extra audio dependencies `librosa` and `soundfile` (`pip install librosa soundfile`).

## Results

`Qwen2-Audio-7B-Instruct`, 0-shot:

<CardGroup cols={3}>
  <Card title="Mill" icon="gauge">
    **73.44%** ± 1.16
    <br />exact match (strict)
  </Card>

  <Card title="lmms-eval 0.3" icon="book">
    **75.87%**
    <br />GPT-Eval (LLM judge)
  </Card>

  <Card title="Difference" icon="scale-balanced">
    **−2.43 pts**
    <br />different metric
  </Card>
</CardGroup>

<Note>
  lmms-eval's [0.3 release](https://github.com/EvolvingLMMs-Lab/lmms-eval/blob/main/docs/releases/lmms-eval-0.3.md) reports **75.87%** for Qwen2-Audio-Instruct on Clotho-AQA. The release's dataset table lists Clotho-AQA's nominal metric as *Accuracy*, but the **75.87%** figure appears in its alignment-check table under a column headed **`GPT-Eval`**, so it is a **GPT-4o judge** score rather than the case/punctuation-insensitive exact-match accuracy that Mill computes on the 1,442-sample filtered test split. Exact match is the stricter of the two metrics (an LLM judge can accept synonyms or answers with extra words), so Mill's **73.44%** landing \~2.4 points lower is in the expected direction, and the two results are consistent with each other. Because the metrics differ, this is a cross-check rather than an identical-metric reproduction; Mill records its exact-match score as the baseline.

  For further context, the original [Clotho-AQA paper](https://arxiv.org/abs/2204.09634) reports ≈62.7% binary-answer accuracy for its supervised LSTM baseline on the full (unfiltered) set.
</Note>
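
The arithmetic behind the cards above can be checked with a short script. It assumes the `± 1.16` figure is a binomial standard error on the 1,442-sample split; the page does not state how Mill computes it, so treat this as a plausibility check rather than a description of Mill's internals. The helper name `binomial_stderr` is hypothetical.

```python
import math


def binomial_stderr(accuracy: float, n: int) -> float:
    """Standard error of a proportion measured on n independent samples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


N_SAMPLES = 1442       # filtered test split size
MILL_ACC = 0.7344      # Mill exact match
REPORTED_ACC = 0.7587  # lmms-eval 0.3, GPT-Eval column

stderr_pts = 100 * binomial_stderr(MILL_ACC, N_SAMPLES)
gap_pts = 100 * (MILL_ACC - REPORTED_ACC)

print(f"stderr: {stderr_pts:.2f} pts")  # stderr: 1.16 pts
print(f"gap:    {gap_pts:+.2f} pts")    # gap:    -2.43 pts
```

Under that assumption the script reproduces both the `± 1.16` and the `−2.43 pts` values shown on this page.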

### Per-model results

| Model                          | Mill (`clotho_aqa_exact_match`) | Reported (lmms-eval 0.3, GPT-Eval) | Source                                                                                                  |
| ------------------------------ | ------------------------------- | ---------------------------------- | ------------------------------------------------------------------------------------------------------- |
| `Qwen/Qwen2-Audio-7B-Instruct` | **73.44%** ± 1.16               | 75.87%                             | [lmms-eval 0.3](https://github.com/EvolvingLMMs-Lab/lmms-eval/blob/main/docs/releases/lmms-eval-0.3.md) |
