The skill lives in the repo at `.claude/skills/adding-a-benchmark/`. In Claude Code, type `/adding-a-benchmark`; you can also read it as plain Markdown.

adding-a-benchmark skill: the full, ordered checklist for porting a benchmark into Mill.
The process
- Find the source of truth: locate the benchmark's repo and paper, and read how its official grader computes the metric (extraction logic, tie-handling, prompt, n-shot, split). Record the published baseline you'll validate against.
- Write the task: `mill/tasks/<name>/task.py` exporting `TASKS_TABLE` (and `BENCHMARKS_TABLE`), with a prompt function, the right `task_type`, and a metric that mirrors the upstream grader. See the Tasks reference.
- Validate: run it and compare Mill's score to the published number; a gap means the prompt, split, or grader differs, not that the number needs "fixing".
- Document: add a reproducibility page, an overview card, a tasks-table row, the `docs.json` nav entry, and a changelog bullet.
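The Validate step can be sketched as a small comparison helper. This is only an illustration, not part of Mill: `BaselineCheck`, `check_against_baseline`, and the default tolerance of one percentage point are hypothetical names and values chosen for the example. Both scores are assumed to be fractions in the range 0 to 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BaselineCheck:
    """Comparison of a measured score with a published baseline (hypothetical helper)."""

    measured: float   # score from your run, as a fraction in [0, 1]
    published: float  # published baseline, as a fraction in [0, 1]
    tolerance: float  # allowed absolute gap, as a fraction (assumed value)

    @property
    def gap(self) -> float:
        # Positive when the measured score is above the published one.
        return self.measured - self.published

    @property
    def within_tolerance(self) -> bool:
        return abs(self.gap) <= self.tolerance


def check_against_baseline(
    measured: float, published: float, tolerance: float = 0.01
) -> BaselineCheck:
    """Build a BaselineCheck, rejecting values that are not fractions."""
    for name, value in (("measured", measured), ("published", published)):
        if not 0.0 <= value <= 1.0:
            # Catches percentages (e.g. 71.2) passed where fractions are expected.
            raise ValueError(f"{name} must be a fraction in [0, 1], got {value}")
    return BaselineCheck(measured, published, tolerance)
```

If `within_tolerance` is false, the useful next step is to diff the prompt, split, and grader against upstream rather than to adjust the expected number.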
Documentation
Docs are written as MDX under `docs/` and published to pymill.com. Edit the relevant `.mdx` files and add new pages to `docs.json`; a maintainer publishes the live site. Keep the reproducibility numbers in sync with `aggregate.csv`: values are stored as fractions there and shown as percentages in the docs.
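The fraction-to-percentage conversion is easy to get wrong by hand, so a short script can produce the display values. The sketch below is a minimal illustration that assumes `aggregate.csv` has a header row; the column names `task` and `score` are hypothetical and should be replaced with whatever the real file uses.

```python
import csv
import io


def fraction_to_percent(value: str, decimals: int = 1) -> str:
    """Format a fraction stored as text (for example "0.5") as a percentage string."""
    return f"{float(value) * 100:.{decimals}f}%"


def percent_rows(
    csv_text: str, name_column: str = "task", score_column: str = "score"
) -> dict[str, str]:
    """Map each row's name to its score rendered as a percentage.

    The column names are assumptions for this sketch, not the real schema.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[name_column]: fraction_to_percent(row[score_column]) for row in reader}
```

To use it on the real file, read the text with `pathlib.Path("aggregate.csv").read_text()` and pass it to `percent_rows`, then compare the result with the numbers on the reproducibility page.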