Prerequisites
- Python 3.10+
- A GPU (for local HF/vLLM models) or an API key (for LiteLLM backends)
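The prerequisites above can be checked before installing. The following is a minimal sketch, not part of the tool itself: it assumes an NVIDIA GPU is detectable through `nvidia-smi` on the PATH, and that an API backend would be configured through environment variables such as `OPENAI_API_KEY` or `ANTHROPIC_API_KEY` (common provider variables; substitute whichever your backend expects). The function names are hypothetical.

```python
import os
import shutil
import sys

# Assumption: common provider key variables; adjust to the backend you use.
API_KEY_VARS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY")


def python_ok(version=None, minimum=(3, 10)):
    """Return True if the interpreter version meets the stated minimum."""
    version = sys.version_info if version is None else version
    return tuple(version[:2]) >= minimum


def has_gpu_tooling():
    """Heuristic only: True if nvidia-smi is on PATH (NVIDIA GPUs)."""
    return shutil.which("nvidia-smi") is not None


def has_api_key(env=None, names=API_KEY_VARS):
    """Return True if any of the named variables is set and non-empty."""
    env = os.environ if env is None else env
    return any(env.get(name) for name in names)


def check_prerequisites(version=None, env=None, gpu=None):
    """Return a list of unmet prerequisites (empty if all are met)."""
    gpu = has_gpu_tooling() if gpu is None else gpu
    problems = []
    if not python_ok(version):
        problems.append("Python 3.10 or newer is required")
    if not (gpu or has_api_key(env)):
        problems.append("need a GPU (local models) or an API key (API backends)")
    return problems


if __name__ == "__main__":
    issues = check_prerequisites()
    print("OK" if not issues else "\n".join(issues))
```

The GPU check is deliberately loose: finding `nvidia-smi` does not prove a usable device, and non-NVIDIA accelerators are not detected.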
1. Install
Install straight from GitHub, no clone required:
2. Run a text evaluation
Results are written to ./results/ when done.
3. View results
The collect command renders a table of scores in your terminal. Pass --metric to choose which metric to show; for MMLU, the reported metric is acc.
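To illustrate the idea of picking one metric and tabulating it per task, here is a rough sketch in Python. It is not the collect command and does not reflect the tool's actual output format: it assumes, purely for illustration, that each JSON file under the results directory looks like `{"task": "mmlu", "metrics": {"acc": 0.71}}`. The function names `load_scores` and `render_table` are hypothetical.

```python
import json
from pathlib import Path


def load_scores(results_dir, metric="acc"):
    """Collect (task, score) pairs from JSON files in results_dir.

    Hypothetical file layout: {"task": "mmlu", "metrics": {"acc": 0.71}}
    Files that do not report the requested metric are skipped.
    """
    rows = []
    for path in sorted(Path(results_dir).glob("*.json")):
        record = json.loads(path.read_text(encoding="utf-8"))
        score = record.get("metrics", {}).get(metric)
        if score is not None:
            rows.append((record.get("task", path.stem), float(score)))
    return rows


def render_table(rows, metric="acc"):
    """Format the rows as a plain-text table with a header and a rule."""
    width = max([len("task")] + [len(task) for task, _ in rows])
    lines = [
        f"{'task':<{width}}  {metric}",
        f"{'-' * width}  {'-' * len(metric)}",
    ]
    lines += [f"{task:<{width}}  {score:.4f}" for task, score in rows]
    return "\n".join(lines)
```

Calling `print(render_table(load_scores("./results", metric="acc"), metric="acc"))` would print one row per task that reports the chosen metric, under the layout assumption stated above.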
4. Browse available tasks
Use ↑ ↓ to navigate, Tab to switch between Benchmarks and Tasks, and Enter to copy a task name to your clipboard.
Next steps
- Text evaluation guide: few-shot, custom metrics, and n-shot sweeps.
- Vision evaluation guide: multimodal models with image/video inputs.
- Distributed scheduling: scale across a SLURM cluster.
- CLI reference: full flag documentation.