A task's OutputType value determines how the model is queried and what the response looks like.
GENERATIVE — generate_until
The model generates text autoregressively until it hits a stop token or max_new_tokens.
- Used for: open-ended generation — math reasoning (GSM8K), summarization, code, VQA.
- Response type: str
- Scoring: compare the generated string to a gold answer with a metric like exact_match, regex extraction, or LLM-judge.
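The scoring step for a generative task can be sketched as below. This is a minimal illustration, not Mill's implementation: `extract_final_number` and `exact_match` are hypothetical helper names, and the regex assumes a GSM8K-style output in which the last number in the generated text is the final answer.

```python
import re
from typing import Optional

# Hypothetical helpers for illustration only; not part of Mill's API.
# Matches integers and decimals, allowing thousands separators (e.g. 1,234.5).
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_final_number(generation: str) -> Optional[str]:
    """Return the last number found in the generated text, commas removed."""
    matches = _NUMBER.findall(generation)
    if not matches:
        return None
    return matches[-1].replace(",", "")


def exact_match(generation: str, gold: str) -> bool:
    """Compare the extracted answer to the gold answer as strings."""
    predicted = extract_final_number(generation)
    return predicted is not None and predicted == gold.strip()
```

Extraction followed by string comparison is the simplest of the three options; it fails closed (no number found counts as wrong), which is usually the safer default for a benchmark score.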
LOGPROBS — loglikelihood
The model scores a fixed string by computing the log-probability of that string given a context. No generation happens — you score each candidate and pick the highest.
- Used for: multiple-choice tasks — MMLU, ARC, HellaSwag.
- Response type: (float, bool), that is, (total_log_prob, is_greedy_correct). is_greedy_correct is True if the model’s argmax at every token position matched the continuation, enabling both log-prob accuracy and strict greedy accuracy from a single forward pass.
- Why prefer this over generation for MCQ: avoids formatting failures where a model answers “The answer is (B)” instead of “B”, making scores more reliable and reproducible.
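To make the (total_log_prob, is_greedy_correct) pair concrete, here is a small sketch of how it could be derived and how the highest-scoring candidate is picked. The function names and the input format (one dict of token id to log-prob per continuation position) are assumptions chosen for illustration; a real model would supply these distributions from a forward pass.

```python
from typing import Dict, List, Sequence, Tuple


def score_continuation(
    step_logprobs: Sequence[Dict[int, float]],
    continuation_ids: Sequence[int],
) -> Tuple[float, bool]:
    """Score one continuation.

    step_logprobs[i] maps token id -> log-prob at continuation position i.
    This dict format is a hypothetical stand-in for real model output.
    """
    total = 0.0
    is_greedy = True
    for dist, token_id in zip(step_logprobs, continuation_ids):
        total += dist[token_id]
        # Greedy-correct only if the target token is the argmax at every position.
        if max(dist, key=dist.get) != token_id:
            is_greedy = False
    return total, is_greedy


def pick_choice(scores: List[Tuple[float, bool]]) -> int:
    """Return the index of the candidate with the highest total log-prob."""
    return max(range(len(scores)), key=lambda i: scores[i][0])
```

Because every candidate is scored against the same context, the prediction is always one of the listed choices, which is what removes the formatting failures mentioned above.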
PERPLEXITY — loglikelihood_rolling
The model scores the entire string as a sequence, computing the average log-probability per token across all tokens (no context/continuation split).
- Used for: perplexity benchmarks — measuring how well a model predicts a corpus.
- Response type: float (average log-prob per token)
- Aggregation: `corpus_level_fn = lambda vals: 2 ** (-mean(vals))` converts the mean log-prob to perplexity. The base of 2 assumes the log-probs are measured in bits (log base 2).
- Lower perplexity = better: the model finds the text more likely.
- Typical datasets: WikiText-103, Penn Treebank, The Pile subsets.
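A short worked example of the aggregation, assuming the per-document values are average log-probs in base 2. The scores here are made up for illustration. If a model reports natural-log values instead, the equivalent conversion is `math.exp` of the negated mean, shown in the last lines.

```python
import math
from statistics import mean

# Aggregation from this section; assumes each value is an average log2-prob per token.
corpus_level_fn = lambda vals: 2 ** (-mean(vals))

# Hypothetical per-document scores (average log2-prob per token).
doc_scores = [-3.0, -5.0]

# The mean is -4.0, so the perplexity is 2 ** 4.0 = 16.0.
perplexity = corpus_level_fn(doc_scores)

# The same scores expressed in nats give the same perplexity via exp().
nat_scores = [s * math.log(2) for s in doc_scores]
perplexity_from_nats = math.exp(-mean(nat_scores))
```

A perplexity of 16 means the model is, on average, as uncertain as if it were choosing uniformly among 16 tokens at each position.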
Summary
| OutputType | Method | Generation? | Response | Typical tasks |
|---|---|---|---|---|
| GENERATIVE | generate_until | Yes | str | GSM8K, HumanEval, VQA, summarization |
| LOGPROBS | loglikelihood | No | (float, bool) | MMLU, ARC, HellaSwag, WinoGrande |
| PERPLEXITY | loglikelihood_rolling | No | float | WikiText, Penn Treebank, BPB benchmarks |
How Mill uses OutputType
Each MillTaskConfig declares one output_type. The evaluator groups instances by type and dispatches to the corresponding model method (generate_until, loglikelihood, or loglikelihood_rolling).
For LOGPROBS tasks, Mill creates one Instance per choice from the same Doc, then selects the choice with the highest log-prob as the model’s prediction before scoring metrics.
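The grouping and dispatch described above might look roughly like the sketch below. Only the OutputType member names and the three method names come from this page; the `Instance` fields, the `dispatch` and `expand_choices` functions, and the shape of the `model` object are assumptions made for illustration and may not match Mill's actual code.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum
from typing import Any, Dict, List


class OutputType(Enum):
    # Each value is the name of the model method used for that type.
    GENERATIVE = "generate_until"
    LOGPROBS = "loglikelihood"
    PERPLEXITY = "loglikelihood_rolling"


@dataclass
class Instance:
    # Hypothetical fields; Mill's real Instance may differ.
    doc_id: int
    output_type: OutputType
    context: str
    continuation: str = ""


def dispatch(instances: List[Instance], model: Any) -> Dict[OutputType, list]:
    """Group instances by output type, then call the matching model method per group."""
    grouped: Dict[OutputType, List[Instance]] = defaultdict(list)
    for inst in instances:
        grouped[inst.output_type].append(inst)
    return {
        otype: getattr(model, otype.value)(group)
        for otype, group in grouped.items()
    }


def expand_choices(doc_id: int, question: str, choices: List[str]) -> List[Instance]:
    """Create one LOGPROBS Instance per answer choice from a single doc."""
    return [Instance(doc_id, OutputType.LOGPROBS, question, c) for c in choices]
```

Batching by type means each model method is called once per group rather than once per instance, and the per-choice results for a doc can then be compared to select the highest log-prob.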