Mill tasks declare one of three OutputType values, which determines how the model is queried and what the response looks like.

GENERATIVE (generate_until)

The model generates text autoregressively until it hits a stop token or max_new_tokens.
Input:  "Question: What is 2+2?\nAnswer:"
Output: "4"                              ← model produces this
  • Used for: open-ended generation — math reasoning (GSM8K), summarization, code, VQA.
  • Response type: str
  • Scoring: compare the generated string to a gold answer with a metric like exact_match, regex extraction, or LLM-judge.
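The scoring step above can be sketched as regex extraction followed by exact match. This is a minimal illustration, not Mill's metric code: the helper names extract_final_number and exact_match are hypothetical, and it assumes the gold answer is a plain number.

```python
import re
from typing import Optional


def extract_final_number(text: str) -> Optional[str]:
    """Return the last number that appears in the text, or None if there is none."""
    # Drop thousands separators so "1,234" is read as "1234".
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return matches[-1] if matches else None


def exact_match(generated: str, gold: str) -> bool:
    """True if the extracted answer equals the gold answer (whitespace-stripped)."""
    predicted = extract_final_number(generated)
    return predicted is not None and predicted == gold.strip()


# Hypothetical model outputs for the prompt "Question: What is 2+2?\nAnswer:"
verbose_ok = exact_match("2 plus 2 equals 4", "4")
no_answer = exact_match("I am not sure.", "4")
```

Extracting the last number is a common heuristic for math tasks, since models tend to state the final answer at the end of their reasoning. It is still a heuristic and can misfire on outputs that end with an unrelated number.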

LOGPROBS (loglikelihood)

The model scores a fixed string by computing the log-probability of that string given a context. No generation happens — you score each candidate and pick the highest.
Context:     "Question: The capital of France is?\nAnswer:"
Candidate A: " Paris"   → log P = -0.3   ← highest → predicted answer
Candidate B: " London"  → log P = -4.1
Candidate C: " Berlin"  → log P = -3.8
Candidate D: " Madrid"  → log P = -5.2
  • Used for: multiple-choice tasks — MMLU, ARC, HellaSwag.
  • Response type: (float, bool), i.e. (total_log_prob, is_greedy_correct)
    • is_greedy_correct is True if the model’s argmax at every token position matched the continuation, enabling both log-prob accuracy and strict greedy accuracy from a single forward pass.
  • Why prefer this over generation for MCQ: avoids formatting failures where a model answers “The answer is (B)” instead of “B”, making scores more reliable and reproducible.
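To make the (total_log_prob, is_greedy_correct) pair concrete, here is a minimal sketch that derives both values from per-position log-prob distributions. The function score_continuation and its dictionary-based input are assumptions for illustration. A real backend would read these values from model logits rather than from Python dicts.

```python
from typing import Dict, List, Tuple


def score_continuation(
    step_logprobs: List[Dict[str, float]],
    continuation_tokens: List[str],
) -> Tuple[float, bool]:
    """Sum the log-probs of the continuation tokens and check greedy agreement.

    step_logprobs[i] maps candidate tokens to log-probs at position i,
    conditioned on the context plus the first i continuation tokens.
    """
    if len(step_logprobs) != len(continuation_tokens):
        raise ValueError("need one distribution per continuation token")
    total = 0.0
    is_greedy = True
    for dist, token in zip(step_logprobs, continuation_tokens):
        total += dist[token]
        # Greedy-correct only if the argmax token matches at every position.
        if max(dist, key=dist.get) != token:
            is_greedy = False
    return total, is_greedy


# Candidate-level totals taken from the example above.
candidate_scores = {" Paris": -0.3, " London": -4.1, " Berlin": -3.8, " Madrid": -5.2}
predicted = max(candidate_scores, key=candidate_scores.get)
```

Both values come out of the same loop over positions, which is why a single forward pass is enough to report them together.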

PERPLEXITY (loglikelihood_rolling)

The model scores the entire string as a sequence, computing the average log-probability per token across all tokens (no context/continuation split).
Input:  "The quick brown fox jumps over the lazy dog"
Output: -1.42  ← average log P per token across all tokens
  • Used for: perplexity benchmarks — measuring how well a model predicts a corpus.
  • Response type: float (average log-prob per token)
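  • Aggregation: corpus_level_fn = lambda vals: 2 ** (-mean(vals)) converts the mean per-token log-prob to perplexity. The exponent base must match the base of the log-probs: 2 ** (-mean) is correct for base-2 log-probs (where -mean is bits per token), while natural-log log-probs need exp(-mean).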
  • Lower perplexity = better: the model finds the text more likely.
  • Typical datasets: WikiText-103, Penn Treebank, The Pile subsets.
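The conversion from average log-prob to perplexity can be sketched as below. The function names are hypothetical, and the base parameter is an assumption added to make the log base explicit. The sketch defaults to natural log, which is what many model backends return, so check which base your backend uses.

```python
import math
from typing import List


def mean_logprob(token_logprobs: List[float]) -> float:
    """Average log-prob per token for a single string."""
    return sum(token_logprobs) / len(token_logprobs)


def perplexity(avg_logprobs: List[float], base: float = math.e) -> float:
    """Turn per-string average log-probs into one perplexity value.

    `base` must equal the base of the logarithm used for the log-probs.
    """
    mean = sum(avg_logprobs) / len(avg_logprobs)
    return base ** (-mean)


# A model that assigns probability 1/4 to every token has perplexity 4,
# regardless of the log base, as long as the base is used consistently.
uniform_nat = perplexity([mean_logprob([math.log(0.25)] * 9)])
uniform_bits = perplexity([mean_logprob([math.log2(0.25)] * 9)], base=2)
```

Note that averaging per-string averages weights every string equally. A token-weighted corpus perplexity would instead sum log-probs and token counts across strings before dividing.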

Summary

| OutputType | Method | Generation? | Response | Typical tasks |
|---|---|---|---|---|
| GENERATIVE | generate_until | Yes | str | GSM8K, HumanEval, VQA, summarization |
| LOGPROBS | loglikelihood | No | (float, bool) | MMLU, ARC, HellaSwag, WinoGrande |
| PERPLEXITY | loglikelihood_rolling | No | float | WikiText, Penn Treebank, BPB benchmarks |

How Mill uses OutputType

Each MillTaskConfig declares one output_type. The evaluator groups instances by type and dispatches to the corresponding model method:
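# mill/evaluator.py (simplified excerpt)
# Store responses per output type so a later branch does not overwrite an earlier one.
responses = {}

if OutputType.GENERATIVE in by_type:
    responses[OutputType.GENERATIVE] = model.generate_until(by_type[OutputType.GENERATIVE])

if OutputType.LOGPROBS in by_type:
    responses[OutputType.LOGPROBS] = model.loglikelihood(by_type[OutputType.LOGPROBS])

if OutputType.PERPLEXITY in by_type:
    responses[OutputType.PERPLEXITY] = model.loglikelihood_rolling(by_type[OutputType.PERPLEXITY])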
For LOGPROBS tasks, Mill creates one Instance per choice from the same Doc, then selects the choice with the highest log-prob as the model’s prediction before scoring metrics.
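The per-choice selection described above can be sketched as follows. ChoiceInstance and predict_choices are hypothetical stand-ins, not Mill's actual Instance class or evaluator code. The sketch assumes each instance carries the id of its source Doc and the index of the choice it scores.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class ChoiceInstance:
    """Hypothetical stand-in for one Instance: a single choice of a single Doc."""
    doc_id: int
    choice_idx: int


def predict_choices(
    instances: List[ChoiceInstance],
    responses: List[Tuple[float, bool]],
) -> Dict[int, int]:
    """Map each doc_id to the index of its highest log-prob choice."""
    by_doc: Dict[int, List[Tuple[float, int]]] = defaultdict(list)
    for inst, (logprob, _is_greedy) in zip(instances, responses):
        by_doc[inst.doc_id].append((logprob, inst.choice_idx))
    # On ties, max() keeps the first choice encountered.
    return {
        doc_id: max(scored, key=lambda pair: pair[0])[1]
        for doc_id, scored in by_doc.items()
    }


# Doc 0 uses the four log-probs from the France example; doc 1 is made up.
instances = [ChoiceInstance(0, i) for i in range(4)] + [ChoiceInstance(1, i) for i in range(2)]
responses = [(-0.3, True), (-4.1, False), (-3.8, False), (-5.2, False), (-2.0, False), (-1.0, True)]
predictions = predict_choices(instances, responses)
```

The resulting predictions can then be compared with each Doc's gold choice index to compute accuracy.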