Reference (reference)

Overview

reference is a lexical-overlap evaluation node that compares a model generation against a gold reference answer using ROUGE, BLEU, and METEOR.

The metric family comes from established summarization and machine translation (MT) evaluation work: BLEU (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005), and ROUGE (Lin, 2004). In practice, these metrics give fast, deterministic signals of word- and phrase-level overlap between candidate and reference text.

In nexa-gauge, reference computes five scores in [0,1]:

  • rouge1 (unigram overlap)
  • rouge2 (bigram overlap)
  • rougeL (longest common subsequence style overlap)
  • bleu
  • meteor
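
To make the overlap scores above concrete, here is a minimal pure-Python sketch of ROUGE-N (F1), ROUGE-L (F1), and a smoothed sentence BLEU. It is an illustration under stated assumptions, not the nexa-gauge implementation: the helper names (tokenize, rouge_n, rouge_l, bleu), the lowercase regex tokenizer, and the add-one smoothing are all choices made for this example, so its numbers will not necessarily match what the node reports. METEOR is omitted because it needs stemming and synonym matching.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Assumption: lowercase alphanumeric tokens; the real node may tokenize or stem differently.
    return re.findall(r"[a-z0-9]+", text.lower())


def ngrams(tokens: list[str], n: int) -> Counter:
    # Multiset of n-grams; empty when the text has fewer than n tokens.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(overlap: float, cand_total: int, ref_total: int) -> float:
    # Harmonic mean of precision (overlap / candidate) and recall (overlap / reference).
    if overlap == 0 or cand_total == 0 or ref_total == 0:
        return 0.0
    precision = overlap / cand_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(candidate: str, reference: str, n: int) -> float:
    cand = ngrams(tokenize(candidate), n)
    ref = ngrams(tokenize(reference), n)
    overlap = sum((cand & ref).values())  # clipped n-gram matches
    return f1(overlap, sum(cand.values()), sum(ref.values()))


def rouge_l(candidate: str, reference: str) -> float:
    a, b = tokenize(candidate), tokenize(reference)
    # Dynamic-programming LCS length, keeping only one row at a time.
    prev = [0] * (len(b) + 1)
    for tok in a:
        curr = [0]
        for j, other in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if tok == other else max(prev[j], curr[j - 1]))
        prev = curr
    return f1(prev[-1], len(a), len(b))


def bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    cand, ref = tokenize(candidate), tokenize(reference)
    if not cand or not ref:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        c, r = ngrams(cand, n), ngrams(ref, n)
        match, total = sum((c & r).values()), sum(c.values())
        if n == 1:
            if match == 0:
                return 0.0  # no shared words at all
            p = match / total
        else:
            # Assumption: add-one smoothing for higher orders, one common choice.
            # Orders longer than the candidate therefore count as p = 1.
            p = (match + 1) / (total + 1)
        log_precision += math.log(p) / max_n
    # Brevity penalty: penalize candidates that are not longer than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(log_precision)
```

All three functions return values in [0,1], and identical candidate and reference texts of at least four tokens score 1.0 on each.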

Unlike judge-model metrics, this node does not call an LLM and always reports zero cost. It is most useful as a baseline similarity signal, typically combined with semantic/judge-based metrics for fuller quality assessment.

Use Case

Use reference when you have trusted reference answers and want fast overlap-based quality checks.

  • Regression checks for answer fidelity against a gold target
  • Benchmark scoring where deterministic, low-latency metrics are needed
  • Sanity checking summarization/QA outputs before deeper judge-based evaluation
  • Comparing model variants with a consistent lexical baseline
  • Cost-sensitive pipelines that need non-LLM metrics
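
As an illustration of the regression-check use case, the sketch below compares per-metric scores from two runs and reports any metric that dropped by more than a tolerance. The function name find_regressions, the 0.02 default tolerance, and the plain name-to-score dictionaries are assumptions made for this example; nexa-gauge itself may offer a different mechanism.

```python
def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.02,  # hypothetical threshold; tune per project
) -> dict[str, float]:
    """Return metrics whose score fell by more than `tolerance`, mapped to the size of the drop."""
    drops: dict[str, float] = {}
    for name, old_score in baseline.items():
        new_score = current.get(name)
        if new_score is None:
            continue  # metric absent in the new run (e.g. the node skipped); handle separately
        drop = old_score - new_score
        if drop > tolerance:
            drops[name] = round(drop, 4)
    return drops


if __name__ == "__main__":
    baseline = {"rouge1": 0.70, "bleu": 0.37}
    current = {"rouge1": 0.71, "bleu": 0.30}
    print(find_regressions(baseline, current))
```

Because the scores are deterministic, any drop reflects a change in the generations rather than evaluation noise, which is what makes a fixed tolerance workable here.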

Node Overview

In nexa-gauge, reference is an answer metric node on this branch.

What the node does:

  • Reads normalized generation and reference text
  • Skips when reference is missing/blank (returns empty metrics, zero cost)
  • Computes ROUGE-1/2/L (F1), BLEU (smoothed sentence BLEU), and METEOR
  • Returns one MetricResult per metric
  • Returns zero-cost CostEstimate because no model calls are made
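
The steps above can be sketched roughly as follows. This is a hedged illustration, not the contents of reference.py: the dataclasses are stand-ins whose fields mirror the JSON output documented on this page, run_reference and the scorers mapping are hypothetical names, and the per-metric try/except is an assumption about how a metric-level error gets recorded.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class MetricResult:  # stand-in; fields mirror the documented JSON output
    name: str
    category: str = "answer"
    score: Optional[float] = None
    result: Optional[dict] = None
    error: Optional[str] = None


@dataclass
class CostEstimate:  # stand-in; always zero cost because no model is called
    cost: float = 0.0
    input_tokens: Optional[int] = None
    output_tokens: Optional[int] = None


@dataclass
class ReferenceMetrics:  # stand-in for the node's output container
    metrics: list[MetricResult] = field(default_factory=list)
    cost: CostEstimate = field(default_factory=CostEstimate)


Scorer = Callable[[str, str], float]  # (generation, reference) -> score in [0,1]


def run_reference(
    generation: str,
    reference: Optional[str],
    scorers: dict[str, Scorer],  # hypothetical: metric name -> scoring function
) -> ReferenceMetrics:
    # Skip when the reference is missing or blank: empty metrics, zero cost.
    if reference is None or not reference.strip():
        return ReferenceMetrics()

    results: list[MetricResult] = []
    for name, scorer in scorers.items():
        try:
            results.append(MetricResult(name=name, score=scorer(generation, reference)))
        except Exception as exc:  # assumption: one failing metric does not abort the others
            results.append(MetricResult(name=name, error=str(exc)))
    return ReferenceMetrics(metrics=results)
```

Passing the scorers in as a mapping keeps the sketch independent of any particular ROUGE, BLEU, or METEOR library.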

Execution Flow

[Diagram: execution-flow graph for the reference node]

Input

Sample input:

json
{
  "case_id": "bitcoin-economics-medium",
  "question": "What is Bitcoin and how does it work as a currency?",
  "generation": "Bitcoin is a decentralised digital currency created in 2009 by the pseudonymous Satoshi Nakamoto. Unlike traditional currencies issued by central banks, Bitcoin operates on a peer-to-peer network with no central authority. ....",
  "reference": "Bitcoin is a decentralised digital currency launched in 2009, using blockchain technology and proof-of-work mining to verify transactions without a central authority. Its supply is capped at 21 million coins."
}

Fields used by the reference node:

  • generation: candidate text to score
  • reference: target text to compare against

Fields not used for scoring in this node:

  • question
  • case_id (used for report identity, not metric computation)

Output

Primary output type for this node is ReferenceMetrics (nexa_gauge_core/types.py).

  • metrics: list[MetricResult]
  • cost: CostEstimate

Note: RelevanceMetrics has a similar outer shape, but reference.py returns ReferenceMetrics.

Example output:

json
{
  "metrics": [
    {
      "name": "rouge1",
      "category": "answer",
      "score": 0.7063,
      "result": null,
      "error": null
    },
    {
      "name": "rouge2",
      "category": "answer",
      "score": 0.4921,
      "result": null,
      "error": null
    },
    {
      "name": "rougeL",
      "category": "answer",
      "score": 0.6554,
      "result": null,
      "error": null
    },
    {
      "name": "bleu",
      "category": "answer",
      "score": 0.3712,
      "result": null,
      "error": null
    },
    {
      "name": "meteor",
      "category": "answer",
      "score": 0.5987,
      "result": null,
      "error": null
    }
  ],
  "cost": {
    "cost": 0.0,
    "input_tokens": null,
    "output_tokens": null
  }
}

Attribute meaning:

  • metrics: metric results produced by this node (five when reference is present)
  • name: metric identifier (rouge1, rouge2, rougeL, bleu, meteor)
  • category: answer
  • score: metric value in [0,1] (higher means more overlap with the reference)
  • result: unused for these lexical metrics (null)
  • error: null on success; populated only if a metric-level failure occurs
  • cost.cost: always 0.0
  • cost.input_tokens, cost.output_tokens: always null (no LLM usage)
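
For downstream use, a payload with the shape described above can be flattened into a name-to-score mapping. The sketch below is a minimal example that assumes only the field names shown on this page; scores_by_name is a hypothetical helper, and the location of the JSON on disk is left to the caller.

```python
import json


def scores_by_name(payload: dict) -> dict[str, float]:
    """Map metric name to score, skipping metrics that failed or carry no score."""
    return {
        metric["name"]: metric["score"]
        for metric in payload.get("metrics", [])
        if metric.get("error") is None and metric.get("score") is not None
    }


if __name__ == "__main__":
    # Abbreviated version of the example output above.
    raw = """
    {
      "metrics": [
        {"name": "rouge1", "category": "answer", "score": 0.7063, "result": null, "error": null},
        {"name": "bleu", "category": "answer", "score": 0.3712, "result": null, "error": null}
      ],
      "cost": {"cost": 0.0, "input_tokens": null, "output_tokens": null}
    }
    """
    print(scores_by_name(json.loads(raw)))
```

A skipped case (blank or missing reference) has an empty metrics list, so the helper returns an empty mapping for it.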

Usage

bash
OUTPUT_DIR=./out/reference
mkdir -p "$OUTPUT_DIR"

CLI: Estimate Cost

bash
nexagauge estimate reference \
  --input ./sample.json \
  --limit 5 \
  | tee "$OUTPUT_DIR/reference_estimate.txt"

estimate accepts --input and --limit but has no native --output-dir option, so the example above pipes its output through tee to save it under OUTPUT_DIR.

CLI: Run Evaluation

bash
nexagauge run reference \
  --input ./sample.json \
  --output-dir "$OUTPUT_DIR" \
  --limit 5

To produce the full per-case report JSON across all branches:

bash
nexagauge run eval \
  --input ./sample.json \
  --output-dir "$OUTPUT_DIR" \
  --limit 5