Reference (`reference`)

Overview

reference is a lexical-overlap evaluation node that compares a model generation with a gold reference answer using ROUGE-, BLEU-, and METEOR-style metrics.

The metric family comes from established summarization and machine translation (MT) evaluation work: BLEU (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005), and ROUGE (Lin, 2004). In practice, these metrics give fast, deterministic signals of word- and phrase-level overlap between candidate and reference text.

In nexa-gauge, reference computes five scores in [0,1]:

rouge1 (unigram overlap)
rouge2 (bigram overlap)
rougeL (overlap based on the longest common subsequence)
bleu
meteor

Unlike judge-model metrics, this node does not call an LLM and always reports zero cost. It is most useful as a baseline similarity signal, typically combined with semantic/judge-based metrics for fuller quality assessment.
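To make the overlap scores listed above concrete, here is a minimal pure-Python sketch of ROUGE-N and ROUGE-L as F1 scores. It is an illustration only: the lowercasing and whitespace tokenization are assumptions, the helper names (ngrams, f1, rouge_n, lcs_length, rouge_l) are hypothetical, and the actual nexa-gauge implementation may tokenize, stem, or aggregate differently.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(matches, cand_total, ref_total):
    """Harmonic mean of precision and recall; 0.0 when nothing matches."""
    if matches == 0 or cand_total == 0 or ref_total == 0:
        return 0.0
    precision = matches / cand_total
    recall = matches / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(candidate, reference, n):
    """ROUGE-N F1 on lowercased, whitespace-split tokens (assumed tokenization)."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # clipped n-gram matches
    return f1(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """ROUGE-L F1 from the LCS length, same assumed tokenization as above."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    return f1(lcs_length(cand, ref), len(cand), len(ref))


candidate = "the cat sat on the mat"
reference = "the cat lay on the mat"
print("rouge1:", rouge_n(candidate, reference, 1))
print("rouge2:", rouge_n(candidate, reference, 2))
print("rougeL:", rouge_l(candidate, reference))
```

In this toy pair, one substituted word costs a single unigram but two of the five bigrams, which is why rouge2 usually sits below rouge1 on the same text.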

Use Case

Use reference when you have trusted reference answers and want fast overlap-based quality checks.

Regression checks for answer fidelity against a gold target
Benchmark scoring where deterministic, low-latency metrics are needed
Sanity checking summarization/QA outputs before deeper judge-based evaluation
Comparing model variants with a consistent lexical baseline
Cost-sensitive pipelines that need non-LLM metrics

Node Overview

In nexa-gauge, reference is an answer metric node on this branch.

What the node does:

Reads normalized generation and reference text
Skips when reference is missing/blank (returns empty metrics, zero cost)
Computes ROUGE-1/2/L (F1), BLEU (smoothed sentence BLEU), and METEOR
Returns one MetricResult per metric
Returns zero-cost CostEstimate because no model calls are made
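The behavior listed above can be sketched roughly as follows. This is not the code in reference.py: the dataclasses are stand-ins that only mirror the field names shown in this page's example output, run_reference and exact_match are hypothetical names, and storing the exception message in error is an assumption about how a metric-level failure is recorded.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class MetricResult:  # stand-in mirroring the documented output fields
    name: str
    category: str = "answer"
    score: Optional[float] = None
    result: Optional[str] = None
    error: Optional[str] = None


@dataclass
class CostEstimate:  # stand-in; no model calls, so token counts stay None
    cost: float = 0.0
    input_tokens: Optional[int] = None
    output_tokens: Optional[int] = None


@dataclass
class ReferenceMetrics:  # stand-in for the documented outer shape
    metrics: List[MetricResult] = field(default_factory=list)
    cost: CostEstimate = field(default_factory=CostEstimate)


Scorer = Callable[[str, str], float]


def run_reference(generation: str, reference: Optional[str],
                  scorers: Dict[str, Scorer]) -> ReferenceMetrics:
    """Hypothetical node body: skip on a blank reference, else one result per scorer."""
    if reference is None or not reference.strip():
        return ReferenceMetrics()  # empty metrics, zero cost
    results = []
    for name, scorer in scorers.items():
        try:
            results.append(MetricResult(name=name, score=scorer(generation, reference)))
        except Exception as exc:  # assumed: a failure is recorded per metric, not raised
            results.append(MetricResult(name=name, error=str(exc)))
    return ReferenceMetrics(metrics=results)


def exact_match(generation: str, reference: str) -> float:
    """Toy scorer used only to exercise the flow; not one of the five metrics."""
    return 1.0 if generation.strip() == reference.strip() else 0.0


print(run_reference("Bitcoin is a currency.", "", {"toy": exact_match}))
print(run_reference("Bitcoin is a currency.", "Bitcoin is a currency.", {"toy": exact_match}))
```

Passing the scorers in as a dictionary keeps the skip and error handling separate from the metric math, so the same flow would apply to all five metrics.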

Execution Flow

[Figure: execution-flow graph of the reference node]

Input

Sample input:

json

{
  "case_id": "bitcoin-economics-medium",
  "question": "What is Bitcoin and how does it work as a currency?",
  "generation": "Bitcoin is a decentralised digital currency created in 2009 by the pseudonymous Satoshi Nakamoto. Unlike traditional currencies issued by central banks, Bitcoin operates on a peer-to-peer network with no central authority. ....",
  "reference": "Bitcoin is a decentralised digital currency launched in 2009, using blockchain technology and proof-of-work mining to verify transactions without a central authority. Its supply is capped at 21 million coins."
}

Fields used by the reference node:

generation: candidate text to score
reference: target text to compare against

Fields not used for scoring in this node:

question
case_id (used for report identity, not metric computation)

Output

Primary output type for this node is ReferenceMetrics (nexa_gauge_core/types.py).

metrics: list[MetricResult]
cost: CostEstimate

Note: RelevanceMetrics has a similar outer shape, but reference.py returns ReferenceMetrics.

Example output:

json

{
  "metrics": [
    {
      "name": "rouge1",
      "category": "answer",
      "score": 0.7063,
      "result": null,
      "error": null
    },
    {
      "name": "rouge2",
      "category": "answer",
      "score": 0.4921,
      "result": null,
      "error": null
    },
    {
      "name": "rougeL",
      "category": "answer",
      "score": 0.6554,
      "result": null,
      "error": null
    },
    {
      "name": "bleu",
      "category": "answer",
      "score": 0.3712,
      "result": null,
      "error": null
    },
    {
      "name": "meteor",
      "category": "answer",
      "score": 0.5987,
      "result": null,
      "error": null
    }
  ],
  "cost": {
    "cost": 0.0,
    "input_tokens": null,
    "output_tokens": null
  }
}

Attribute meaning:

metrics: metric results produced by this node (five when reference is present)
name: metric identifier (rouge1, rouge2, rougeL, bleu, meteor)
category: answer
score: metric value in [0,1] (higher means more overlap with the reference)
result: unused for these lexical metrics (null)
error: null on success; populated only if a metric-level failure occurs
cost.cost: always 0.0
cost.input_tokens, cost.output_tokens: always null (no LLM usage)
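As a hedged illustration of consuming this output, the sketch below parses a trimmed copy of the example payload, builds a name-to-score mapping, and checks the invariants described above. The helper names scores_by_name and check_payload are hypothetical, and the payload is embedded as a string because the on-disk layout of the report files is not specified here.

```python
import json

# Trimmed copy of the example output shown on this page (two of the five metrics).
report_json = """
{
  "metrics": [
    {"name": "rouge1", "category": "answer", "score": 0.7063, "result": null, "error": null},
    {"name": "bleu", "category": "answer", "score": 0.3712, "result": null, "error": null}
  ],
  "cost": {"cost": 0.0, "input_tokens": null, "output_tokens": null}
}
"""


def scores_by_name(payload: dict) -> dict:
    """Map metric name to score, skipping entries that report an error."""
    return {m["name"]: m["score"] for m in payload["metrics"] if m["error"] is None}


def check_payload(payload: dict) -> None:
    """Assert the documented invariants: scores in [0, 1], zero cost, no token counts."""
    for name, score in scores_by_name(payload).items():
        assert 0.0 <= score <= 1.0, f"{name} out of range: {score}"
    assert payload["cost"]["cost"] == 0.0
    assert payload["cost"]["input_tokens"] is None
    assert payload["cost"]["output_tokens"] is None


payload = json.loads(report_json)
check_payload(payload)
print(scores_by_name(payload))
```

An empty metrics list passes these checks too, which matches the skip case where no reference text is available.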

Usage

bash

OUTPUT_DIR=./out/reference
mkdir -p "$OUTPUT_DIR"

CLI: Estimate Cost

bash

nexagauge estimate reference \
  --input ./sample.json \
  --limit 5 \
  | tee "$OUTPUT_DIR/reference_estimate.txt"

estimate accepts --input and --limit but has no --output-dir option, so the example above pipes its output through tee into a file under OUTPUT_DIR.

CLI: Run Evaluation

bash

nexagauge run reference \
  --input ./sample.json \
  --output-dir "$OUTPUT_DIR" \
  --limit 5

For full per-case report JSON across all branches:

bash

nexagauge run eval \
  --input ./sample.json \
  --output-dir "$OUTPUT_DIR" \
  --limit 5
