Reference (reference)
Overview
reference is a lexical-overlap evaluation node that compares a model generation against a gold reference answer using ROUGE-, BLEU-, and METEOR-style metrics.
The metric family comes from established summarization and MT evaluation work: BLEU (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005), and ROUGE (Lin, 2004). In practice, these metrics provide fast, deterministic signals of lexical/phrase overlap between candidate and reference text.
In nexa-gauge, reference computes five scores in [0,1]:
- rouge1 (unigram overlap)
- rouge2 (bigram overlap)
- rougeL (longest common subsequence style overlap)
- bleu
- meteor
Unlike judge-model metrics, this node does not call an LLM and always reports zero cost. It is most useful as a baseline similarity signal, typically combined with semantic/judge-based metrics for fuller quality assessment.
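To make the overlap idea concrete, here is a minimal, self-contained sketch of ROUGE-1 F1 over whitespace tokens. This is illustrative only, not nexa-gauge's implementation, which may add tokenization, stemming, and casing rules.

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Illustrative ROUGE-1 F1: clipped unigram overlap between
    whitespace-tokenized, lowercased candidate and reference text."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # Counter & clips each count to the min
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(round(rouge1_f1("the cat sat on the mat", "the cat is on the mat"), 4))  # 0.8333
```

The clipped-count intersection (`cand & ref`) is what keeps a candidate from inflating its score by repeating a reference word.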
Use Case
Use reference when you have trusted reference answers and want fast overlap-based quality checks.
- Regression checks for answer fidelity against a gold target
- Benchmark scoring where deterministic, low-latency metrics are needed
- Sanity checking summarization/QA outputs before deeper judge-based evaluation
- Comparing model variants with a consistent lexical baseline
- Cost-sensitive pipelines that need non-LLM metrics
Node Overview
In nexa-gauge, reference is an answer metric node on this branch.
What the node does:
- Reads normalized generation and reference text
- Skips when reference is missing/blank (returns empty metrics, zero cost)
- Computes ROUGE-1/2/L (F1), BLEU (smoothed sentence BLEU), and METEOR
- Returns one MetricResult per metric
- Returns a zero-cost CostEstimate because no model calls are made
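The ROUGE-L computation listed above is LCS-based. A minimal sketch (again illustrative, not the node's actual code) uses a standard dynamic-programming longest-common-subsequence length:

```python
def lcs_len(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists (O(len(a)*len(b)))."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l_f1(candidate: str, reference: str) -> float:
    """Illustrative ROUGE-L F1 over whitespace tokens."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_len(cand, ref)
    if lcs == 0:
        return 0.0
    p, r = lcs / len(cand), lcs / len(ref)
    return 2 * p * r / (p + r)
```

Because LCS matches tokens in order but not necessarily contiguously, ROUGE-L rewards preserved sentence structure without requiring exact n-gram matches.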
Execution Flow
Input
Using your sample input:
{
"case_id": "bitcoin-economics-medium",
"question": "What is Bitcoin and how does it work as a currency?",
"generation": "Bitcoin is a decentralised digital currency created in 2009 by the pseudonymous Satoshi Nakamoto. Unlike traditional currencies issued by central banks, Bitcoin operates on a peer-to-peer network with no central authority. ....",
"reference": "Bitcoin is a decentralised digital currency launched in 2009, using blockchain technology and proof-of-work mining to verify transactions without a central authority. Its supply is capped at 21 million coins."
}

Fields used by the reference node:
- generation: candidate text to score
- reference: target text to compare against
Fields not used for scoring in this node:
- question
- case_id (used for report identity, not metric computation)
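Given the generation and reference fields, the "smoothed sentence BLEU" component can be sketched as below. This is an illustrative approximation (add-one smoothing on n-gram precisions plus a brevity penalty), not necessarily the smoothing variant the node uses.

```python
import math
from collections import Counter

def ngrams(tokens: list[str], n: int) -> list[tuple]:
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def smoothed_bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    """Illustrative sentence BLEU: geometric mean of add-one-smoothed
    n-gram precisions (n=1..max_n), scaled by a brevity penalty."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand:
        return 0.0
    log_prec = 0.0
    for n in range(1, max_n + 1):
        c_counts, r_counts = Counter(ngrams(cand, n)), Counter(ngrams(ref, n))
        matches = sum((c_counts & r_counts).values())  # clipped n-gram matches
        total = max(len(cand) - n + 1, 0)
        # add-one smoothing avoids log(0) when an n-gram order has no match
        log_prec += math.log((matches + 1) / (total + 1))
    bp = min(1.0, math.exp(1 - len(ref) / len(cand)))  # brevity penalty
    return bp * math.exp(log_prec / max_n)
```

Smoothing matters here because short answers often have zero 4-gram matches, which would otherwise force unsmoothed sentence BLEU to 0.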
Output
Primary output type for this node is ReferenceMetrics (nexa_gauge_core/types.py).
- metrics: list[MetricResult]
- cost: CostEstimate
Note: RelevanceMetrics has a similar outer shape, but reference.py returns ReferenceMetrics.
Example output:
{
"metrics": [
{
"name": "rouge1",
"category": "answer",
"score": 0.7063,
"result": null,
"error": null
},
{
"name": "rouge2",
"category": "answer",
"score": 0.4921,
"result": null,
"error": null
},
{
"name": "rougeL",
"category": "answer",
"score": 0.6554,
"result": null,
"error": null
},
{
"name": "bleu",
"category": "answer",
"score": 0.3712,
"result": null,
"error": null
},
{
"name": "meteor",
"category": "answer",
"score": 0.5987,
"result": null,
"error": null
}
],
"cost": {
"cost": 0.0,
"input_tokens": null,
"output_tokens": null
}
}

Attribute meaning:
- metrics: metric results produced by this node (five when reference is present)
- name: metric identifier (rouge1, rouge2, rougeL, bleu, meteor)
- category: answer
- score: metric value in [0,1] (higher is better overlap)
- result: unused for these lexical metrics (null)
- error: null on success; populated only if a metric-level failure occurs
- cost.cost: always 0.0
- cost.input_tokens, cost.output_tokens: always null (no LLM usage)
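Downstream code can consume this shape directly. A small sketch (field names follow the example output above; the truncated JSON literal is for illustration):

```python
import json

# A trimmed reference-node result, shaped like the example output above.
raw = '''
{
  "metrics": [
    {"name": "rouge1", "category": "answer", "score": 0.7063, "result": null, "error": null},
    {"name": "bleu", "category": "answer", "score": 0.3712, "result": null, "error": null}
  ],
  "cost": {"cost": 0.0, "input_tokens": null, "output_tokens": null}
}
'''
result = json.loads(raw)

# Index scores by metric name, skipping any metric that reported an error.
scores = {m["name"]: m["score"] for m in result["metrics"] if m["error"] is None}
print(scores["rouge1"])  # 0.7063
```

Filtering on `error is None` keeps a single failed metric from poisoning aggregate statistics while preserving the rest of the node's output.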
Usage
OUTPUT_DIR=./out/reference
mkdir -p "$OUTPUT_DIR"

CLI: Estimate Cost
nexagauge estimate reference \
--input ./sample.json \
--limit 5 \
  | tee "$OUTPUT_DIR/reference_estimate.txt"

estimate supports --input and --limit; it does not expose a native --output-dir option, so redirect/tee is used with OUTPUT_DIR.
CLI: Run Evaluation
nexagauge run reference \
--input ./sample.json \
--output-dir "$OUTPUT_DIR" \
  --limit 5

For full per-case report JSON across all branches:
nexagauge run eval \
--input ./sample.json \
--output-dir "$OUTPUT_DIR" \
--limit 5