# Relevance (relevance)

## Overview
Relevance measures whether an answer stays on-topic with the user’s question, at the claim level.
The design aligns with recent evaluation work:

- RAGAS (arXiv:2309.15217) emphasizes reference-free, component-level evaluation for RAG systems, including answer-quality dimensions beyond final exact-match-style scoring.
- FActScore (arXiv:2305.14251) shows why claim-level decomposition matters: one answer can mix good and bad statements, so per-claim judgment is more informative than a single coarse label.
- Judging LLM-as-a-Judge (arXiv:2306.05685) supports using strong LLM judges for scalable automated evaluation, while highlighting bias risks and the need for careful prompt and interpretation design.
In nexa-gauge, relevance follows this pattern: each claim extracted from the generation is checked against the question, and the judge returns a boolean verdict (relevant / not relevant). The final score is the fraction of claims judged relevant.
This metric answers: “Did the model answer the question asked?” It does not measure factual support against evidence (that is grounding) and does not compare against a reference answer (that is reference metrics).
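The scoring itself is simple; here is a minimal sketch, assuming one boolean judge verdict per claim (the function name is hypothetical, not part of nexa-gauge's API):

```python
def relevance_score(verdicts: list[bool]) -> float:
    """Fraction of claims judged relevant: relevant_claims / total_claims."""
    if not verdicts:
        return 0.0
    return sum(verdicts) / len(verdicts)

# e.g. three of four claims on-topic -> 0.75
assert relevance_score([True, True, True, False]) == 0.75
```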
## Use Case
Use relevance when you need to detect off-topic or partially on-topic responses:
- QA systems where drift/off-topic content hurts UX
- Agent outputs that tend to add unrelated details
- Regression checks after prompt/model updates
- Evaluation of concise answering behavior
- Triage of answer quality before deeper factual checks
## Node Overview
In nexa-gauge, relevance is an answer-category metric node.
What it does:
- Uses claims extracted from the `claims_extraction` node
- Uses the `question` as the relevance target
- Calls the judge model with the numbered claims and the question
- Expects structured output: `{"verdicts": [true/false, ...]}`
- Maps per-claim verdicts to `Relevancy` entries: `ACCEPTED` for relevant, `REJECTED` for not relevant
- Computes the score as `relevant_claims / total_claims`
Skip behavior:

- If claims are missing, relevance is disabled, or the question is empty, the node returns empty metrics and zero cost (see the sketch below).
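A minimal sketch of this flow, assuming a `judge` callable that returns the raw JSON string; the helper name and prompt wording are illustrative, not the node's actual implementation:

```python
import json

def evaluate_relevance(question: str, claims: list[str], judge) -> dict:
    # Skip behavior: missing claims or an empty question -> empty metrics, zero cost.
    if not claims or not question.strip():
        return {"metrics": [], "cost": {"cost": 0.0, "input_tokens": 0, "output_tokens": 0}}

    # The judge sees the question plus the numbered claims and must return
    # structured output: {"verdicts": [true/false, ...]}, one verdict per claim.
    numbered = "\n".join(f"{i + 1}. {claim}" for i, claim in enumerate(claims))
    prompt = (
        f"Question: {question}\n\nClaims:\n{numbered}\n\n"
        'Reply with JSON: {"verdicts": [true, false, ...]}, one verdict per claim.'
    )
    verdicts = json.loads(judge(prompt))["verdicts"]

    # Map booleans to Relevancy-style entries; score is the relevant fraction.
    result = [
        {"text": claim, "verdict": "ACCEPTED" if ok else "REJECTED"}
        for claim, ok in zip(claims, verdicts)
    ]
    score = sum(verdicts) / len(verdicts)
    return {"metrics": [{"name": "answer_relevancy", "category": "answer",
                         "score": score, "result": result}]}
```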
## Execution Flow

### Input
Using your sample input:
```json
{
  "case_id": "eiffel-tower-basic",
  "question": "What is the Eiffel Tower and where is it located?",
  "generation": "The Eiffel Tower is a wrought-iron lattice tower located in Paris, France. ......."
}
```

Fields used by the relevance branch:
- `generation`: used upstream to produce `claims_extraction`
- `question`: used directly by the `relevance` judge
- `case_id`: used for case/report identity, not score computation
Fields not required by this node:

- `context` is not needed for relevance scoring
- `reference` is not needed for relevance scoring
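For concreteness, a small sketch that loads the sample case and pulls out only what this branch consumes (the file name matches the Usage section below; everything else is illustrative):

```python
import json

with open("sample.json") as f:
    case = json.load(f)

question = case["question"]      # relevance target, passed to the judge
generation = case["generation"]  # consumed upstream by claims_extraction
case_id = case["case_id"]        # identifies the case in reports; not scored
# case.get("context") and case.get("reference") are ignored by this node.
```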
### Output
Primary output type: `RelevanceMetrics`

- `metrics: list[MetricResult]`
- `cost: CostEstimate`
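A rough shape of these types as Python dataclasses, reconstructed from the field names in the example output below; this is an illustrative approximation, not the library's source:

```python
from dataclasses import dataclass, field

@dataclass
class Claim:
    id: str
    text: str
    tokens: float
    confidence: float
    cached: bool

@dataclass
class Relevancy:
    item: Claim
    source_chunk_index: int
    confidence: float
    extraction_failed: bool
    verdict: str  # "ACCEPTED" or "REJECTED"

@dataclass
class MetricResult:
    name: str      # "answer_relevancy"
    category: str  # "answer"
    score: float   # relevant_claims / total_claims
    result: list[Relevancy] = field(default_factory=list)
    error: str | None = None

@dataclass
class CostEstimate:
    cost: float    # USD
    input_tokens: float
    output_tokens: float

@dataclass
class RelevanceMetrics:
    metrics: list[MetricResult]
    cost: CostEstimate
```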
Example output:
```json
{
  "metrics": [
    {
      "name": "answer_relevancy",
      "category": "answer",
      "score": 0.5,
      "result": [
        {
          "item": {
            "id": "11aa22bb33cc44dd",
            "text": "The Eiffel Tower is a wrought-iron lattice tower in Paris.",
            "tokens": 12.0,
            "confidence": 1.0,
            "cached": false
          },
          "source_chunk_index": 0,
          "confidence": 0.92,
          "extraction_failed": false,
          "verdict": "ACCEPTED"
        },
        {
          "item": {
            "id": "55ee66ff77gg88hh",
            "text": "Transformers use self-attention in deep learning.",
            "tokens": 9.0,
            "confidence": 1.0,
            "cached": false
          },
          "source_chunk_index": 1,
          "confidence": 0.85,
          "extraction_failed": false,
          "verdict": "REJECTED"
        }
      ],
      "error": null
    }
  ],
  "cost": {
    "cost": 0.00039,
    "input_tokens": 188.0,
    "output_tokens": 16.0
  }
}
```

Attribute meaning:
- `metrics`: list of metric results for this node (empty when skipped)
- `name`: metric identifier (`answer_relevancy` in the current implementation)
- `category`: `answer`
- `score`: ratio of relevant claims, in `[0, 1]`
- `result`: per-claim relevance judgments (`Relevancy`)
- `result[].item`: claim text and token metadata
- `result[].source_chunk_index`: source generation chunk index
- `result[].confidence`: claim extractor confidence
- `result[].extraction_failed`: extraction failure flag
- `result[].verdict`: `ACCEPTED` (relevant) or `REJECTED` (not relevant)
- `error`: populated if the judge output has no usable verdicts
- `cost.cost`: USD cost for relevance evaluation
- `cost.input_tokens`, `cost.output_tokens`: token usage for the judge call
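The score can be sanity-checked directly from a saved report: in the example above, 1 of 2 claims is `ACCEPTED`, giving 0.5. A small sketch (the report file name is hypothetical):

```python
import json

with open("relevance-report.json") as f:  # hypothetical output file name
    report = json.load(f)

for metric in report["metrics"]:
    accepted = sum(r["verdict"] == "ACCEPTED" for r in metric["result"])
    total = len(metric["result"])
    print(f"{metric['name']}: {accepted}/{total} relevant -> score {metric['score']}")
    assert total == 0 or abs(metric["score"] - accepted / total) < 1e-9
```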
Usage
OUTPUT_DIR=./out/relevance
mkdir -p "$OUTPUT_DIR"Estimate Cost
```bash
nexagauge estimate relevance \
  --input ./sample.json \
  --limit 5 \
  | tee "$OUTPUT_DIR/relevance-estimate.txt"
```

Note: `estimate` supports `--input` and `--limit`; it does not expose a native `--output-dir` flag, so the output is captured with `tee` into `OUTPUT_DIR`.
### Run Evaluation

```bash
nexagauge run relevance \
  --input ./sample.json \
  --limit 5 \
  --output-dir "$OUTPUT_DIR"
```

For full aggregation/report files including all metrics:
```bash
nexagauge run eval \
  --input ./sample.json \
  --limit 5 \
  --output-dir "$OUTPUT_DIR"
```
--output-dir "$OUTPUT_DIR"