Relevance (relevance)

Overview

Relevance measures whether an answer stays on-topic with the user’s question, at the claim level.

The design aligns with recent evaluation work:

  • RAGAS (arXiv:2309.15217) emphasizes reference-free, component-level evaluation for RAG systems, including answer-quality dimensions beyond exact-match-style final scoring.
  • FActScore (arXiv:2305.14251) shows why claim-level decomposition matters: a single answer can mix good and bad statements, so per-claim judgment is more informative than one coarse label.
  • Judging LLM-as-a-Judge (arXiv:2306.05685) supports using strong LLM judges for scalable automated evaluation, while highlighting bias risks and the need for careful prompt and interpretation design.

In nexa-gauge, relevance follows this pattern by checking each extracted claim from the generation against the question and returning boolean verdicts (relevant / not relevant). The final score is the fraction of claims judged relevant.

This metric answers: “Did the model answer the question asked?” It does not measure factual support against evidence (that is grounding) and does not compare against a reference answer (that is reference metrics).
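
As a minimal illustration of the claim-level scoring (not the nexa-gauge implementation itself), the score is simply the fraction of boolean verdicts that are true:

python
# Minimal sketch of claim-level relevance scoring; values are illustrative.
verdicts = [True, False]  # e.g. claim 1 on-topic, claim 2 off-topic
score = sum(verdicts) / len(verdicts) if verdicts else 0.0
print(score)  # 0.5, matching the example output later in this page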

Use Case

Use relevance when you need to detect off-topic or partially on-topic responses:

  • QA systems where drift/off-topic content hurts UX
  • Agent outputs that tend to add unrelated details
  • Regression checks after prompt/model updates
  • Evaluation of concise answering behavior
  • Triage of answer quality before deeper factual checks

Node Overview

In nexa-gauge, relevance is an answer-category metric node.

What it does:

  • Uses the claims produced by the upstream claims_extraction node
  • Uses the question as relevance target
  • Calls the judge model with numbered claims and question
  • Expects structured output: {"verdicts": [true/false, ...]}
  • Maps per-claim verdicts to Relevancy entries:
    • ACCEPTED for relevant
    • REJECTED for not relevant
  • Computes score as: relevant_claims / total_claims

Skip behavior:

  • If claims are missing, relevance is disabled, or question is empty, returns empty metrics and zero cost.
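
The sketch below approximates this logic in Python. It is illustrative only: the function name, the judge callable, and the returned dictionary shape are assumptions and do not reflect the actual nexa-gauge API.

python
# Illustrative sketch of the relevance node's logic; names are hypothetical
# and do not correspond to the real nexa-gauge internals.
def score_relevance(claims: list[str], question: str, judge) -> dict:
    # Skip behavior: missing claims or an empty question -> empty metrics, zero cost.
    if not claims or not question.strip():
        return {"metrics": [], "cost": 0.0}

    # The judge sees the question plus the numbered claims and must return
    # structured output of the form {"verdicts": [true, false, ...]}.
    numbered = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(claims))
    verdicts = judge(question=question, claims=numbered)["verdicts"]

    # Map each boolean verdict to ACCEPTED / REJECTED, then score the fraction.
    results = [
        {"text": claim, "verdict": "ACCEPTED" if ok else "REJECTED"}
        for claim, ok in zip(claims, verdicts)
    ]
    relevant = sum(1 for r in results if r["verdict"] == "ACCEPTED")
    return {
        "metrics": [{
            "name": "answer_relevancy",
            "category": "answer",
            "score": relevant / len(results),
            "result": results,
        }],
        "cost": 0.0,  # the real node reports the judge call's token usage and USD cost
    }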

Execution Flow

Graph
(Execution graph: claims_extraction → relevance judge → per-claim verdicts → RelevanceMetrics)

Input

Using your sample input:

json
{
  "case_id": "eiffel-tower-basic",
  "question": "What is the Eiffel Tower and where is it located?",
  "generation": "The Eiffel Tower is a wrought-iron lattice tower located in Paris, France. ......."
}

Fields used by the relevance branch:

  • generation: used upstream by claims_extraction to produce the claims
  • question: used directly by the relevance judge
  • case_id: used for case/report identity, not score computation

Fields not required by this node:

  • context is not needed for relevance scoring
  • reference is not needed for relevance scoring
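
If you need to produce a compatible input file, a minimal sketch follows. It assumes the input is a JSON array of case objects with the fields shown above; check your actual input schema before relying on it.

python
import json

# Hypothetical helper for producing an input file; assumes the input format is
# a JSON array of case objects with case_id / question / generation fields.
cases = [
    {
        "case_id": "eiffel-tower-basic",
        "question": "What is the Eiffel Tower and where is it located?",
        "generation": "The Eiffel Tower is a wrought-iron lattice tower located in Paris, France.",
    },
]

with open("sample.json", "w", encoding="utf-8") as f:
    json.dump(cases, f, indent=2, ensure_ascii=False)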

Output

Primary output type:

  • RelevanceMetrics
    • metrics: list[MetricResult]
    • cost: CostEstimate

Example output:

json
{
  "metrics": [
    {
      "name": "answer_relevancy",
      "category": "answer",
      "score": 0.5,
      "result": [
        {
          "item": {
            "id": "11aa22bb33cc44dd",
            "text": "The Eiffel Tower is a wrought-iron lattice tower in Paris.",
            "tokens": 12.0,
            "confidence": 1.0,
            "cached": false
          },
          "source_chunk_index": 0,
          "confidence": 0.92,
          "extraction_failed": false,
          "verdict": "ACCEPTED"
        },
        {
          "item": {
            "id": "55ee66ff77gg88hh",
            "text": "Transformers use self-attention in deep learning.",
            "tokens": 9.0,
            "confidence": 1.0,
            "cached": false
          },
          "source_chunk_index": 1,
          "confidence": 0.85,
          "extraction_failed": false,
          "verdict": "REJECTED"
        }
      ],
      "error": null
    }
  ],
  "cost": {
    "cost": 0.00039,
    "input_tokens": 188.0,
    "output_tokens": 16.0
  }
}

Attribute meaning:

  • metrics: list of metric results for this node (empty when skipped)
  • name: metric identifier (answer_relevancy in current implementation)
  • category: answer
  • score: ratio of relevant claims in [0, 1]
  • result: per-claim relevance judgments (Relevancy)
  • result[].item: claim text and token metadata
  • result[].source_chunk_index: source generation chunk index
  • result[].confidence: claim extractor confidence
  • result[].extraction_failed: extraction failure flag
  • result[].verdict: ACCEPTED (relevant) or REJECTED (not relevant)
  • error: populated if judge output has no usable verdicts
  • cost.cost: USD cost for relevance evaluation
  • cost.input_tokens, cost.output_tokens: token usage for the judge call
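
After a run, the per-claim verdicts can be inspected from the saved JSON. The sketch below assumes the RelevanceMetrics payload was written to a file under the output directory; the file name used here is an assumption and depends on your nexagauge version.

python
import json

# Hypothetical post-processing of a saved RelevanceMetrics payload; adjust the
# path to whatever your run actually writes under --output-dir.
with open("out/relevance/relevance-metrics.json", encoding="utf-8") as f:
    payload = json.load(f)

for metric in payload["metrics"]:
    rejected = [r["item"]["text"] for r in metric["result"] if r["verdict"] == "REJECTED"]
    print(f"{metric['name']}: score={metric['score']}")
    for text in rejected:
        print(f"  off-topic claim: {text}")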

Usage

bash
OUTPUT_DIR=./out/relevance
mkdir -p "$OUTPUT_DIR"

Estimate Cost

bash
nexagauge estimate relevance \
  --input ./sample.json \
  --limit 5 \
  | tee "$OUTPUT_DIR/relevance-estimate.txt"

Note: estimate supports --input and --limit; it does not expose a native --output-dir flag, so its output is captured with tee into a file under OUTPUT_DIR instead.

Run Evaluation

bash
nexagauge run relevance \
  --input ./sample.json \
  --limit 5 \
  --output-dir "$OUTPUT_DIR"

For full aggregation/report files including all metrics:

bash
nexagauge run eval \
  --input ./sample.json \
  --limit 5 \
  --output-dir "$OUTPUT_DIR"