Relevance (relevance)

Overview

Relevance measures whether an answer stays on-topic with the user’s question, at the claim level.

The design aligns with recent evaluation work:

  • RAGAS (arXiv:2309.15217) emphasizes reference-free, component-level evaluation for RAG systems, including answer-quality dimensions beyond exact-match-style final scoring.
  • FActScore (arXiv:2305.14251) shows why claim-level decomposition matters: a single answer can mix good and bad statements, so per-claim judgment is more informative than one coarse label.
  • Judging LLM-as-a-Judge (arXiv:2306.05685) supports using strong LLM judges for scalable automated evaluation, while highlighting bias risks and the need for careful prompt and interpretation design.

In nexa-gauge, relevance follows this pattern by checking each extracted claim from the generation against the question and returning boolean verdicts (relevant / not relevant). The final score is the fraction of claims judged relevant.

This metric answers: “Did the model answer the question asked?” It does not measure factual support against evidence (that is grounding) and does not compare against a reference answer (that is reference metrics).
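
As a minimal illustration of the claim-level scoring (not the nexa-gauge implementation itself), the score is simply the fraction of boolean verdicts that are true:

python
# Minimal sketch of claim-level relevance scoring; values are illustrative.
verdicts = [True, False]  # e.g. claim 1 on-topic, claim 2 off-topic
score = sum(verdicts) / len(verdicts) if verdicts else 0.0
print(score)  # 0.5, matching the example output later in this page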

Use Case

Use relevance when you need to detect off-topic or partially on-topic responses:

  • QA systems where drift/off-topic content hurts UX
  • Agent outputs that tend to add unrelated details
  • Regression checks after prompt/model updates
  • Evaluation of concise answering behavior
  • Triage of answer quality before deeper factual checks

Node Overview

In nexa-gauge, relevance is an answer-category metric node.

What it does:

  • Uses the claims produced by the upstream claims_extraction node
  • Uses the question as relevance target
  • Calls the judge model with numbered claims and question
  • Expects structured output: {"verdicts": [true/false, ...]}
  • Maps per-claim verdicts to Relevancy entries:
    • ACCEPTED for relevant
    • REJECTED for not relevant
  • Computes score as: relevant_claims / total_claims

Skip behavior:

  • If claims are missing, relevance is disabled, or question is empty, returns empty metrics and zero cost.
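
The sketch below approximates this logic in Python. It is illustrative only: the function name, the judge callable, and the returned dictionary shape are assumptions and do not reflect the actual nexa-gauge API.

python
# Illustrative sketch of the relevance node's logic; names are hypothetical
# and do not correspond to the real nexa-gauge internals.
def score_relevance(claims: list[str], question: str, judge) -> dict:
    # Skip behavior: missing claims or an empty question -> empty metrics, zero cost.
    if not claims or not question.strip():
        return {"metrics": [], "cost": 0.0}

    # The judge sees the question plus the numbered claims and must return
    # structured output of the form {"verdicts": [true, false, ...]}.
    numbered = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(claims))
    verdicts = judge(question=question, claims=numbered)["verdicts"]

    # Map each boolean verdict to ACCEPTED / REJECTED, then score the fraction.
    results = [
        {"text": claim, "verdict": "ACCEPTED" if ok else "REJECTED"}
        for claim, ok in zip(claims, verdicts)
    ]
    relevant = sum(1 for r in results if r["verdict"] == "ACCEPTED")
    return {
        "metrics": [{
            "name": "answer_relevancy",
            "category": "answer",
            "score": relevant / len(results),
            "result": results,
        }],
        "cost": 0.0,  # the real node reports the judge call's token usage and USD cost
    }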

Execution Flow

Graph
(Execution graph: claims_extraction → relevance judge → per-claim verdicts → RelevanceMetrics)

Input

Using your sample input:

json
{
  "case_id": "eiffel-tower-basic",
  "question": "What is the Eiffel Tower and where is it located?",
  "generation": "The Eiffel Tower is a wrought-iron lattice tower located in Paris, France. ......."
}

Fields used by the relevance branch:

  • generation: used upstream by claims_extraction to produce the claims
  • question: used directly by the relevance judge
  • case_id: used for case/report identity, not score computation

Fields not required by this node:

  • context is not needed for relevance scoring
  • reference is not needed for relevance scoring
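
If you need to produce a compatible input file, a minimal sketch follows. It assumes the input is a JSON array of case objects with the fields shown above; check your actual input schema before relying on it.

python
import json

# Hypothetical helper for producing an input file; assumes the input format is
# a JSON array of case objects with case_id / question / generation fields.
cases = [
    {
        "case_id": "eiffel-tower-basic",
        "question": "What is the Eiffel Tower and where is it located?",
        "generation": "The Eiffel Tower is a wrought-iron lattice tower located in Paris, France.",
    },
]

with open("sample.json", "w", encoding="utf-8") as f:
    json.dump(cases, f, indent=2, ensure_ascii=False)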

Output

Primary output type:

  • RelevanceMetrics
    • metrics: list[MetricResult]
    • cost: CostEstimate

Example output:

json
{
  "metrics": [
    {
      "name": "answer_relevancy",
      "category": "answer",
      "score": 0.5,
      "result": [
        {
          "item": {
            "id": "11aa22bb33cc44dd",
            "text": "The Eiffel Tower is a wrought-iron lattice tower in Paris.",
            "tokens": 12.0,
            "confidence": 1.0,
            "cached": false
          },
          "source_chunk_index": 0,
          "confidence": 0.92,
          "extraction_failed": false,
          "verdict": "ACCEPTED"
        },
        {
          "item": {
            "id": "55ee66ff77gg88hh",
            "text": "Transformers use self-attention in deep learning.",
            "tokens": 9.0,
            "confidence": 1.0,
            "cached": false
          },
          "source_chunk_index": 1,
          "confidence": 0.85,
          "extraction_failed": false,
          "verdict": "REJECTED"
        }
      ],
      "error": null
    }
  ],
  "cost": {
    "cost": 0.00039,
    "input_tokens": 188.0,
    "output_tokens": 16.0
  }
}

Attribute meaning:

  • metrics: list of metric results for this node (empty when skipped)
  • name: metric identifier (answer_relevancy in current implementation)
  • category: answer
  • score: ratio of relevant claims in [0, 1]
  • result: per-claim relevance judgments (Relevancy)
  • result[].item: claim text and token metadata
  • result[].source_chunk_index: source generation chunk index
  • result[].confidence: claim extractor confidence
  • result[].extraction_failed: extraction failure flag
  • result[].verdict: ACCEPTED (relevant) or REJECTED (not relevant)
  • error: populated if judge output has no usable verdicts
  • cost.cost: USD cost for relevance evaluation
  • cost.input_tokens, cost.output_tokens: token usage for the judge call
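
After a run, the per-claim verdicts can be inspected from the saved JSON. The sketch below assumes the RelevanceMetrics payload was written to a file under the output directory; the file name used here is an assumption and depends on your nexagauge version.

python
import json

# Hypothetical post-processing of a saved RelevanceMetrics payload; adjust the
# path to whatever your run actually writes under --output-dir.
with open("out/relevance/relevance-metrics.json", encoding="utf-8") as f:
    payload = json.load(f)

for metric in payload["metrics"]:
    rejected = [r["item"]["text"] for r in metric["result"] if r["verdict"] == "REJECTED"]
    print(f"{metric['name']}: score={metric['score']}")
    for text in rejected:
        print(f"  off-topic claim: {text}")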

Usage

bash
OUTPUT_DIR=./out/relevance
mkdir -p "$OUTPUT_DIR"

Estimate Cost

bash
nexagauge estimate relevance \
  --input ./sample.json \
  --limit 5 \
  | tee "$OUTPUT_DIR/relevance-estimate.txt"

Note: estimate supports --input and --limit; it does not expose a native --output-dir flag, so its output is captured with tee into a file under OUTPUT_DIR instead.

Run Evaluation

bash
nexagauge run relevance \
  --input ./sample.json \
  --limit 5 \
  --output-dir "$OUTPUT_DIR"

For full aggregation/report files including all metrics:

bash
nexagauge run eval \
  --input ./sample.json \
  --limit 5 \
  --output-dir "$OUTPUT_DIR"