# Claims (`claims`)

## Overview
The `claims` utility converts a free-form generation into atomic, verifiable claim units. This is a core evaluation primitive: claim-level decomposition lets you score faithfulness, relevance, and hallucination risk with much higher precision than whole-response scoring.
Why this matters:
- Long answers often mix correct and incorrect content. A single “good/bad” label hides this mixture.
- Claim-level units create a stable interface between generation and downstream judges (for example relevance and grounding).
- Decomposed claims improve traceability: each verdict can point back to one claim and its source chunk.
This design is strongly aligned with prior work:
- FActScore argues factuality should be measured over atomic facts rather than coarse response-level judgments. https://arxiv.org/abs/2305.14251
- RAGAS operationalizes component-level evaluation in RAG pipelines and uses statement-level judging for answer quality dimensions. https://arxiv.org/abs/2309.15217
- Knowledge-Centric Hallucination Detection (RefChecker) shows fine-grained claim representations outperform coarser granularities for hallucination detection. https://aclanthology.org/2024.emnlp-main.395/
- The LLMs-as-Judges survey summarizes why structured, task-specific judging pipelines (including decomposed units) improve practical evaluation quality and scalability. https://arxiv.org/html/2412.05579v2
In nexa-gauge, `claims` extracts the single most important atomic claim per chunk (with confidence), then outputs `ClaimArtifacts`. These artifacts become the direct substrate for downstream metrics such as grounding and relevance.
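For orientation, here is a minimal, hypothetical sketch of the data shapes involved, inferred from the example output later on this page; the actual nexa-gauge type definitions may differ in detail:

```python
from dataclasses import dataclass

# Hypothetical reconstruction of the artifact schema shown in the
# "Output" section below; not the actual nexa-gauge source.

@dataclass
class Item:
    id: str            # hash-based ID derived from the claim text
    text: str          # normalized atomic claim text
    tokens: float      # token count for the claim text
    confidence: float  # Item-level confidence (default type field)
    cached: bool       # cache marker

@dataclass
class Claim:
    item: Item
    source_chunk_index: int  # which chunk the claim came from
    confidence: float        # extractor confidence, 0-1
    extraction_failed: bool  # flag set when extraction fails for a chunk

@dataclass
class CostEstimate:
    cost: float           # total USD cost across extraction calls
    input_tokens: float   # summed prompt tokens
    output_tokens: float  # summed completion tokens

@dataclass
class ClaimArtifacts:
    claims: list[Claim]
    cost: CostEstimate
```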
## Use Case

Use `claims` when you need:
- Fine-grained hallucination analysis (not just response-level pass/fail)
- Better signal for relevance and grounding metrics
- Explainable evaluation outputs tied to specific claim units
- Stable regression comparisons across prompt/model changes
- Downstream metric pipelines that require normalized factual units
## Node Overview
In nexa-gauge, `claims` sits in the preprocessing branch:

`scan -> chunk -> claims`
What it does:
- Reads chunked generation text (`Chunk` list)
- Calls an LLM extractor per chunk with a structured schema
- Produces `Claim` objects with:
  - extracted claim text
  - source chunk index
  - extractor confidence
  - token count metadata
- Aggregates per-chunk token/cost usage into one `CostEstimate`
- Returns `ClaimArtifacts(claims=[...], cost=...)`
Implementation note:
- Prompt asks for exactly one atomic claim per chunk, returned as JSON (`claims[]`, `confidences[]`).
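To illustrate that contract, here is a hedged sketch of what one per-chunk extraction step could look like. The `call_llm` helper and the prompt wording are hypothetical stand-ins, not nexa-gauge's actual implementation; only the JSON shape (`claims[]`, `confidences[]`) comes from the note above:

```python
import json

# Hypothetical prompt; the real nexa-gauge prompt text is not shown here.
PROMPT_TEMPLATE = (
    "Extract the single most important atomic claim from the text below. "
    'Respond with JSON: {{"claims": ["..."], "confidences": [0.0]}}\n\n'
    "Text:\n{chunk}"
)

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for the real LLM client call."""
    raise NotImplementedError

def extract_claim(chunk_text: str, chunk_index: int) -> dict:
    # One extraction call per chunk, parsed against the structured schema.
    raw = call_llm(PROMPT_TEMPLATE.format(chunk=chunk_text))
    try:
        payload = json.loads(raw)
        return {
            "text": payload["claims"][0],
            "source_chunk_index": chunk_index,
            "confidence": payload["confidences"][0],
            "extraction_failed": False,
        }
    except (json.JSONDecodeError, KeyError, IndexError):
        # Malformed responses surface as a failure flag rather than a crash,
        # mirroring the extraction_failed field in the output schema.
        return {
            "text": "",
            "source_chunk_index": chunk_index,
            "confidence": 0.0,
            "extraction_failed": True,
        }
```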
## Execution Flow

### Input
Using your sample input:
```json
{
  "case_id": "shakespeare-hamlet-short",
  "generation": "The central theme of Hamlet is mortality and the paralysis that arises from contemplating it. Through the famous 'To be or not to be' soliloquy and repeated encounters with death — the Ghost, Yorick's skull, Ophelia's drowning — Shakespeare explores how consciousness of death impedes decisive action. Hamlet's indecision stems not from cowardice but from his philosophical nature: he cannot act without questioning the meaning and consequences of every action."
}
```

Fields used by the claims branch:
- `generation`: required; this is chunked and then converted to claims
- `case_id`: used for case identity/reporting, not claim extraction logic
Direct node-level input to `ClaimExtractorNode.run(...)` is:
- `chunks: list[Chunk]` (produced by the upstream `chunk` node from `generation`)
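If you drive the node programmatically rather than through the CLI, the call shape described above looks roughly like the following. The import paths, `Chunk` construction, and the no-argument constructor are assumptions for illustration; check the nexa-gauge source for the real signatures:

```python
# Hypothetical import paths; adjust to the actual package layout.
from nexa_gauge.nodes import ClaimExtractorNode
from nexa_gauge.types import Chunk

# Assumed Chunk constructor; in practice these come from the upstream
# chunk node, not manual construction.
chunks: list[Chunk] = [
    Chunk(text="The central theme of Hamlet is mortality ..."),
    Chunk(text="Hamlet's indecision stems not from cowardice ..."),
]

node = ClaimExtractorNode()   # assumed no-arg construction
artifacts = node.run(chunks)  # returns ClaimArtifacts(claims=[...], cost=...)
for claim in artifacts.claims:
    print(claim.source_chunk_index, claim.confidence, claim.item.text)
```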
Fields not required for claims:
- `question`, `context`, `reference`
### Output
Primary output type: `ClaimArtifacts`
- `claims: list[Claim]`
- `cost: CostEstimate`
Example output:
```json
{
"claims": [
{
"item": {
"id": "9df6db8c5d0c9a41",
"text": "A central theme of Hamlet is mortality and its effect on action.",
"tokens": 15.0,
"confidence": 1.0,
"cached": false
},
"source_chunk_index": 0,
"confidence": 0.91,
"extraction_failed": false
},
{
"item": {
"id": "f0f53c3c0119530d",
"text": "Hamlet's indecision is tied to philosophical reflection rather than simple cowardice.",
"tokens": 16.0,
"confidence": 1.0,
"cached": false
},
"source_chunk_index": 1,
"confidence": 0.88,
"extraction_failed": false
}
],
"cost": {
"cost": 0.00074,
"input_tokens": 240.0,
"output_tokens": 56.0
}
}
```

Attribute meaning:
- `claims`: extracted claim units across all generation chunks
- `claims[].item.id`: auto-generated hash-based ID from claim text
- `claims[].item.text`: normalized atomic claim text
- `claims[].item.tokens`: token count for that claim text
- `claims[].item.confidence`: `Item`-level confidence (default type field)
- `claims[].item.cached`: cache marker field on `Item`
- `claims[].source_chunk_index`: originating chunk index
- `claims[].confidence`: extractor confidence score for that claim (0–1)
- `claims[].extraction_failed`: extraction failure flag (false for valid claims)
- `cost.cost`: total USD cost for all claim extraction calls
- `cost.input_tokens`: summed prompt tokens across chunks
- `cost.output_tokens`: summed completion tokens across chunks
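To make these fields concrete, here is a small, self-contained script that summarizes a `ClaimArtifacts` JSON payload like the one above. The file path is an assumption; point it at wherever your run wrote the artifact:

```python
import json

# Assumed file name; substitute the artifact path your run actually produced.
with open("./out/claims/claim-artifacts.json") as f:
    artifacts = json.load(f)

# Separate usable claims from failed extractions.
ok = [c for c in artifacts["claims"] if not c["extraction_failed"]]
failed = len(artifacts["claims"]) - len(ok)

print(f"claims extracted: {len(ok)} (failed: {failed})")
if ok:
    avg_conf = sum(c["confidence"] for c in ok) / len(ok)
    print(f"avg extractor confidence: {avg_conf:.2f}")

cost = artifacts["cost"]
print(f"cost: ${cost['cost']:.5f} "
      f"({cost['input_tokens']:.0f} in / {cost['output_tokens']:.0f} out tokens)")
```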
## Usage
```bash
OUTPUT_DIR=./out/claims
mkdir -p "$OUTPUT_DIR"
```

### Estimate Cost
```bash
nexagauge estimate claims \
  --input ./sample.json \
  --limit 5 \
  | tee "$OUTPUT_DIR/claims-estimate.txt"
```

Note: `estimate` supports `--input` and `--limit`; it does not provide a native `--output-dir` flag, so `tee` is used to write output into your chosen output directory.
### Run Evaluation
```bash
nexagauge run claims \
  --input ./sample.json \
  --limit 5 \
  --output-dir "$OUTPUT_DIR"
```