# Claims (`claims`)

## Overview
The `claims` utility converts a free-form generation into atomic, verifiable claim units. This is a core evaluation primitive: claim-level decomposition lets you score faithfulness, relevance, and hallucination risk with much higher precision than whole-response scoring.
Why this matters:
- Long answers often mix correct and incorrect content. A single “good/bad” label hides this mixture.
- Claim-level units create a stable interface between generation and downstream judges (for example relevance and grounding).
- Decomposed claims improve traceability: each verdict can point back to one claim and its source chunk.
This design is strongly aligned with prior work:
- FActScore argues factuality should be measured over atomic facts rather than coarse response-level judgments. https://arxiv.org/abs/2305.14251
- RAGAS operationalizes component-level evaluation in RAG pipelines and uses statement-level judging for answer quality dimensions. https://arxiv.org/abs/2309.15217
- Knowledge-Centric Hallucination Detection (RefChecker) shows fine-grained claim representations outperform coarser granularities for hallucination detection. https://aclanthology.org/2024.emnlp-main.395/
- The LLMs-as-Judges survey summarizes why structured, task-specific judging pipelines (including decomposed units) improve practical evaluation quality and scalability. https://arxiv.org/html/2412.05579v2
In nexa-gauge, `claims` extracts the single most important atomic claim per chunk (with confidence), then outputs `ClaimArtifacts`. These artifacts become the direct substrate for downstream metrics such as grounding and relevance.
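For orientation, here is a minimal, hypothetical sketch of the data shapes involved, inferred from the example output later on this page; the actual nexa-gauge type definitions may differ in detail:

```python
from dataclasses import dataclass

# Hypothetical reconstruction of the artifact schema shown in the
# "Output" section below; not the actual nexa-gauge source.

@dataclass
class Item:
    id: str            # hash-based ID derived from the claim text
    text: str          # normalized atomic claim text
    tokens: float      # token count for the claim text
    confidence: float  # Item-level confidence (default type field)
    cached: bool       # cache marker

@dataclass
class Claim:
    item: Item
    source_chunk_index: int  # which chunk the claim came from
    confidence: float        # extractor confidence, 0-1
    extraction_failed: bool  # flag set when extraction fails for a chunk

@dataclass
class CostEstimate:
    cost: float           # total USD cost across extraction calls
    input_tokens: float   # summed prompt tokens
    output_tokens: float  # summed completion tokens

@dataclass
class ClaimArtifacts:
    claims: list[Claim]
    cost: CostEstimate
```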
## Use Case

Use `claims` when you need:
- Fine-grained hallucination analysis (not just response-level pass/fail)
- Better signal for relevance and grounding metrics
- Explainable evaluation outputs tied to specific claim units
- Stable regression comparisons across prompt/model changes
- Downstream metric pipelines that require normalized factual units
## Node Overview
In nexa-gauge, `claims` sits in the preprocessing branch:

`scan -> chunk -> claims`
What it does:
- Reads chunked generation text (`Chunk` list)
- Calls an LLM extractor per chunk with a structured schema
- Produces `Claim` objects with:
  - extracted claim text
  - source chunk index
  - extractor confidence
  - token count metadata
- Aggregates per-chunk token/cost usage into one `CostEstimate`
- Returns `ClaimArtifacts(claims=[...], cost=...)`
Implementation note:
- Prompt asks for exactly one atomic claim per chunk, returned as JSON (`claims[]`, `confidences[]`).
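To illustrate that contract, here is a hedged sketch of what one per-chunk extraction step could look like. The `call_llm` helper and the prompt wording are hypothetical stand-ins, not nexa-gauge's actual implementation; only the JSON shape (`claims[]`, `confidences[]`) comes from the note above:

```python
import json

# Hypothetical prompt; the real nexa-gauge prompt text is not shown here.
PROMPT_TEMPLATE = (
    "Extract the single most important atomic claim from the text below. "
    'Respond with JSON: {{"claims": ["..."], "confidences": [0.0]}}\n\n'
    "Text:\n{chunk}"
)

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for the real LLM client call."""
    raise NotImplementedError

def extract_claim(chunk_text: str, chunk_index: int) -> dict:
    # One extraction call per chunk, parsed against the structured schema.
    raw = call_llm(PROMPT_TEMPLATE.format(chunk=chunk_text))
    try:
        payload = json.loads(raw)
        return {
            "text": payload["claims"][0],
            "source_chunk_index": chunk_index,
            "confidence": payload["confidences"][0],
            "extraction_failed": False,
        }
    except (json.JSONDecodeError, KeyError, IndexError):
        # Malformed responses surface as a failure flag rather than a crash,
        # mirroring the extraction_failed field in the output schema.
        return {
            "text": "",
            "source_chunk_index": chunk_index,
            "confidence": 0.0,
            "extraction_failed": True,
        }
```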
## Execution Flow

### Input
Using your sample input:
```json
{
  "case_id": "shakespeare-hamlet-short",
  "generation": "The central theme of Hamlet is mortality and the paralysis that arises from contemplating it. Through the famous 'To be or not to be' soliloquy and repeated encounters with death — the Ghost, Yorick's skull, Ophelia's drowning — Shakespeare explores how consciousness of death impedes decisive action. Hamlet's indecision stems not from cowardice but from his philosophical nature: he cannot act without questioning the meaning and consequences of every action."
}
```

Fields used by the claims branch:
- `generation`: required; this is chunked and then converted to claims
- `case_id`: used for case identity/reporting, not claim extraction logic
Direct node-level input to `ClaimExtractorNode.run(...)` is:
- `chunks: list[Chunk]` (produced by the upstream `chunk` node from `generation`)
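If you drive the node programmatically rather than through the CLI, the call shape described above looks roughly like the following. The import paths, `Chunk` construction, and the no-argument constructor are assumptions for illustration; check the nexa-gauge source for the real signatures:

```python
# Hypothetical import paths; adjust to the actual package layout.
from nexa_gauge.nodes import ClaimExtractorNode
from nexa_gauge.types import Chunk

# Assumed Chunk constructor; in practice these come from the upstream
# chunk node, not manual construction.
chunks: list[Chunk] = [
    Chunk(text="The central theme of Hamlet is mortality ..."),
    Chunk(text="Hamlet's indecision stems not from cowardice ..."),
]

node = ClaimExtractorNode()   # assumed no-arg construction
artifacts = node.run(chunks)  # returns ClaimArtifacts(claims=[...], cost=...)
for claim in artifacts.claims:
    print(claim.source_chunk_index, claim.confidence, claim.item.text)
```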
Fields not required for claims:
- `question`, `context`, `reference`
### Output
Primary output type: `ClaimArtifacts`
- `claims: list[Claim]`
- `cost: CostEstimate`
Example output:
```json
{
"claims": [
{
"item": {
"id": "9df6db8c5d0c9a41",
"text": "A central theme of Hamlet is mortality and its effect on action.",
"tokens": 15.0,
"confidence": 1.0,
"cached": false
},
"source_chunk_index": 0,
"confidence": 0.91,
"extraction_failed": false
},
{
"item": {
"id": "f0f53c3c0119530d",
"text": "Hamlet's indecision is tied to philosophical reflection rather than simple cowardice.",
"tokens": 16.0,
"confidence": 1.0,
"cached": false
},
"source_chunk_index": 1,
"confidence": 0.88,
"extraction_failed": false
}
],
"cost": {
"cost": 0.00074,
"input_tokens": 240.0,
"output_tokens": 56.0
}
}
```

Attribute meaning:
- `claims`: extracted claim units across all generation chunks
- `claims[].item.id`: auto-generated hash-based ID from claim text
- `claims[].item.text`: normalized atomic claim text
- `claims[].item.tokens`: token count for that claim text
- `claims[].item.confidence`: `Item`-level confidence (default type field)
- `claims[].item.cached`: cache marker field on `Item`
- `claims[].source_chunk_index`: originating chunk index
- `claims[].confidence`: extractor confidence score for that claim (0–1)
- `claims[].extraction_failed`: extraction failure flag (false for valid claims)
- `cost.cost`: total USD cost for all claim extraction calls
- `cost.input_tokens`: summed prompt tokens across chunks
- `cost.output_tokens`: summed completion tokens across chunks
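To make these fields concrete, here is a small, self-contained script that summarizes a `ClaimArtifacts` JSON payload like the one above. The file path is an assumption; point it at wherever your run wrote the artifact:

```python
import json

# Assumed file name; substitute the artifact path your run actually produced.
with open("./out/claims/claim-artifacts.json") as f:
    artifacts = json.load(f)

# Separate usable claims from failed extractions.
ok = [c for c in artifacts["claims"] if not c["extraction_failed"]]
failed = len(artifacts["claims"]) - len(ok)

print(f"claims extracted: {len(ok)} (failed: {failed})")
if ok:
    avg_conf = sum(c["confidence"] for c in ok) / len(ok)
    print(f"avg extractor confidence: {avg_conf:.2f}")

cost = artifacts["cost"]
print(f"cost: ${cost['cost']:.5f} "
      f"({cost['input_tokens']:.0f} in / {cost['output_tokens']:.0f} out tokens)")
```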
## Usage
```bash
OUTPUT_DIR=./out/claims
mkdir -p "$OUTPUT_DIR"
```

### Estimate Cost
```bash
nexagauge estimate claims \
  --input ./sample.json \
  --limit 5 \
  | tee "$OUTPUT_DIR/claims-estimate.txt"
```

Note: `estimate` supports `--input` and `--limit`; it does not provide a native `--output-dir` flag, so `tee` is used to write output into your chosen output directory.
### Run Evaluation
```bash
nexagauge run claims \
  --input ./sample.json \
  --limit 5 \
  --output-dir "$OUTPUT_DIR"
```