Dedup Utility (dedup)
Overview
dedup removes near-duplicate items using Maximal Marginal Relevance (MMR) so downstream evaluation operates on a compact, diverse set of units.
Why this matters in evaluation pipelines:
- Generated answers often repeat the same fact across adjacent chunks.
- Repeated claims inflate judge prompts without adding new information.
- Redundant evaluation steps can make GEval scoring noisy and expensive.
- Duplicate-heavy inputs reduce interpretability of per-item verdicts.
MMR is a relevance-versus-diversity reranking strategy originally introduced for retrieval and summarization. In nexa-gauge’s dedup utility, two signals drive the ranking:
- semantic similarity between items (cosine similarity of their embeddings)
- a relevance proxy (Item.confidence)
The utility keeps high-value items while filtering candidates that are too similar to already selected ones. This preserves coverage while minimizing repetition.
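As a rough illustration of the similarity signal described above, the sketch below computes cosine similarity between two vectors in plain Python. The function name and the toy vectors are assumptions made for this example; real embeddings would come from the configured sentence-transformer model and have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (hypothetical helper)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; treat it as dissimilar to everything.
        return 0.0
    return dot / (norm_a * norm_b)


# Toy 2-d "embeddings": two near-paraphrases and one unrelated item.
paraphrase_a = [1.0, 0.0]
paraphrase_b = [0.99, 0.1]
unrelated = [0.0, 1.0]

near_duplicate_score = cosine_similarity(paraphrase_a, paraphrase_b)  # close to 1
unrelated_score = cosine_similarity(paraphrase_a, unrelated)          # exactly 0
```

Values near 1 indicate near-duplicates; a dedup threshold decides how close is too close.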
Core references:
- Carbonell & Goldstein (SIGIR 1998), original MMR formulation and objective balancing relevance with novelty:
  https://www.cs.cmu.edu/afs/.cs.cmu.edu/Web/People/jgc/publication/MMR_DiversityBased_Reranking_SIGIR_1998.pdf
- Goldstein & Carbonell (TIPSTER 1998), MMR for summarization reranking and evaluation context:
  https://aclanthology.org/X98-1025/
- Goldstein et al. (NAACL Workshop 2000), multi-document sentence extraction with redundancy control patterns:
  https://aclanthology.org/W00-0405/
- Bennani-Smires et al. (2018), embedding-based MMR diversification in modern text selection settings:
  https://arxiv.org/abs/1801.04470
In practical terms, dedup helps reduce prompt size and repeated scoring work for branches such as grounding, relevance, and potentially GEval step sets if reused there.
Use Case
Common use cases for dedup:
- Multiple chunks yield semantically equivalent claims (for example “Paris is the capital of France.” and “France’s capital is Paris.”).
- Repeated paraphrases from long generations create unnecessary judge calls downstream.
- User-provided GEval evaluation_steps contain near-duplicate checks for the same metric.
- Any list of semantically similar Items needs diversity preservation with minimal information loss.
- Prompt budget reduction by removing redundant units before LLM judging.
Node Overview
In nexa-gauge, DedupNode wraps core MMR deduplication:
- Core algorithm: nexa_gauge_core/dedup/mmr.py
- Graph node wrapper: nexa_gauge_graph/nodes/dedup.py
What it does:
- Input: list[Item]
- Embeds item text using a local sentence-transformer model (config.EMBEDDING_MODEL)
- Starts from the highest-confidence item
- Iteratively applies MMR scoring (lambda * relevance - (1-lambda) * max_similarity)
- Drops candidates whose similarity to an already selected item is above the threshold
- Returns:
  - deduplicated items
  - dedup_map of dropped index -> kept representative index
  - dropped count
  - zero token/cost accounting (CostEstimate is 0.0 in current implementation)
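The steps above can be sketched as a small greedy loop. This is a minimal illustration under stated assumptions, not the code in nexa_gauge_core/dedup/mmr.py: the function name, the precomputed similarity matrix, the default lam and threshold values, and the choice to express dedup_map in input indices are all hypothetical.

```python
def mmr_dedup(confidences, sim, lam=0.5, threshold=0.9):
    """Greedy MMR-style dedup over a precomputed pairwise similarity matrix.

    confidences: relevance proxy per item (assumed to mirror Item.confidence)
    sim: sim[i][j] is the cosine similarity between items i and j
    lam, threshold: illustrative defaults, not the project's settings
    Returns (selected input indices, {dropped index: kept index}).
    """
    remaining = set(range(len(confidences)))
    selected, dedup_map = [], {}
    if not remaining:
        return selected, dedup_map

    # Start from the highest-confidence item.
    first = max(sorted(remaining), key=lambda i: confidences[i])
    selected.append(first)
    remaining.discard(first)

    while remaining:
        # Drop candidates that are too similar to an already selected item.
        for i in sorted(remaining):
            rep = max(selected, key=lambda j: sim[i][j])
            if sim[i][rep] > threshold:
                dedup_map[i] = rep
                remaining.discard(i)
        if not remaining:
            break

        # MMR score: lambda * relevance - (1 - lambda) * max_similarity.
        def score(i):
            return lam * confidences[i] - (1 - lam) * max(sim[i][j] for j in selected)

        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.discard(best)

    return selected, dedup_map


# Toy data: items 0 and 1 are near-paraphrases, item 2 is distinct.
confidences = [0.94, 0.91, 0.89]
sim = [
    [1.0, 0.99, 0.0],
    [0.99, 1.0, 0.1],
    [0.0, 0.1, 1.0],
]
selected, dedup_map = mmr_dedup(confidences, sim)
```

With this toy data, item 1 is dropped in favor of item 0 and item 2 survives. A higher lam favors confidence; a lower lam favors novelty relative to what is already selected.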
Current graph wiring:
- Implemented path: scan -> chunk -> claims -> dedup
- The same utility can also be reused for other Item lists (for example GEval steps) at application level.
Execution Flow
Optional reuse path (same utility over a different Item list):
Input
DedupNode.run(...) consumes list[Item].
Example claim-like items:
[
{
"text": "Hamlet's central theme is mortality.",
"tokens": 7,
"confidence": 0.94
},
{
"text": "The central theme of Hamlet is mortality.",
"tokens": 8,
"confidence": 0.91
},
{
"text": "Hamlet's indecision is driven by philosophical reflection.",
"tokens": 9,
"confidence": 0.89
}
]
Example GEval-step-like items (same input type):
[
{
"text": "Check whether all key claims are supported by the provided context.",
"tokens": 13,
"confidence": 0.9
},
{
"text": "Verify each major claim is grounded in the context.",
"tokens": 11,
"confidence": 0.88
}
]
Inputs used by the node:
- item.text for embedding and similarity
- item.confidence as relevance signal in MMR ranking
Output
Primary output type:
- DedupArtifacts
  - items: list[Item]
  - dropped: int
  - dedup_map: dict[int, int]
  - cost: CostEstimate
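For readers who prefer to see the shape as types, the following is a hypothetical mirror of these fields as Python dataclasses. The field names are taken from this page; the defaults and the exact class definitions are assumptions and may differ from the real nexa-gauge models.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Item:
    # Fields as they appear in the examples on this page; defaults are assumed.
    text: str
    tokens: float
    confidence: float
    id: str = ""
    cached: bool = False


@dataclass
class CostEstimate:
    # All zero for dedup, since the node makes no LLM calls.
    cost: float = 0.0
    input_tokens: float = 0.0
    output_tokens: float = 0.0


@dataclass
class DedupArtifacts:
    items: list[Item]
    dropped: int
    dedup_map: dict[int, int]
    cost: CostEstimate = field(default_factory=CostEstimate)


# Build a result by hand to show how the pieces fit together.
kept = Item(text="Hamlet's central theme is mortality.", tokens=7, confidence=0.94)
artifacts = DedupArtifacts(items=[kept], dropped=1, dedup_map={1: 0})
```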
Example output:
{
"items": [
{
"id": "a13f0d6d113a0f0f",
"text": "Hamlet's central theme is mortality.",
"tokens": 7.0,
"confidence": 0.94,
"cached": false
},
{
"id": "de2dc4a5f0bb9810",
"text": "Hamlet's indecision is driven by philosophical reflection.",
"tokens": 9.0,
"confidence": 0.89,
"cached": false
}
],
"dropped": 1,
"dedup_map": {
"1": 0
},
"cost": {
"cost": 0.0,
"input_tokens": 0.0,
"output_tokens": 0.0
}
}
Attribute meaning:
- items: deduplicated kept items (representatives)
- dropped: how many input items were removed as duplicates
- dedup_map: for each dropped input index, the kept representative index it mapped to
- cost.cost: node cost in USD (currently zero)
- cost.input_tokens, cost.output_tokens: currently zero (no LLM token metering in this node)
Note: JSON object keys are strings, so dedup_map keys may appear as "1" even though logical type is int.
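When reading saved output back into Python, the string keys can be converted to integers explicitly. The snippet below is a small sketch using only the standard library; the raw payload is an abbreviated, hand-written stand-in for a real output file.

```python
import json

# Abbreviated stand-in for a saved dedup result (hypothetical payload).
raw = '{"dropped": 1, "dedup_map": {"1": 0}}'

artifacts = json.loads(raw)

# json.loads always yields string keys for objects; restore the logical int keys.
dedup_map = {int(dropped): kept for dropped, kept in artifacts["dedup_map"].items()}
```

After conversion, lookups such as dedup_map[1] work with integer indices, matching the logical dict[int, int] type.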
Usage
OUTPUT_DIR=./out/dedup
mkdir -p "$OUTPUT_DIR"
Estimate Cost
nexagauge estimate dedup \
--input ./sample.json \
--limit 5 \
| tee "$OUTPUT_DIR/dedup-estimate.txt"
Note: estimate does not expose a native --output-dir flag; use redirection/tee to save output.
Run Evaluation
nexagauge run dedup \
--input ./sample.json \
--limit 5 \
--output-dir "$OUTPUT_DIR"