Dedup Utility (dedup)

Overview

dedup removes near-duplicate items using Maximal Marginal Relevance (MMR) so downstream evaluation operates on a compact, diverse set of units.

Why this matters in evaluation pipelines:

  • Generated answers often repeat the same fact across adjacent chunks.
  • Repeated claims inflate judge prompts without adding new information.
  • Redundant evaluation steps can make GEval scoring noisy and expensive.
  • Duplicate-heavy inputs reduce interpretability of per-item verdicts.

MMR is a relevance-vs-diversity reranking strategy originally introduced for retrieval and summarization. In nexa-gauge’s dedup utility, two signals drive selection:

  • pairwise semantic similarity between item texts (embedding cosine similarity)
  • a per-item relevance proxy (Item.confidence)

The utility keeps high-value items while filtering candidates that are too similar to already selected ones. This preserves coverage while minimizing repetition.
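The trade-off described above can be sketched as a scoring function. This is a minimal illustration, not the actual mmr.py code; `relevance` and `sim` are assumed precomputed lookups:

```python
# Minimal MMR scoring sketch (illustrative; not the nexa-gauge implementation).
# `relevance` maps item index -> relevance proxy (e.g. Item.confidence);
# `sim` is a precomputed pairwise cosine-similarity lookup.

def mmr_score(candidate, selected, relevance, sim, lam=0.7):
    """lam * relevance - (1 - lam) * max similarity to already selected items."""
    if not selected:
        return lam * relevance[candidate]
    max_sim = max(sim[candidate][s] for s in selected)
    return lam * relevance[candidate] - (1 - lam) * max_sim
```

A higher lam favors relevance; a lower lam favors diversity.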

In practical terms, dedup reduces prompt size and repeated scoring work for branches such as grounding and relevance, and can do the same for GEval step sets if the utility is reused there.

Use Case

Common use cases for dedup:

  • Multiple chunks yield semantically equivalent claims (for example “Paris is the capital of France.” and “France’s capital is Paris.”).
  • Repeated paraphrases from long generations create unnecessary judge calls downstream.
  • User-provided GEval evaluation_steps contain near-duplicate checks for the same metric.
  • Any list of semantically similar Items needs diversity preservation with minimal information loss.
  • Prompt budget reduction by removing redundant units before LLM judging.

Node Overview

In nexa-gauge, DedupNode wraps core MMR deduplication:

  • Core algorithm: nexa_gauge_core/dedup/mmr.py
  • Graph node wrapper: nexa_gauge_graph/nodes/dedup.py

What it does:

  • Input: list[Item]
  • Embeds item text using a local sentence-transformer model (config.EMBEDDING_MODEL)
  • Starts from highest-confidence item
  • Iteratively applies MMR scoring (lambda * relevance - (1 - lambda) * max similarity to already selected items)
  • Drops candidates above similarity threshold to selected items
  • Returns:
    • deduplicated items
    • dedup_map of dropped index -> kept representative index
    • dropped count
    • zero token/cost accounting (CostEstimate is 0.0 in current implementation)
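The steps above can be sketched end to end. Names, threshold handling, and data layout here are assumptions for illustration, not the actual nexa_gauge_core/dedup/mmr.py internals:

```python
# Illustrative greedy MMR dedup loop (an assumed sketch, not mmr.py itself).
# `sim` is a symmetric pairwise similarity matrix over the input items.

def mmr_dedup(texts, confidences, sim, threshold=0.9, lam=0.7):
    """Return (kept indices, dropped index -> representative index, dropped count)."""
    n = len(texts)
    remaining = sorted(range(n), key=lambda i: confidences[i], reverse=True)
    selected = [remaining.pop(0)]            # start from highest-confidence item
    dedup_map = {}
    while remaining:
        # Drop candidates too similar to an already selected item.
        for i in list(remaining):
            nearest = max(selected, key=lambda s: sim[i][s])
            if sim[i][nearest] >= threshold:
                dedup_map[i] = nearest
                remaining.remove(i)
        if not remaining:
            break
        # Otherwise keep the candidate with the best MMR score.
        best = max(
            remaining,
            key=lambda i: lam * confidences[i]
                          - (1 - lam) * max(sim[i][s] for s in selected),
        )
        remaining.remove(best)
        selected.append(best)
    return sorted(selected), dedup_map, len(dedup_map)
```

On the Hamlet example below, the two paraphrased "central theme" claims would collapse to the higher-confidence one, with the dropped index mapped to its representative.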

Current graph wiring:

  • Implemented path: scan -> chunk -> claims -> dedup
  • The same utility can also be reused for other Item lists (for example GEval steps) at application level.

Execution Flow

Graph

  scan -> chunk -> claims -> dedup

Optional reuse path (same utility over a different Item list):

Graph

  item list (for example GEval steps) -> dedup

Input

DedupNode.run(...) consumes list[Item].

Example claim-like items:

```json
[
  {
    "text": "Hamlet's central theme is mortality.",
    "tokens": 7,
    "confidence": 0.94
  },
  {
    "text": "The central theme of Hamlet is mortality.",
    "tokens": 8,
    "confidence": 0.91
  },
  {
    "text": "Hamlet's indecision is driven by philosophical reflection.",
    "tokens": 9,
    "confidence": 0.89
  }
]
```

Example GEval-step-like items (same input type):

```json
[
  {
    "text": "Check whether all key claims are supported by the provided context.",
    "tokens": 13,
    "confidence": 0.9
  },
  {
    "text": "Verify each major claim is grounded in the context.",
    "tokens": 11,
    "confidence": 0.88
  }
]
```

Inputs used by the node:

  • item.text for embedding and similarity
  • item.confidence as relevance signal in MMR ranking
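The similarity half of those inputs reduces to cosine similarity over embedding vectors. A minimal pure-Python sketch (the actual node embeds item.text with whatever model config.EMBEDDING_MODEL names):

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Guard against zero vectors to avoid division by zero.
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```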

Output

Primary output type:

  • DedupArtifacts
    • items: list[Item]
    • dropped: int
    • dedup_map: dict[int, int]
    • cost: CostEstimate

Example output:

```json
{
  "items": [
    {
      "id": "a13f0d6d113a0f0f",
      "text": "Hamlet's central theme is mortality.",
      "tokens": 7.0,
      "confidence": 0.94,
      "cached": false
    },
    {
      "id": "de2dc4a5f0bb9810",
      "text": "Hamlet's indecision is driven by philosophical reflection.",
      "tokens": 9.0,
      "confidence": 0.89,
      "cached": false
    }
  ],
  "dropped": 1,
  "dedup_map": {
    "1": 0
  },
  "cost": {
    "cost": 0.0,
    "input_tokens": 0.0,
    "output_tokens": 0.0
  }
}
```

Attribute meaning:

  • items: deduplicated kept items (representatives)
  • dropped: how many input items were removed as duplicates
  • dedup_map: for each dropped input index, the kept representative index it mapped to
  • cost.cost: node cost in USD (currently zero)
  • cost.input_tokens, cost.output_tokens: currently zero (no LLM token metering in this node)

Note: JSON object keys are always strings, so dedup_map keys appear as "1" even though the logical type is int.
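When consuming the artifact from JSON, one way to restore integer keys (a small convenience sketch, not part of the library):

```python
import json

payload = json.loads('{"dedup_map": {"1": 0}}')
# JSON forces object keys to strings; coerce them back to int after parsing.
dedup_map = {int(k): v for k, v in payload["dedup_map"].items()}
```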

Usage

```bash
OUTPUT_DIR=./out/dedup
mkdir -p "$OUTPUT_DIR"
```

Estimate Cost

```bash
nexagauge estimate dedup \
  --input ./sample.json \
  --limit 5 \
  | tee "$OUTPUT_DIR/dedup-estimate.txt"
```

Note: estimate does not expose a native --output-dir flag; use redirection/tee to save output.

Run Evaluation

```bash
nexagauge run dedup \
  --input ./sample.json \
  --limit 5 \
  --output-dir "$OUTPUT_DIR"
```