Dedup Utility (dedup)

Overview

dedup removes near-duplicate items using Maximal Marginal Relevance (MMR) so downstream evaluation operates on a compact, diverse set of units.

Why this matters in evaluation pipelines:

  • Generated answers often repeat the same fact across adjacent chunks.
  • Repeated claims inflate judge prompts without adding new information.
  • Redundant evaluation steps can make GEval scoring noisy and expensive.
  • Duplicate-heavy inputs reduce interpretability of per-item verdicts.

MMR is a relevance-vs-diversity reranking strategy originally introduced for retrieval and summarization. In nexa-gauge’s dedup utility, two signals drive selection:

  • pairwise semantic similarity between item texts (embedding cosine similarity)
  • a per-item relevance proxy (Item.confidence)

The utility keeps high-value items while filtering candidates that are too similar to already selected ones. This preserves coverage while minimizing repetition.
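The trade-off described above can be sketched as a scoring function. This is a minimal illustration, not the actual mmr.py code; `relevance` and `sim` are assumed precomputed lookups:

```python
# Minimal MMR scoring sketch (illustrative; not the nexa-gauge implementation).
# `relevance` maps item index -> relevance proxy (e.g. Item.confidence);
# `sim` is a precomputed pairwise cosine-similarity lookup.

def mmr_score(candidate, selected, relevance, sim, lam=0.7):
    """lam * relevance - (1 - lam) * max similarity to already selected items."""
    if not selected:
        return lam * relevance[candidate]
    max_sim = max(sim[candidate][s] for s in selected)
    return lam * relevance[candidate] - (1 - lam) * max_sim
```

A higher lam favors relevance; a lower lam favors diversity.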

In practical terms, dedup reduces prompt size and repeated scoring work for branches such as grounding and relevance, and can do the same for GEval step sets if the utility is reused there.

Use Case

Common use cases for dedup:

  • Multiple chunks yield semantically equivalent claims (for example “Paris is the capital of France.” and “France’s capital is Paris.”).
  • Repeated paraphrases from long generations create unnecessary judge calls downstream.
  • User-provided GEval evaluation_steps contain near-duplicate checks for the same metric.
  • Any list of semantically similar Items needs diversity preservation with minimal information loss.
  • Prompt budget reduction by removing redundant units before LLM judging.

Node Overview

In nexa-gauge, DedupNode wraps core MMR deduplication:

  • Core algorithm: nexa_gauge_core/dedup/mmr.py
  • Graph node wrapper: nexa_gauge_graph/nodes/dedup.py

What it does:

  • Input: list[Item]
  • Embeds item text using a local sentence-transformer model (config.EMBEDDING_MODEL)
  • Starts from highest-confidence item
  • Iteratively applies MMR scoring (lambda * relevance - (1 - lambda) * max similarity to already selected items)
  • Drops candidates above similarity threshold to selected items
  • Returns:
    • deduplicated items
    • dedup_map of dropped index -> kept representative index
    • dropped count
    • zero token/cost accounting (CostEstimate is 0.0 in current implementation)
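The steps above can be sketched end to end. Names, threshold handling, and data layout here are assumptions for illustration, not the actual nexa_gauge_core/dedup/mmr.py internals:

```python
# Illustrative greedy MMR dedup loop (an assumed sketch, not mmr.py itself).
# `sim` is a symmetric pairwise similarity matrix over the input items.

def mmr_dedup(texts, confidences, sim, threshold=0.9, lam=0.7):
    """Return (kept indices, dropped index -> representative index, dropped count)."""
    n = len(texts)
    remaining = sorted(range(n), key=lambda i: confidences[i], reverse=True)
    selected = [remaining.pop(0)]            # start from highest-confidence item
    dedup_map = {}
    while remaining:
        # Drop candidates too similar to an already selected item.
        for i in list(remaining):
            nearest = max(selected, key=lambda s: sim[i][s])
            if sim[i][nearest] >= threshold:
                dedup_map[i] = nearest
                remaining.remove(i)
        if not remaining:
            break
        # Otherwise keep the candidate with the best MMR score.
        best = max(
            remaining,
            key=lambda i: lam * confidences[i]
                          - (1 - lam) * max(sim[i][s] for s in selected),
        )
        remaining.remove(best)
        selected.append(best)
    return sorted(selected), dedup_map, len(dedup_map)
```

On the Hamlet example below, the two paraphrased "central theme" claims would collapse to the higher-confidence one, with the dropped index mapped to its representative.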

Current graph wiring:

  • Implemented path: scan -> chunk -> claims -> dedup
  • The same utility can also be reused for other Item lists (for example GEval steps) at application level.

Execution Flow

Graph

  scan -> chunk -> claims -> dedup

Optional reuse path (same utility over a different Item list):

Graph

  item list (for example GEval steps) -> dedup

Input

DedupNode.run(...) consumes list[Item].

Example claim-like items:

```json
[
  {
    "text": "Hamlet's central theme is mortality.",
    "tokens": 7,
    "confidence": 0.94
  },
  {
    "text": "The central theme of Hamlet is mortality.",
    "tokens": 8,
    "confidence": 0.91
  },
  {
    "text": "Hamlet's indecision is driven by philosophical reflection.",
    "tokens": 9,
    "confidence": 0.89
  }
]
```

Example GEval-step-like items (same input type):

```json
[
  {
    "text": "Check whether all key claims are supported by the provided context.",
    "tokens": 13,
    "confidence": 0.9
  },
  {
    "text": "Verify each major claim is grounded in the context.",
    "tokens": 11,
    "confidence": 0.88
  }
]
```

Inputs used by the node:

  • item.text for embedding and similarity
  • item.confidence as relevance signal in MMR ranking
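The similarity half of those inputs reduces to cosine similarity over embedding vectors. A minimal pure-Python sketch (the actual node embeds item.text with whatever model config.EMBEDDING_MODEL names):

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Guard against zero vectors to avoid division by zero.
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```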

Output

Primary output type:

  • DedupArtifacts
    • items: list[Item]
    • dropped: int
    • dedup_map: dict[int, int]
    • cost: CostEstimate

Example output:

```json
{
  "items": [
    {
      "id": "a13f0d6d113a0f0f",
      "text": "Hamlet's central theme is mortality.",
      "tokens": 7.0,
      "confidence": 0.94,
      "cached": false
    },
    {
      "id": "de2dc4a5f0bb9810",
      "text": "Hamlet's indecision is driven by philosophical reflection.",
      "tokens": 9.0,
      "confidence": 0.89,
      "cached": false
    }
  ],
  "dropped": 1,
  "dedup_map": {
    "1": 0
  },
  "cost": {
    "cost": 0.0,
    "input_tokens": 0.0,
    "output_tokens": 0.0
  }
}
```

Attribute meaning:

  • items: deduplicated kept items (representatives)
  • dropped: how many input items were removed as duplicates
  • dedup_map: for each dropped input index, the kept representative index it mapped to
  • cost.cost: node cost in USD (currently zero)
  • cost.input_tokens, cost.output_tokens: currently zero (no LLM token metering in this node)

Note: JSON object keys are always strings, so dedup_map keys appear as "1" even though the logical type is int.
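When consuming the artifact from JSON, one way to restore integer keys (a small convenience sketch, not part of the library):

```python
import json

payload = json.loads('{"dedup_map": {"1": 0}}')
# JSON forces object keys to strings; coerce them back to int after parsing.
dedup_map = {int(k): v for k, v in payload["dedup_map"].items()}
```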

Usage

```bash
OUTPUT_DIR=./out/dedup
mkdir -p "$OUTPUT_DIR"
```

Estimate Cost

```bash
nexagauge estimate dedup \
  --input ./sample.json \
  --limit 5 \
  | tee "$OUTPUT_DIR/dedup-estimate.txt"
```

Note: estimate does not expose a native --output-dir flag; use redirection/tee to save output.

Run Evaluation

```bash
nexagauge run dedup \
  --input ./sample.json \
  --limit 5 \
  --output-dir "$OUTPUT_DIR"
```