Semchunk (semchunk)

Overview

semchunk splits generated text into bounded, token-aware chunks while trying to keep each chunk semantically coherent.

This matters because downstream evaluation nodes should not have to reason over one long, unstructured generation. Chunking creates a stable intermediate representation: each chunk is small enough for extraction and judging, but still large enough to preserve useful local context.

The upstream Semchunk library is designed for fast semantic chunking. Its chunkerify() API can build a chunker from OpenAI model names, tiktoken encodings, Hugging Face tokenizers, tokenizer objects, or a custom token-counting function. nexa-gauge uses the custom token counter path so chunk sizes match the same internal token accounting used by scan and cost estimation.
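
As a minimal sketch of that custom token counter path, the snippet below builds a chunker with semchunk.chunkerify(). The whitespace-based _count_tokens here is a hypothetical stand-in for nexa-gauge's shared token counter, which is not shown in this document:

python
import semchunk

# Hypothetical stand-in for nexa-gauge's shared token counter;
# the real counter's tokenization will differ.
def _count_tokens(text: str) -> int:
    return len(text.split())

# chunkerify() accepts a custom token-counting function plus a chunk size
# and returns a reusable chunker callable.
chunker = semchunk.chunkerify(_count_tokens, chunk_size=100)

chunks = chunker("A long generation to split into bounded chunks...")
# chunks is a list[str]; each element fits within 100 tokens as measured
# by _count_tokens.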

Semchunk's algorithm recursively splits text until chunks fit the requested token size, preferring higher-level separators first. In practical terms, it tries paragraph/newline boundaries before whitespace, sentence punctuation, clause punctuation, word joiners, and finally individual characters. It then merges undersized pieces back together where possible, which gives better chunk shape than fixed-width character splitting.
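
A small illustration of that separator preference follows. Exact boundaries depend on the semchunk version and the token counter, but with a whitespace counter and a deliberately small chunk size, the paragraph break should win over a mid-sentence split:

python
import semchunk

# Illustrative only: a whitespace token counter and a small chunk size,
# so the split behavior is easy to see.
chunker = semchunk.chunkerify(lambda t: len(t.split()), chunk_size=8)

text = "One two three four five six.\n\nSeven eight nine ten eleven twelve."
print(chunker(text))
# Expected shape: two chunks split at the blank line, e.g.
# ['One two three four five six.', 'Seven eight nine ten eleven twelve.']
# rather than a fixed-width split in the middle of a sentence.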

In nexa-gauge, semchunk is not a scoring metric. It is a zero-cost utility node that prepares generation text for later nodes such as claims, refiner, grounding, and relevance.

Use Case

Use semchunk when you need generation text split into reliable evaluation units:

  • Long-form generations that exceed a convenient extraction window
  • Claim extraction pipelines that need one bounded text span at a time
  • Grounding and relevance runs where verdicts should trace back to source spans
  • Debugging or reporting workflows that need chunk indexes and character offsets
  • Cost estimation paths where chunk count drives downstream LLM-call estimates

Node Overview (nexa-gauge)

In nexa-gauge, semchunk is the implementation behind the chunk utility node on this branch:

generation -> scan -> semchunk

What it does:

  • Receives normalized generation text from scanner inputs
  • Uses the configured chunker strategy; currently only semchunk is supported
  • Counts generation tokens with nexa-gauge's shared token counter
  • Returns one unchanged chunk when the generation is below the split threshold
  • For longer generations, builds a chunker via semchunk.chunkerify(_count_tokens, chunk_size) and applies it to the generation text
  • Emits Chunk records with:
    • sequential chunk index
    • chunk text
    • token count
    • source character start/end offsets
    • SHA-256 hash of the chunk text
  • Reports zero model cost because no LLM call is made

Relevant constants (used in the sketch below):

  • GENERATION_CHUNK_SIZE_TOKENS = 100
  • CHUNK_MIN_TOKENS_FOR_SPLIT = 100
  • DEFAULT_CHUNKER_STRATEGY = "semchunk"
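
Taken together, the behavior above suggests roughly the following control flow. This is a hypothetical reconstruction, not the actual nexa-gauge source: count_tokens stands in for the shared token counter, and offset recovery assumes each chunk is a contiguous substring of the generation:

python
import hashlib
import semchunk

GENERATION_CHUNK_SIZE_TOKENS = 100
CHUNK_MIN_TOKENS_FOR_SPLIT = 100

def chunk_generation(generation: str, count_tokens) -> list[dict]:
    # Below the split threshold: return the generation as one unchanged chunk.
    if count_tokens(generation) <= CHUNK_MIN_TOKENS_FOR_SPLIT:
        texts = [generation]
    else:
        # Build a semchunk chunker bound to the shared token counter.
        chunker = semchunk.chunkerify(count_tokens, GENERATION_CHUNK_SIZE_TOKENS)
        texts = chunker(generation)

    records, cursor = [], 0
    for index, text in enumerate(texts):
        # Assumes each chunk is a contiguous substring of the source,
        # so character offsets can be recovered by searching forward.
        start = generation.index(text, cursor)
        end = start + len(text)
        cursor = end
        records.append({
            "index": index,
            "text": text,
            "tokens": count_tokens(text),
            "char_start": start,
            "char_end": end,
            "sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        })
    return records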

Skip behavior (sketched below):

  • If generation is unavailable, the graph treats chunk as ineligible and emits the configured empty utility artifact.
  • If the selected chunker is not semchunk, the graph raises an unsupported-strategy error.
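
A hypothetical sketch of that guard logic, reusing chunk_generation from the sketch above; the empty-artifact stand-in and error type are illustrative names, not taken from the nexa-gauge source:

python
# Hypothetical guard logic; names are illustrative.
def run_chunk_node(item, count_tokens, strategy: str = "semchunk"):
    if not getattr(item, "generation", None):
        return []  # stand-in for the configured empty utility artifact
    if strategy != "semchunk":
        raise ValueError(f"unsupported chunker strategy: {strategy}")
    return chunk_generation(item.generation, count_tokens)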

Execution Flow

Graph
generation -> scan -> semchunk

Input

Using your sample input:

json
{
  "case_id": "hamlet-long-answer",
  "question": "What is the central theme of Hamlet?",
  "generation": "The central theme of Hamlet is mortality and the paralysis that arises from contemplating it. Through the famous 'To be or not to be' soliloquy and repeated encounters with death, Shakespeare explores how consciousness of death impedes decisive action. Hamlet's hesitation is shaped by grief, uncertainty, philosophy, and the political danger around him.",
  "context": "Hamlet is a tragedy by William Shakespeare...",
  "reference": "Hamlet explores death, uncertainty, revenge, and the difficulty of action."
}

Fields used by the chunk branch:

  • generation: used as the source text to split
  • case_id: used for case identity/reporting, not chunking logic

Fields not used by semchunk:

  • question: not used by chunk (used by relevance)
  • context: not used by chunk (used by grounding)
  • reference: not used by chunk (used by reference)

Direct node signature in code: run(item: Item) -> ChunkArtifacts.

Output

Primary output type (sketched in Python below):

  • ChunkArtifacts
    • chunks: list[Chunk]
    • cost: CostEstimate
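
A hypothetical Python rendering of these types, inferred from the JSON example that follows; the real nexa-gauge models may differ:

python
from dataclasses import dataclass

# Hypothetical dataclasses inferred from the JSON example below.
@dataclass
class Item:
    id: str            # hash-derived ID (prefix of the chunk's SHA-256 in the example)
    text: str
    tokens: float
    confidence: float  # defaults to 1.0
    cached: bool

@dataclass
class Chunk:
    index: int         # zero-based chunk position
    item: Item
    char_start: int    # offsets into the original generation
    char_end: int
    sha256: str        # full digest of the chunk text

@dataclass
class CostEstimate:
    cost: float        # always 0 for this node
    input_tokens: int
    output_tokens: int

@dataclass
class ChunkArtifacts:
    chunks: list[Chunk]
    cost: CostEstimate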

Example output:

json
{
  "chunks": [
    {
      "index": 0,
      "item": {
        "id": "bfe5e7f03ba02203",
        "text": "The central theme of Hamlet is mortality and the paralysis that arises from contemplating it. Through the famous 'To be or not to be' soliloquy and repeated encounters with death, Shakespeare explores how consciousness of death impedes decisive action.",
        "tokens": 45.0,
        "confidence": 1.0,
        "cached": false
      },
      "char_start": 0,
      "char_end": 252,
      "sha256": "bfe5e7f03ba02203d7949a1b5d6ee92b9ff6b46a7b3ecdc77b0a2a218ca3da78"
    },
    {
      "index": 1,
      "item": {
        "id": "a7f10228b5420eda",
        "text": "Hamlet's hesitation is shaped by grief, uncertainty, philosophy, and the political danger around him.",
        "tokens": 17.0,
        "confidence": 1.0,
        "cached": false
      },
      "char_start": 253,
      "char_end": 354,
      "sha256": "a7f10228b5420edaeb4f51fa53ef4b54ca209f200b668dd1d08f7ed732c6e3df"
    }
  ],
  "cost": {
    "cost": 0,
    "input_tokens": 0,
    "output_tokens": 0
  }
}

Attribute meaning:

  • chunks: ordered spans produced from the generation
  • chunks[].index: zero-based chunk position
  • chunks[].item.id: hash-derived ID for the chunk text (in the example above, the first 16 hex characters of the chunk's SHA-256 digest)
  • chunks[].item.text: chunk text passed to downstream nodes
  • chunks[].item.tokens: token count for the chunk text
  • chunks[].item.confidence: default Item confidence field
  • chunks[].item.cached: cache marker field on Item
  • chunks[].char_start: start offset in the original generation
  • chunks[].char_end: end offset in the original generation
  • chunks[].sha256: full SHA-256 digest of the chunk text
  • cost.cost: always 0 for this node because no external LLM call is made
  • cost.input_tokens, cost.output_tokens: always 0 because chunking is local

Usage

bash
OUTPUT_DIR=./out/semchunk
mkdir -p "$OUTPUT_DIR"

Estimate Cost

bash
nexagauge estimate chunk \
  --input ./sample.json \
  --limit 5 

Note: the chunk estimate itself is zero-cost. The useful output is the chunk artifact and how the resulting chunk count drives downstream LLM-call estimates.

Run Utility

bash
nexagauge run chunk \
  --input ./sample.json \
  --limit 5 \
  --output-dir "$OUTPUT_DIR" \
  --chunker semchunk

For a full evaluation run that uses Semchunk before refinement, claim extraction, and metrics:

bash
nexagauge run eval \
  --input ./sample.json \
  --limit 5 \
  --output-dir "$OUTPUT_DIR" \
  --chunker semchunk