Semchunk (semchunk)

Overview

semchunk splits generated text into bounded, token-aware chunks while trying to keep each chunk semantically coherent.

This matters because downstream evaluation nodes should not have to reason over one long, unstructured generation. Chunking creates a stable intermediate representation: each chunk is small enough for extraction and judging, but still large enough to preserve useful local context.

The upstream Semchunk library is designed for fast semantic chunking. Its chunkerify() API can build a chunker from OpenAI model names, tiktoken encodings, Hugging Face tokenizers, tokenizer objects, or a custom token-counting function. nexa-gauge uses the custom token counter path so chunk sizes match the same internal token accounting used by scan and cost estimation.
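
As a minimal sketch of that custom token counter path, the snippet below builds a chunker with semchunk.chunkerify(). The whitespace-based _count_tokens here is a hypothetical stand-in for nexa-gauge's shared token counter, which is not shown in this document:

python
import semchunk

# Hypothetical stand-in for nexa-gauge's shared token counter;
# the real counter's tokenization will differ.
def _count_tokens(text: str) -> int:
    return len(text.split())

# chunkerify() accepts a custom token-counting function plus a chunk size
# and returns a reusable chunker callable.
chunker = semchunk.chunkerify(_count_tokens, chunk_size=100)

chunks = chunker("A long generation to split into bounded chunks...")
# chunks is a list[str]; each element fits within 100 tokens as measured
# by _count_tokens.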

Semchunk's algorithm recursively splits text until chunks fit the requested token size, preferring higher-level separators first. In practical terms, it tries paragraph/newline boundaries before whitespace, sentence punctuation, clause punctuation, word joiners, and finally individual characters. It then merges undersized pieces back together where possible, which gives better chunk shape than fixed-width character splitting.
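
A small illustration of that separator preference follows. Exact boundaries depend on the semchunk version and the token counter, but with a whitespace counter and a deliberately small chunk size, the paragraph break should win over a mid-sentence split:

python
import semchunk

# Illustrative only: a whitespace token counter and a small chunk size,
# so the split behavior is easy to see.
chunker = semchunk.chunkerify(lambda t: len(t.split()), chunk_size=8)

text = "One two three four five six.\n\nSeven eight nine ten eleven twelve."
print(chunker(text))
# Expected shape: two chunks split at the blank line, e.g.
# ['One two three four five six.', 'Seven eight nine ten eleven twelve.']
# rather than a fixed-width split in the middle of a sentence.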

In nexa-gauge, semchunk is not a scoring metric. It is a zero-cost utility node that prepares generation text for later nodes such as claims, refiner, grounding, and relevance.

Use Case

Use semchunk when you need generation text split into reliable evaluation units:

  • Long-form generations that exceed a convenient extraction window
  • Claim extraction pipelines that need one bounded text span at a time
  • Grounding and relevance runs where verdicts should trace back to source spans
  • Debugging or reporting workflows that need chunk indexes and character offsets
  • Cost estimation paths where chunk count drives downstream LLM-call estimates

Node Overview (nexa-gauge)

In nexa-gauge, semchunk is the implementation behind the chunk utility node on this branch:

generation -> scan -> semchunk

What it does:

  • Receives normalized generation text from scanner inputs
  • Uses the configured chunker strategy; currently only semchunk is supported
  • Counts generation tokens with nexa-gauge's shared token counter
  • Returns one unchanged chunk when the generation is below the split threshold
  • For longer generations, builds a chunker via semchunk.chunkerify(_count_tokens, chunk_size) and applies it to the generation text
  • Emits Chunk records with:
    • sequential chunk index
    • chunk text
    • token count
    • source character start/end offsets
    • SHA-256 hash of the chunk text
  • Reports zero model cost because no LLM call is made

Relevant constants (used in the sketch below):

  • GENERATION_CHUNK_SIZE_TOKENS = 100
  • CHUNK_MIN_TOKENS_FOR_SPLIT = 100
  • DEFAULT_CHUNKER_STRATEGY = "semchunk"
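
Taken together, the behavior above suggests roughly the following control flow. This is a hypothetical reconstruction, not the actual nexa-gauge source: count_tokens stands in for the shared token counter, and offset recovery assumes each chunk is a contiguous substring of the generation:

python
import hashlib
import semchunk

GENERATION_CHUNK_SIZE_TOKENS = 100
CHUNK_MIN_TOKENS_FOR_SPLIT = 100

def chunk_generation(generation: str, count_tokens) -> list[dict]:
    # Below the split threshold: return the generation as one unchanged chunk.
    if count_tokens(generation) <= CHUNK_MIN_TOKENS_FOR_SPLIT:
        texts = [generation]
    else:
        # Build a semchunk chunker bound to the shared token counter.
        chunker = semchunk.chunkerify(count_tokens, GENERATION_CHUNK_SIZE_TOKENS)
        texts = chunker(generation)

    records, cursor = [], 0
    for index, text in enumerate(texts):
        # Assumes each chunk is a contiguous substring of the source,
        # so character offsets can be recovered by searching forward.
        start = generation.index(text, cursor)
        end = start + len(text)
        cursor = end
        records.append({
            "index": index,
            "text": text,
            "tokens": count_tokens(text),
            "char_start": start,
            "char_end": end,
            "sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        })
    return records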

Skip behavior (sketched below):

  • If generation is unavailable, the graph treats chunk as ineligible and emits the configured empty utility artifact.
  • If the selected chunker is not semchunk, the graph raises an unsupported-strategy error.
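
A hypothetical sketch of that guard logic, reusing chunk_generation from the sketch above; the empty-artifact stand-in and error type are illustrative names, not taken from the nexa-gauge source:

python
# Hypothetical guard logic; names are illustrative.
def run_chunk_node(item, count_tokens, strategy: str = "semchunk"):
    if not getattr(item, "generation", None):
        return []  # stand-in for the configured empty utility artifact
    if strategy != "semchunk":
        raise ValueError(f"unsupported chunker strategy: {strategy}")
    return chunk_generation(item.generation, count_tokens)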

Execution Flow

Graph
generation -> scan -> semchunk

Input

Using your sample input:

json
{
  "case_id": "hamlet-long-answer",
  "question": "What is the central theme of Hamlet?",
  "generation": "The central theme of Hamlet is mortality and the paralysis that arises from contemplating it. Through the famous 'To be or not to be' soliloquy and repeated encounters with death, Shakespeare explores how consciousness of death impedes decisive action. Hamlet's hesitation is shaped by grief, uncertainty, philosophy, and the political danger around him.",
  "context": "Hamlet is a tragedy by William Shakespeare...",
  "reference": "Hamlet explores death, uncertainty, revenge, and the difficulty of action."
}

Fields used by the chunk branch:

  • generation: used as the source text to split
  • case_id: used for case identity/reporting, not chunking logic

Fields not used by semchunk:

  • question: not used by chunk (used by relevance)
  • context: not used by chunk (used by grounding)
  • reference: not used by chunk (used by reference)

Direct node signature in code: run(item: Item) -> ChunkArtifacts.

Output

Primary output type (sketched in Python below):

  • ChunkArtifacts
    • chunks: list[Chunk]
    • cost: CostEstimate
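
A hypothetical Python rendering of these types, inferred from the JSON example that follows; the real nexa-gauge models may differ:

python
from dataclasses import dataclass

# Hypothetical dataclasses inferred from the JSON example below.
@dataclass
class Item:
    id: str            # hash-derived ID (prefix of the chunk's SHA-256 in the example)
    text: str
    tokens: float
    confidence: float  # defaults to 1.0
    cached: bool

@dataclass
class Chunk:
    index: int         # zero-based chunk position
    item: Item
    char_start: int    # offsets into the original generation
    char_end: int
    sha256: str        # full digest of the chunk text

@dataclass
class CostEstimate:
    cost: float        # always 0 for this node
    input_tokens: int
    output_tokens: int

@dataclass
class ChunkArtifacts:
    chunks: list[Chunk]
    cost: CostEstimate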

Example output:

json
{
  "chunks": [
    {
      "index": 0,
      "item": {
        "id": "bfe5e7f03ba02203",
        "text": "The central theme of Hamlet is mortality and the paralysis that arises from contemplating it. Through the famous 'To be or not to be' soliloquy and repeated encounters with death, Shakespeare explores how consciousness of death impedes decisive action.",
        "tokens": 45.0,
        "confidence": 1.0,
        "cached": false
      },
      "char_start": 0,
      "char_end": 252,
      "sha256": "bfe5e7f03ba02203d7949a1b5d6ee92b9ff6b46a7b3ecdc77b0a2a218ca3da78"
    },
    {
      "index": 1,
      "item": {
        "id": "a7f10228b5420eda",
        "text": "Hamlet's hesitation is shaped by grief, uncertainty, philosophy, and the political danger around him.",
        "tokens": 17.0,
        "confidence": 1.0,
        "cached": false
      },
      "char_start": 253,
      "char_end": 354,
      "sha256": "a7f10228b5420edaeb4f51fa53ef4b54ca209f200b668dd1d08f7ed732c6e3df"
    }
  ],
  "cost": {
    "cost": 0,
    "input_tokens": 0,
    "output_tokens": 0
  }
}

Attribute meaning:

  • chunks: ordered spans produced from the generation
  • chunks[].index: zero-based chunk position
  • chunks[].item.id: hash-derived ID for the chunk text (in the example above, the first 16 hex characters of the chunk's SHA-256 digest)
  • chunks[].item.text: chunk text passed to downstream nodes
  • chunks[].item.tokens: token count for the chunk text
  • chunks[].item.confidence: default Item confidence field
  • chunks[].item.cached: cache marker field on Item
  • chunks[].char_start: start offset in the original generation
  • chunks[].char_end: end offset in the original generation
  • chunks[].sha256: full SHA-256 digest of the chunk text
  • cost.cost: always 0 for this node because no external LLM call is made
  • cost.input_tokens, cost.output_tokens: always 0 because chunking is local

Usage

bash
OUTPUT_DIR=./out/semchunk
mkdir -p "$OUTPUT_DIR"

Estimate Cost

bash
nexagauge estimate chunk \
  --input ./sample.json \
  --limit 5 

Note: the chunk estimate itself is zero-cost. The useful output is the chunk artifact and how the resulting chunk count drives downstream LLM-call estimates.

Run Utility

bash
nexagauge run chunk \
  --input ./sample.json \
  --limit 5 \
  --output-dir "$OUTPUT_DIR" \
  --chunker semchunk

For a full evaluation run that uses Semchunk before refinement, claim extraction, and metrics:

bash
nexagauge run eval \
  --input ./sample.json \
  --limit 5 \
  --output-dir "$OUTPUT_DIR" \
  --chunker semchunk