# Semchunk (semchunk)
## Overview
semchunk splits generated text into bounded, token-aware chunks while trying to keep each chunk semantically coherent.
This matters because downstream evaluation nodes should not have to reason over one long, unstructured generation. Chunking creates a stable intermediate representation: each chunk is small enough for extraction and judging, but still large enough to preserve useful local context.
The upstream Semchunk library is designed for fast semantic chunking. Its `chunkerify()` API can build a chunker from OpenAI model names, tiktoken encodings, Hugging Face tokenizers, tokenizer objects, or a custom token-counting function. nexa-gauge uses the custom token counter path so chunk sizes match the same internal token accounting used by scan and cost estimation.
Semchunk's algorithm recursively splits text until chunks fit the requested token size, preferring higher-level separators first. In practical terms, it tries paragraph/newline boundaries before whitespace, sentence punctuation, clause punctuation, word joiners, and finally individual characters. It then merges undersized pieces back together where possible, which gives better chunk shape than fixed-width character splitting.
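The split-then-merge strategy described above can be sketched in plain Python. This is an illustration of the idea, not semchunk's actual implementation: the separator ladder is abbreviated (semchunk falls back all the way to individual characters), the token counter is a whitespace stand-in, and merged pieces are rejoined with a single space.

```python
# Sketch only: recursive splitting on progressively finer separators,
# then merging undersized neighbours back together.
SEPARATORS = ["\n\n", "\n", ". ", ", ", " "]

def count_tokens(text: str) -> int:
    return len(text.split())  # stand-in for a real tokenizer

def split(text: str, max_tokens: int, level: int = 0) -> list[str]:
    if count_tokens(text) <= max_tokens or level >= len(SEPARATORS):
        return [text]
    parts = [p for p in text.split(SEPARATORS[level]) if p]
    if len(parts) <= 1:
        # Separator not present; try the next, finer one.
        return split(text, max_tokens, level + 1)
    pieces: list[str] = []
    for part in parts:
        pieces.extend(split(part, max_tokens, level + 1))
    # Merge undersized pieces where the combination still fits.
    merged: list[str] = []
    for piece in pieces:
        if merged and count_tokens(merged[-1] + " " + piece) <= max_tokens:
            merged[-1] = merged[-1] + " " + piece
        else:
            merged.append(piece)
    return merged
```

Note how the paragraph boundary is preferred: a text with a `\n\n` break splits there first, and only oversized paragraphs are split further.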
In nexa-gauge, semchunk is not a scoring metric. It is a zero-cost utility node that prepares generation text for later nodes such as claims, refiner, grounding, and relevance.
## Use Case
Use semchunk when you need generation text split into reliable evaluation units:
- Long-form generations that exceed a convenient extraction window
- Claim extraction pipelines that need one bounded text span at a time
- Grounding and relevance runs where verdicts should trace back to source spans
- Debugging or reporting workflows that need chunk indexes and character offsets
- Cost estimation paths where chunk count drives downstream LLM-call estimates
## Node Overview (nexa-gauge)
In nexa-gauge, semchunk is the implementation behind the chunk utility node on this branch:
`generation -> scan -> semchunk`
What it does:
- Receives normalized generation text from scanner inputs
- Uses the configured `chunker` strategy; currently only `semchunk` is supported
- Counts generation tokens with nexa-gauge's shared token counter
- Returns one unchanged chunk when the generation is below the split threshold
- For longer generations, calls `semchunk.chunkerify(_count_tokens, chunk_size)`
- Emits `Chunk` records with:
  - sequential chunk index
  - chunk text
  - token count
  - source character start/end offsets
  - SHA-256 hash of the chunk text
- Reports zero model cost because no LLM call is made
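The record-building step above can be sketched as follows. The `Chunk` dataclass and `to_chunks` helper are hypothetical shapes for illustration; field names follow the bullets above, not necessarily nexa-gauge's actual classes, and the token counter is a whitespace stand-in.

```python
import hashlib
from dataclasses import dataclass

@dataclass
class Chunk:  # hypothetical record shape
    index: int
    text: str
    tokens: int
    char_start: int
    char_end: int
    sha256: str

def to_chunks(generation: str, pieces: list[str]) -> list[Chunk]:
    out, cursor = [], 0
    for i, piece in enumerate(pieces):
        start = generation.index(piece, cursor)  # locate span in the source text
        end = start + len(piece)
        out.append(Chunk(
            index=i,
            text=piece,
            tokens=len(piece.split()),           # stand-in token counter
            char_start=start,
            char_end=end,
            sha256=hashlib.sha256(piece.encode()).hexdigest(),
        ))
        cursor = end                             # offsets stay monotonic
    return out
```

Searching forward from `cursor` keeps offsets monotonic even when two chunks contain identical text.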
Relevant constants:
```python
GENERATION_CHUNK_SIZE_TOKENS = 100
CHUNK_MIN_TOKENS_FOR_SPLIT = 100
DEFAULT_CHUNKER_STRATEGY = "semchunk"
```
Skip behavior:
- If generation is unavailable, the graph treats `chunk` as ineligible and emits the configured empty utility artifact.
- If the selected chunker is not `semchunk`, the graph raises an unsupported-strategy error.
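Putting the skip rules and the split threshold together, the dispatch reads roughly like this. It is a sketch with assumed names: only `CHUNK_MIN_TOKENS_FOR_SPLIT` comes from the constants above; `run_chunk` and `split_fn` (standing in for the chunkerify-built chunker) are hypothetical.

```python
CHUNK_MIN_TOKENS_FOR_SPLIT = 100

def run_chunk(generation, chunker, count_tokens, split_fn):
    # Sketch of the eligibility/dispatch rules; names other than the
    # constant are hypothetical.
    if not generation:
        return []  # ineligible: emit the empty utility artifact
    if chunker != "semchunk":
        raise ValueError(f"unsupported chunker strategy: {chunker}")
    if count_tokens(generation) <= CHUNK_MIN_TOKENS_FOR_SPLIT:
        return [generation]  # short generations pass through as one chunk
    return split_fn(generation)
```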
## Execution Flow
### Input
Using your sample input:
```json
{
  "case_id": "hamlet-long-answer",
  "question": "What is the central theme of Hamlet?",
  "generation": "The central theme of Hamlet is mortality and the paralysis that arises from contemplating it. Through the famous 'To be or not to be' soliloquy and repeated encounters with death, Shakespeare explores how consciousness of death impedes decisive action. Hamlet's hesitation is shaped by grief, uncertainty, philosophy, and the political danger around him.",
  "context": "Hamlet is a tragedy by William Shakespeare...",
  "reference": "Hamlet explores death, uncertainty, revenge, and the difficulty of action."
}
```

Fields used by the chunk branch:
- `generation`: used as the source text to split
- `case_id`: used for case identity/reporting, not chunking logic
Fields not used by semchunk:
- `question`: not used by `chunk` (used by `relevance`)
- `context`: not used by `chunk` (used by `grounding`)
- `reference`: not used by `chunk` (used by `reference`)
Direct node signature in code: `run(item: Item) -> ChunkArtifacts`.
### Output
Primary output type: `ChunkArtifacts`

- `chunks: list[Chunk]`
- `cost: CostEstimate`
Example output:
```json
{
  "chunks": [
    {
      "index": 0,
      "item": {
        "id": "bfe5e7f03ba02203",
        "text": "The central theme of Hamlet is mortality and the paralysis that arises from contemplating it. Through the famous 'To be or not to be' soliloquy and repeated encounters with death, Shakespeare explores how consciousness of death impedes decisive action.",
        "tokens": 45.0,
        "confidence": 1.0,
        "cached": false
      },
      "char_start": 0,
      "char_end": 252,
      "sha256": "bfe5e7f03ba02203d7949a1b5d6ee92b9ff6b46a7b3ecdc77b0a2a218ca3da78"
    },
    {
      "index": 1,
      "item": {
        "id": "a7f10228b5420eda",
        "text": "Hamlet's hesitation is shaped by grief, uncertainty, philosophy, and the political danger around him.",
        "tokens": 17.0,
        "confidence": 1.0,
        "cached": false
      },
      "char_start": 253,
      "char_end": 354,
      "sha256": "a7f10228b5420edaeb4f51fa53ef4b54ca209f200b668dd1d08f7ed732c6e3df"
    }
  ],
  "cost": {
    "cost": 0,
    "input_tokens": 0,
    "output_tokens": 0
  }
}
```

Attribute meaning:
- `chunks`: ordered spans produced from the generation
- `chunks[].index`: zero-based chunk position
- `chunks[].item.id`: auto-generated hash-based ID from chunk text
- `chunks[].item.text`: chunk text passed to downstream nodes
- `chunks[].item.tokens`: token count for the chunk text
- `chunks[].item.confidence`: default `Item` confidence field
- `chunks[].item.cached`: cache marker field on `Item`
- `chunks[].char_start`: start offset in the original generation
- `chunks[].char_end`: end offset in the original generation
- `chunks[].sha256`: full SHA-256 digest of the chunk text
- `cost.cost`: always `0` for this node, since no external LLM call is made
- `cost.input_tokens`, `cost.output_tokens`: always `0` because chunking is local
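Because chunks carry character offsets and a content digest, artifacts can be cross-checked against the source generation. A small illustration with toy data (not the sample above):

```python
import hashlib

generation = "First sentence here. Second sentence here."
chunk = {"text": "First sentence here.", "char_start": 0, "char_end": 20}

# Offsets slice back to the chunk text exactly...
assert generation[chunk["char_start"]:chunk["char_end"]] == chunk["text"]

# ...and the digest is the SHA-256 of that same text.
digest = hashlib.sha256(chunk["text"].encode()).hexdigest()
```

This kind of round-trip check is useful when debugging offset drift after text normalization.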
## Usage
```bash
OUTPUT_DIR=./out/semchunk
mkdir -p "$OUTPUT_DIR"
```

### Estimate Cost
```bash
nexagauge estimate chunk \
  --input ./sample.json \
  --limit 5
```

Note: the chunk estimate itself is zero-cost. The useful output is the chunk artifact and its downstream impact, since the number of chunks produced drives later LLM-call estimates.
### Run Utility
```bash
nexagauge run chunk \
  --input ./sample.json \
  --limit 5 \
  --output-dir "$OUTPUT_DIR" \
  --chunker semchunk
```

For a full evaluation run that uses Semchunk before refinement, claim extraction, and metrics:
```bash
nexagauge run eval \
  --input ./sample.json \
  --limit 5 \
  --output-dir "$OUTPUT_DIR" \
  --chunker semchunk
```