GEval Steps (geval_steps)
Overview
The geval_steps node implements the “auto-evaluation-steps” part of the G-Eval approach: turning a high-level criterion into concrete, testable evaluation steps before output quality is scored.
In G-Eval (arXiv:2303.16634), the core idea is to improve LLM-as-a-judge reliability by combining task criteria, chain-of-thought-style intermediate evaluation steps, and structured scoring. nexa-gauge mirrors this by generating 2-3 measurable steps from the criteria when explicit `evaluation_steps` are not provided.
This gives two practical benefits for evaluation pipelines:
- You can define concise criteria once, and let the system synthesize consistent step-level rubrics.
- You can cache generated steps by signature (criteria + model + prompt/parser version), so repeated runs avoid extra LLM cost.
The node does not assign final metric scores. It prepares resolved steps that downstream geval scoring uses. Metrics that already include evaluation_steps bypass generation and are passed through as-is.
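The caching idea can be illustrated with a small hash over the inputs that define a step set. This is a minimal sketch, not nexa-gauge's implementation; the function name, argument names, and hash construction are assumptions:

```python
import hashlib
import json


def step_signature(criteria: str, model: str, prompt_version: str, parser_version: str) -> str:
    """Hypothetical sketch: derive a stable cache key for generated steps.

    Runs that share the same criteria, judge model, and prompt/parser version
    map to the same signature, so previously generated steps can be reused
    instead of paying for another LLM call.
    """
    payload = json.dumps(
        {
            "criteria": criteria,
            "model": model,
            "prompt_version": prompt_version,
            "parser_version": parser_version,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:24]
```

A cache keyed on such a signature invalidates automatically whenever the criteria text, judge model, or prompt/parser version changes.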
Use Case
Use geval_steps when you want rubric-driven evaluation without manually writing steps for every metric.
- Standardize judge instructions from short criteria text
- Reduce prompt-writing overhead for many custom GEval metrics
- Reuse cached step artifacts across repeated runs
- Mix explicit step metrics and criteria-only metrics in one config
- Keep scoring behavior consistent by resolving all metrics to step lists first
Node Overview
In nexa-gauge, geval_steps is a preprocessing node for the GEval branch.
- Input source: `inputs.geval.metrics` from scanner-normalized record input
- For each GEval metric:
  - If `evaluation_steps` exists and is non-empty: mark `steps_source="provided"`
  - Else if `criteria` exists: resolve from cache or generate via LLM
  - Else: skip the metric (no criteria and no steps)
- Output object: `GevalStepsArtifacts` (resolved steps + step-generation cost)
Important behavior:
- Duplicate metric names get deterministic keys (`name#1`, `name#2`, ...).
- Only criteria-based metrics trigger step generation.
- Metrics with user-provided `evaluation_steps` do not call the step-generation prompt.
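Putting these rules together, the per-metric resolution can be sketched as follows. This is an illustrative approximation, not the node's source; the helper names and the exact duplicate-key suffixing are assumptions:

```python
import hashlib
from typing import Callable


def resolve_metrics(
    metrics: list[dict],
    generate_steps: Callable[[str], list[str]],   # LLM-backed step generation
    cache: dict[str, list[str]],                  # signature -> previously generated steps
) -> list[dict]:
    resolved: list[dict] = []
    seen: dict[str, int] = {}

    for metric in metrics:
        # Duplicate metric names get deterministic keys (name, name#1, name#2, ...);
        # the exact suffix scheme here is an assumption.
        name = metric["name"]
        n = seen.get(name, 0)
        seen[name] = n + 1
        key = name if n == 0 else f"{name}#{n}"

        if metric.get("evaluation_steps"):
            # User-provided steps pass through untouched; no LLM call, no signature.
            steps, source, signature = metric["evaluation_steps"], "provided", None
        elif metric.get("criteria"):
            # Simplified signature; the real one also folds in model and prompt/parser version.
            signature = hashlib.sha256(metric["criteria"].encode("utf-8")).hexdigest()[:24]
            if signature in cache:
                steps, source = cache[signature], "cache_used"
            else:
                steps, source = generate_steps(metric["criteria"]), "generated"
                cache[signature] = steps
        else:
            continue  # no criteria and no steps: the metric is skipped

        resolved.append({
            "key": key,
            "name": name,
            "item_fields": metric.get("item_fields", []),
            "evaluation_steps": steps,
            "steps_source": source,
            "signature": signature,
        })

    return resolved
```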
Execution Flow
Input
Using your sample input, these fields are used by the geval_steps branch:
```json
{
"case_id": "gpt-rag-explanation-large",
"question": "Explain retrieval-augmented generation (RAG) and when it should be used over fine-tuning.",
"generation": "Retrieval-Augmented Generation (RAG) is a technique that combines ....",
"geval": {
"metrics": [
{
"name": "rag_concept_coverage",
"item_fields": ["question", "generation"],
"criteria": "The answer clearly explains what retrieval-augmented generation is and when to prefer it over fine-tuning."
},
{
"name": "retrieval_pipeline_steps",
"item_fields": ["question", "generation"],
"evaluation_steps": [
"Verify the answer describes embedding the query and retrieving semantically similar passages.",
"Verify the answer explains that retrieved passages are injected into the model context before generation.",
"Verify the answer does not claim RAG completely eliminates hallucinations."
]
},
{
"name": "reference_alignment",
"item_fields": ["generation", "reference"],
"criteria": "The answer must align with the provided reference answer."
}
]
}
}
```
- `generation`: used as the branch eligibility gate (`has_generation` must be true)
- `geval.metrics[*].name`: used in the output `name` and the dedup-safe `key`
- `geval.metrics[*].item_fields`: copied to the resolved artifacts
- `geval.metrics[*].evaluation_steps`: if present, used directly (no LLM generation)
- `geval.metrics[*].criteria`: used only when `evaluation_steps` is empty
Field-by-field for your three sample metrics:
- `rag_concept_coverage`: uses `criteria` -> generates (or cache-loads) steps
- `retrieval_pipeline_steps`: uses provided `evaluation_steps` directly
- `reference_alignment`: uses `criteria` -> generates (or cache-loads) steps
Not used by geval_steps for generation logic:
- `question`, `reference`, and the text content themselves
- These are used later by `geval` scoring, not by step resolution
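If you want to know in advance which metrics will incur step-generation calls, the routing can be checked directly from the input record. A minimal sketch, assuming sample.json holds a single record shaped like the example above:

```python
import json

with open("sample.json") as f:
    record = json.load(f)

# Branch eligibility gate: the geval_steps branch only runs when a generation exists.
assert record.get("generation"), "has_generation must be true"

for metric in record["geval"]["metrics"]:
    if metric.get("evaluation_steps"):
        print(f'{metric["name"]}: provided steps, no LLM call')
    elif metric.get("criteria"):
        print(f'{metric["name"]}: criteria only, steps will be generated or cache-loaded')
    else:
        print(f'{metric["name"]}: skipped (no criteria and no steps)')
```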
Output
For this node, the concrete output type is `GevalStepsArtifacts`:
- `resolved_steps: list[GevalStepsResolved]`
- `cost: CostEstimate`
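For reference, the shapes below are inferred from the example output that follows; they are an approximation (including the StepItem name), not the library's actual class definitions:

```python
from dataclasses import dataclass


@dataclass
class StepItem:                      # class name assumed; one entry in evaluation_steps
    id: str
    text: str
    tokens: float
    confidence: float
    cached: bool


@dataclass
class GevalStepsResolved:
    key: str                         # dedup-safe metric key
    name: str                        # original metric name from the input
    item_fields: list[str]
    evaluation_steps: list[StepItem]
    steps_source: str                # "provided", "generated", or "cache_used"
    signature: str | None            # None for provided-step metrics


@dataclass
class CostEstimate:
    cost: float                      # USD spent on step generation in this node
    input_tokens: float
    output_tokens: float


@dataclass
class GevalStepsArtifacts:
    resolved_steps: list[GevalStepsResolved]
    cost: CostEstimate
```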
Example output:
```json
{
"resolved_steps": [
{
"key": "rag_concept_coverage",
"name": "rag_concept_coverage",
"item_fields": ["question", "generation"],
"evaluation_steps": [
{
"id": "f3e87c1c2e45d214",
"text": "Check whether the answer defines RAG as retrieval + generation over external knowledge.",
"tokens": 17.0,
"confidence": 1.0,
"cached": false
},
{
"id": "6b9b8692a6cf05fa",
"text": "Check whether the answer explains when RAG is preferable to fine-tuning (freshness, source grounding, lower retraining cost).",
"tokens": 22.0,
"confidence": 1.0,
"cached": false
}
],
"steps_source": "generated",
"signature": "a3f9f8cb9b2a40cde15a734a"
},
{
"key": "retrieval_pipeline_steps",
"name": "retrieval_pipeline_steps",
"item_fields": ["question", "generation"],
"evaluation_steps": [
{
"id": "247f5bf2293f6051",
"text": "Verify the answer describes embedding the query and retrieving semantically similar passages.",
"tokens": 14.0,
"confidence": 1.0,
"cached": false
},
{
"id": "38ed85d809f4a742",
"text": "Verify the answer explains that retrieved passages are injected into the model context before generation.",
"tokens": 16.0,
"confidence": 1.0,
"cached": false
},
{
"id": "da6cb2619f1531c9",
"text": "Verify the answer does not claim RAG completely eliminates hallucinations.",
"tokens": 12.0,
"confidence": 1.0,
"cached": false
}
],
"steps_source": "provided",
"signature": null
},
{
"key": "reference_alignment",
"name": "reference_alignment",
"item_fields": ["generation", "reference"],
"evaluation_steps": [
{
"id": "8cc05dca4d2a4f8c",
"text": "Check if the generation preserves the key facts in the reference answer.",
"tokens": 13.0,
"confidence": 1.0,
"cached": false
},
{
"id": "d58c90b7035f0563",
"text": "Penalize contradictions or missing critical details relative to the reference.",
"tokens": 12.0,
"confidence": 1.0,
"cached": false
}
],
"steps_source": "generated",
"signature": "9c4e1e93cc1a2b11b97d63f4"
}
],
"cost": {
"cost": 0.00071,
"input_tokens": 248.0,
"output_tokens": 68.0
}
}
```
Attribute meanings:
- `resolved_steps`: one resolved artifact per GEval metric that was processed
- `key`: unique metric key (a `#n` suffix is added only when metric names repeat)
- `name`: original metric name from the input
- `item_fields`: fields required later by `geval` scoring
- `evaluation_steps`: finalized list of step items for this metric
- `steps_source`: `provided`, `generated`, or `cache_used`
- `signature`: cache signature for criteria-generated metrics; `null` for provided-step metrics
- `cost.cost`: USD spent in this node (typically only for generated steps)
- `cost.input_tokens` / `cost.output_tokens`: token totals for step-generation calls
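Downstream code or a quick inspection script can consume the artifact as plain JSON. A minimal sketch, assuming the artifact above was saved to geval_steps_artifacts.json (the file name is illustrative):

```python
import json
from collections import Counter

with open("geval_steps_artifacts.json") as f:
    artifacts = json.load(f)

# Count how many metrics were resolved through each path.
by_source = Counter(m["steps_source"] for m in artifacts["resolved_steps"])
print(dict(by_source))  # e.g. {'generated': 2, 'provided': 1}

# Total step texts that downstream geval scoring will receive.
total_steps = sum(len(m["evaluation_steps"]) for m in artifacts["resolved_steps"])
print(f"{total_steps} steps across {len(artifacts['resolved_steps'])} metrics")

# Step-generation spend attributed to this node.
cost = artifacts["cost"]
print(f'${cost["cost"]:.5f} ({cost["input_tokens"]:.0f} input / {cost["output_tokens"]:.0f} output tokens)')
```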
Usage
```bash
OUTPUT_DIR=./out/geval-steps
mkdir -p "$OUTPUT_DIR"
```
CLI: Estimate Cost
```bash
nexagauge estimate geval_steps \
  --input ./sample.json \
  --limit 5 \
  | tee "$OUTPUT_DIR/geval_steps_estimate.txt"
```
`estimate` supports `--input` and `--limit`; to persist the output in an output directory, redirect or tee it to a file under `OUTPUT_DIR`.
CLI: Run Evaluation
```bash
nexagauge run geval_steps \
  --input ./sample.json \
  --output-dir "$OUTPUT_DIR" \
  --limit 5
```
If you want per-case report JSON files, run the terminal node:
```bash
nexagauge run eval \
  --input ./sample.json \
  --output-dir "$OUTPUT_DIR" \
  --limit 5
```