GEval Score (geval)
Overview
geval is the scoring stage of nexa-gauge's GEval branch. It applies the "LLM-as-a-judge with explicit evaluation steps" pattern from G-Eval (arXiv:2303.16634): generation quality is judged against structured, metric-specific evaluation steps rather than only lexical-overlap metrics.
In the paper, the key idea is to improve alignment with human judgments by pairing evaluation criteria with concrete intermediate steps. nexa-gauge operationalizes this in two phases: geval_steps resolves steps (passed through from provided `evaluation_steps`, or generated from `criteria`), then geval scores each metric with the judge model using those resolved steps.
This node is useful when you need rubric-driven answer quality scoring across custom dimensions like concept coverage, procedural correctness, reference alignment, and other task-specific checks. Each metric is scored independently, with machine-readable pass/fail reasoning attached per metric result.
geval is an answer-quality metric. It does not perform claim extraction, grounding support checks, or reference n-gram similarity. It consumes already-resolved GEval metric definitions and produces normalized metric outputs and token/cost accounting.
Use Case
Use geval when you want customizable, rubric-based evaluation of generated answers.
- Evaluate domain-specific criteria not covered by generic metrics
- Mix explicit `evaluation_steps` with criteria-generated steps (see the config sketch after this list)
- Run consistent grading for QA, RAG, assistant responses, and summarization
- Add interpretable per-metric reasoning and pass/fail signals
- Track score plus token/cost usage for evaluation governance
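For illustration, a GEval metric config mixing both styles might look like the sketch below. The metric names and `item_fields` mirror the sample used later on this page, but the exact config schema shown here is an assumption for illustration, not the authoritative format.

```python
# Hypothetical GEval metric config (illustrative schema, not the authoritative format).
# "rag_concept_coverage" ships explicit evaluation_steps; "reference_alignment" gives
# only criteria, so geval_steps must generate (or cache-load) its steps.
geval_metrics = [
    {
        "name": "rag_concept_coverage",
        "item_fields": ["question", "generation"],
        "evaluation_steps": [
            "Check whether the answer defines RAG.",
            "Check whether the answer contrasts RAG with fine-tuning.",
        ],
    },
    {
        "name": "reference_alignment",
        "item_fields": ["generation", "reference"],
        "criteria": "Does the answer agree with the reference text?",
    },
]
```

In this sketch, geval_steps would pass the first metric's steps through unchanged and generate steps for the second from its `criteria`.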
Node Overview
In nexa-gauge, geval is the scoring node after geval_steps.
- Branch: `scan -> geval_steps -> geval`
- `scan` normalizes input record fields into typed `Inputs`
- `geval_steps` builds `resolved_steps`:
  - pass-through for metrics with provided `evaluation_steps`
  - generated/cache-loaded steps for metrics with only `criteria`
- `geval` scores each resolved metric using deepeval's `GEval`
Per-metric behavior in geval (sketched in code after this list):
- Validates required `item_fields` (`question`, `generation`, `reference`, `context`)
- Skips the metric with an `error` if any required field is missing
- Skips the metric with an `error` if resolved steps are empty
- Otherwise returns a `MetricResult` with:
  - `score`
  - `result[0].passed` (`score >= 0.5`)
  - `result[0].reasoning`
  - `result[0].tokens`
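The sketch below re-implements that behavior in plain Python to make the control flow concrete. It is not the nexa-gauge source; `judge_with_steps` is a stub standing in for the deepeval `GEval` judge call, and the dict literals mirror the `MetricResult` fields listed above.

```python
# Illustrative re-implementation of the per-metric behavior described above.
# METRIC_PASS_THRESHOLD (0.5) mirrors the constant named on this page.
METRIC_PASS_THRESHOLD = 0.5

def judge_with_steps(record, steps):
    # Stub for the LLM judge call; returns (score, reasoning, token_count).
    return 0.83, "Covers the required concepts.", 19

def score_metric(metric, record, steps):
    # Skip with an error if any required record field is missing.
    missing = [f for f in metric["item_fields"] if f not in record]
    if missing:
        return {"name": metric["name"], "category": "answer", "score": None, "result": None,
                "error": f"Skipped GEval metric due to missing required record fields: {', '.join(missing)}."}
    # Skip with an error if step resolution produced nothing.
    if not steps:
        return {"name": metric["name"], "category": "answer", "score": None, "result": None,
                "error": "Skipped GEval metric due to empty resolved steps."}
    # Otherwise judge the record against the resolved steps.
    score, reasoning, tokens = judge_with_steps(record, steps)
    return {"name": metric["name"], "category": "answer", "score": score,
            "result": [{"passed": score >= METRIC_PASS_THRESHOLD,
                        "reasoning": reasoning, "tokens": tokens}],
            "error": None}
```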
Execution Flow
Input
Using your sample input, the geval scoring node ultimately uses:
- `generation` text
- `question` text
- `reference` text (only if a metric's `item_fields` includes `reference`)
- `context` text (only if a metric's `item_fields` includes `context`)
- `geval.metrics[*]`, indirectly via `geval_steps.resolved_steps`
How your sample maps at runtime:
- `rag_concept_coverage` (`item_fields: [question, generation]`): scored
- `retrieval_pipeline_steps` (`item_fields: [question, generation]`): scored
- `reference_alignment` (`item_fields: [generation, reference]`): skipped with an error, because the sample input does not include `reference`
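For orientation, a record consistent with that mapping would look roughly like the sketch below (field values invented; this is not the actual contents of ./sample.json):

```python
# Hypothetical input record consistent with the mapping above (values invented).
sample_record = {
    "question": "How does RAG differ from fine-tuning?",
    "generation": "RAG retrieves documents at query time and injects them into the prompt...",
    # no "reference" -> reference_alignment is skipped with an error
    # no "context"   -> context-dependent metrics would likewise be skipped
}
```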
Direct node signature in code:
```python
run(resolved_artifacts, generation, question, reference, context)
```
So geval does not read raw criteria directly. It reads resolved metric artifacts output by geval_steps.
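To make the call shape concrete, here is a sketch of how normalized record fields feed that signature. `GevalNode` is a stub, not the nexa-gauge class; only the argument list follows the signature above.

```python
# Sketch: mapping record fields onto the run(...) signature above.
class GevalNode:  # stub stand-in for the real scoring node
    def run(self, resolved_artifacts, generation, question, reference, context):
        print("scoring", list(resolved_artifacts), "| reference present:", reference is not None)

record = {
    "question": "How does RAG differ from fine-tuning?",
    "generation": "RAG retrieves documents at query time and injects them into the prompt.",
}
GevalNode().run(
    resolved_artifacts={"rag_concept_coverage": ["step 1", "step 2"]},  # from geval_steps
    generation=record.get("generation"),
    question=record.get("question"),
    reference=record.get("reference"),  # None -> reference_alignment would be skipped
    context=record.get("context"),
)
```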
Output
For `geval/score.py`, the concrete output type is `GevalMetrics` in `nexa_gauge_core/types.py`:
- `metrics: list[MetricResult]`
- `cost: CostEstimate | None`
Note: `RelevanceMetrics` has a similar top-level shape (`metrics` + `cost`), but this node returns `GevalMetrics`.
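The sketch below reconstructs those shapes from the example output that follows. The authoritative definitions live in `nexa_gauge_core/types.py` and may differ in detail; `MetricStepResult` is a made-up name for the `result[*]` payload.

```python
# Sketch of the output types, inferred from the example output below.
from dataclasses import dataclass

@dataclass
class MetricStepResult:  # hypothetical name for the result[*] payload
    passed: bool
    reasoning: str
    tokens: int

@dataclass
class MetricResult:
    name: str
    category: str                           # always "answer" for this node
    score: float | None                     # None when skipped/errored
    result: list[MetricStepResult] | None
    error: str | None

@dataclass
class CostEstimate:
    cost: float                             # summed evaluation cost in USD
    input_tokens: float
    output_tokens: float

@dataclass
class GevalMetrics:
    metrics: list[MetricResult]
    cost: CostEstimate | None
```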
Example output for your sample input:
```json
{
"metrics": [
{
"name": "rag_concept_coverage",
"category": "answer",
"score": 0.83,
"result": [
{
"passed": true,
"reasoning": "The response explains RAG and contrasts it with fine-tuning, including update cadence and cost tradeoffs.",
"tokens": 19
}
],
"error": null
},
{
"name": "retrieval_pipeline_steps",
"category": "answer",
"score": 0.66,
"result": [
{
"passed": true,
"reasoning": "It covers retrieval and context injection, but caveats about hallucinations are only partially explicit.",
"tokens": 18
}
],
"error": null
},
{
"name": "reference_alignment",
"category": "answer",
"score": null,
"result": null,
"error": "Skipped GEval metric due to missing required record fields: reference."
}
],
"cost": {
"cost": 0.00102,
"input_tokens": 312.0,
"output_tokens": 74.0
}
}
```

Attribute meanings:
- `metrics`: one `MetricResult` per resolved GEval metric
  - `name`: metric name from the GEval config
  - `category`: always `answer` for this node
  - `score`: numeric metric score when evaluated; `null` when skipped/errored
  - `result`: list payload for successful evaluations
    - `result[].passed`: boolean thresholded by `METRIC_PASS_THRESHOLD` (0.5)
    - `result[].reasoning`: judge explanation text
    - `result[].tokens`: token count of the reasoning text
  - `error`: skip/failure reason for that metric
- `cost.cost`: summed evaluation cost in USD
- `cost.input_tokens` / `cost.output_tokens`: aggregate usage from the scoring calls
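For downstream reporting, a helper like the following can bucket a `GevalMetrics`-shaped payload by outcome. This is a sketch against the attribute meanings above; `summarize` is a hypothetical helper, not part of nexa-gauge.

```python
# Hypothetical helper (not part of nexa-gauge): bucket a GevalMetrics-shaped
# payload, like the example output above, into passed/failed/skipped metrics.
def summarize(payload: dict) -> dict:
    scored = [m for m in payload["metrics"] if m["error"] is None]
    return {
        "passed": [m["name"] for m in scored if m["result"][0]["passed"]],
        "failed": [m["name"] for m in scored if not m["result"][0]["passed"]],
        "skipped": [m["name"] for m in payload["metrics"] if m["error"] is not None],
        "total_cost_usd": (payload.get("cost") or {}).get("cost"),
    }
```

On the example output above, this returns `passed=["rag_concept_coverage", "retrieval_pipeline_steps"]`, `skipped=["reference_alignment"]`, and `total_cost_usd=0.00102`.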
Usage
```bash
OUTPUT_DIR=./out/geval-score
mkdir -p "$OUTPUT_DIR"
```

CLI: Estimate Cost
```bash
nexagauge estimate geval \
  --input ./sample.json \
  --limit 5 \
  | tee "$OUTPUT_DIR/geval_estimate.txt"
```

`estimate` supports `--input` and `--limit`; it does not expose `--output-dir`, so save its output into your output directory with `tee`.
CLI: Run Evaluation
```bash
nexagauge run geval \
  --input ./sample.json \
  --output-dir "$OUTPUT_DIR" \
  --limit 5
```

If you want per-case report JSON files, run through `eval`:
```bash
nexagauge run eval \
  --input ./sample.json \
  --output-dir "$OUTPUT_DIR" \
  --limit 5
```
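Once reports are written, a small script can triage them. The sketch below only assumes the `GevalMetrics` shape shown earlier; the file names and directory layout inside `OUTPUT_DIR` depend on your nexa-gauge version.

```python
# Post-run triage sketch: walk the output directory for JSON payloads shaped
# like the GevalMetrics example above and report skipped or failing metrics.
# File names/layout inside the output directory are assumptions.
import json
from pathlib import Path

for path in Path("./out/geval-score").glob("**/*.json"):
    data = json.loads(path.read_text())
    if not isinstance(data, dict) or "metrics" not in data:
        continue  # not a GevalMetrics-shaped payload
    for metric in data["metrics"]:
        if metric.get("error"):
            print(f"{path.name}: {metric['name']} skipped -> {metric['error']}")
        elif metric.get("result") and not metric["result"][0]["passed"]:
            print(f"{path.name}: {metric['name']} failed (score={metric['score']})")
```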