GEval Score (geval)

Overview

geval is the scoring stage of nexa-gauge’s GEval branch. It applies the “LLM-as-a-judge with explicit evaluation steps” pattern from G-Eval (arXiv:2303.16634): evaluate generation quality against structured, metric-specific steps rather than relying only on lexical overlap metrics.

The paper's key idea is to improve alignment with human judgments by giving the judge explicit evaluation criteria plus concrete intermediate steps. nexa-gauge operationalizes this in two phases: geval_steps resolves the steps (passing through provided evaluation_steps, or generating them from criteria), then geval scores each metric against those resolved steps with the judge model.
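
To make the two phases concrete, here is a hypothetical pair of metric definitions. The exact configuration schema is not shown in this document, so the dict shape is an assumption; only the field names evaluation_steps, criteria, and item_fields come from this doc.

python
# Hypothetical geval.metrics entries; the exact schema is an assumption.
metrics = [
    {
        # Explicit steps: geval_steps passes these through unchanged.
        "name": "procedural_correctness",
        "item_fields": ["question", "generation"],
        "evaluation_steps": [
            "Identify the procedure the question asks about.",
            "Check that the generation lists its steps in the correct order.",
        ],
    },
    {
        # Criteria only: geval_steps generates (or cache-loads) the steps.
        "name": "concept_coverage",
        "item_fields": ["question", "generation"],
        "criteria": "Does the generation cover the key concepts the question asks about?",
    },
]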

This node is useful when you need rubric-driven answer-quality scoring across custom dimensions such as concept coverage, procedural correctness, reference alignment, and other task-specific checks. Each metric is scored independently, and each metric result carries machine-readable pass/fail status and reasoning.

geval is an answer-quality metric. It does not perform claim extraction, grounding support checks, or reference n-gram similarity. It consumes already-resolved GEval metric definitions and produces normalized metric outputs and token/cost accounting.

Use Case

Use geval when you want customizable, rubric-based evaluation of generated answers.

  • Evaluate domain-specific criteria not covered by generic metrics
  • Mix explicit evaluation_steps with criteria-generated steps
  • Run consistent grading for QA, RAG, assistant responses, and summarization
  • Add interpretable per-metric reasoning and pass/fail signals
  • Track score plus token/cost usage for evaluation governance

Node Overview

In nexa-gauge, geval is the scoring node after geval_steps.

  • Branch: scan -> geval_steps -> geval
  • scan normalizes input record fields into typed Inputs
  • geval_steps builds resolved_steps (the resolution rule is sketched after this list):
    • pass-through for metrics with provided evaluation_steps
    • generated/cache-loaded steps for metrics with only criteria
  • geval scores each resolved metric using Deepeval GEval
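
A minimal, self-contained sketch of that resolution rule; generate_steps is a stand-in for the judge-model generation (or cache load) the real node performs:

python
# Sketch of the geval_steps resolution rule; generate_steps is a stub.
def resolve_steps(metric: dict) -> list[str]:
    if metric.get("evaluation_steps"):
        # Pass-through: explicit steps win.
        return metric["evaluation_steps"]
    if metric.get("criteria"):
        # Derive steps from criteria (generated or cache-loaded in the real node).
        return generate_steps(metric["criteria"])
    return []

def generate_steps(criteria: str) -> list[str]:
    # Placeholder for the judge call; returns a single trivial step.
    return [f"Assess the generation against: {criteria}"]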

Per-metric behavior in geval (a compact sketch follows this list):

  • Validates the metric's required item_fields (drawn from question, generation, reference, context)
  • Skips the metric with an error if any required field is missing
  • Skips the metric with an error if its resolved steps are empty
  • Otherwise returns MetricResult with:
    • score
    • result[0].passed (score >= 0.5)
    • result[0].reasoning
    • result[0].tokens
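
Those guardrails as a compact sketch over dict-shaped records. The missing-fields message mirrors the example output below; judge_score stubs the actual Deepeval GEval call, and tokens is approximated by whitespace splitting:

python
METRIC_PASS_THRESHOLD = 0.5  # passed = score >= 0.5

def judge_score(steps: list[str], record: dict) -> tuple[float, str]:
    # Stub for the Deepeval GEval judge call.
    return 0.8, "Stub reasoning."

def score_metric(metric: dict, steps: list[str], record: dict) -> dict:
    missing = [f for f in metric["item_fields"] if not record.get(f)]
    if missing:
        return {"name": metric["name"], "category": "answer",
                "score": None, "result": None,
                "error": "Skipped GEval metric due to missing required "
                         f"record fields: {', '.join(missing)}."}
    if not steps:
        return {"name": metric["name"], "category": "answer",
                "score": None, "result": None,
                "error": "Skipped GEval metric due to empty resolved steps."}
    score, reasoning = judge_score(steps, record)
    return {"name": metric["name"], "category": "answer", "score": score,
            "result": [{"passed": score >= METRIC_PASS_THRESHOLD,
                        "reasoning": reasoning,
                        "tokens": len(reasoning.split())}],  # rough count
            "error": None}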

Execution Flow

Graph

scan -> geval_steps -> geval

Input

Given your sample input, the geval scoring node ultimately consumes:

  • generation text
  • question text
  • reference text (only if a metric’s item_fields includes reference)
  • context text (only if a metric’s item_fields includes context)
  • geval.metrics[*] indirectly, via geval_steps.resolved_steps

How your sample maps at runtime (the implied metric entries are sketched after this list):

  • rag_concept_coverage (item_fields: [question, generation]): scored
  • retrieval_pipeline_steps (item_fields: [question, generation]): scored
  • reference_alignment (item_fields: [generation, reference]): skipped with an error, because the sample input does not include reference
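
The implied metric entries, reconstructed from this mapping (the shape is an assumption; criteria are elided):

python
# Assumed reconstruction of the sample's geval.metrics.
sample_geval_metrics = [
    {"name": "rag_concept_coverage",
     "item_fields": ["question", "generation"]},   # scored: both fields present
    {"name": "retrieval_pipeline_steps",
     "item_fields": ["question", "generation"]},   # scored
    {"name": "reference_alignment",
     "item_fields": ["generation", "reference"]},  # skipped: sample has no reference
]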

Direct node signature in code:

run(resolved_artifacts, generation, question, reference, context)

So geval does not read raw criteria directly; it reads the resolved metric artifacts emitted by geval_steps.
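
An assumed typed view of that signature, with parameter roles inferred from the Input list above (the real annotations in geval/score.py may differ):

python
def run(
    resolved_artifacts,            # geval_steps output: resolved metric artifacts
    generation: str,               # answer text under evaluation
    question: str,                 # question text
    reference: str | None,         # consumed only if a metric's item_fields includes it
    context: str | None,           # likewise optional per item_fields
) -> "GevalMetrics":
    ...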

Output

For geval/score.py, the concrete output type is GevalMetrics in nexa_gauge_core/types.py.

  • metrics: list[MetricResult]
  • cost: CostEstimate | None

Note: RelevanceMetrics has a similar top-level shape (metrics + cost), but this node returns GevalMetrics.
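
For orientation, here is a reconstruction of those types from the fields this document lists; the authoritative definitions live in nexa_gauge_core/types.py and may differ in detail. MetricResultItem is a name invented here for the result[] entries.

python
from dataclasses import dataclass

@dataclass
class MetricResultItem:           # invented name for result[] entries
    passed: bool                  # score >= METRIC_PASS_THRESHOLD (0.5)
    reasoning: str                # judge explanation text
    tokens: int                   # token count of the reasoning text

@dataclass
class MetricResult:
    name: str                     # metric name from GEval config
    category: str                 # always "answer" for this node
    score: float | None           # None when skipped/errored
    result: list[MetricResultItem] | None
    error: str | None             # skip/failure reason

@dataclass
class CostEstimate:
    cost: float                   # summed USD cost of scoring calls
    input_tokens: float
    output_tokens: float

@dataclass
class GevalMetrics:
    metrics: list[MetricResult]
    cost: CostEstimate | None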

Example output for your sample input:

json
{
  "metrics": [
    {
      "name": "rag_concept_coverage",
      "category": "answer",
      "score": 0.83,
      "result": [
        {
          "passed": true,
          "reasoning": "The response explains RAG and contrasts it with fine-tuning, including update cadence and cost tradeoffs.",
          "tokens": 19
        }
      ],
      "error": null
    },
    {
      "name": "retrieval_pipeline_steps",
      "category": "answer",
      "score": 0.66,
      "result": [
        {
          "passed": true,
          "reasoning": "It covers retrieval and context injection, but caveats about hallucinations are only partially explicit.",
          "tokens": 18
        }
      ],
      "error": null
    },
    {
      "name": "reference_alignment",
      "category": "answer",
      "score": null,
      "result": null,
      "error": "Skipped GEval metric due to missing required record fields: reference."
    }
  ],
  "cost": {
    "cost": 0.00102,
    "input_tokens": 312.0,
    "output_tokens": 74.0
  }
}

Attribute meanings (a consumer sketch follows the list):

  • metrics: one MetricResult per resolved GEval metric
  • name: metric name from GEval config
  • category: always answer for this node
  • score: numeric metric score when evaluated; null when skipped/error
  • result: list payload for successful evaluations
  • result[].passed: boolean thresholded by METRIC_PASS_THRESHOLD (0.5)
  • result[].reasoning: judge explanation text
  • result[].tokens: token count of reasoning text
  • error: skip/failure reason for that metric
  • cost.cost: summed evaluation USD cost
  • cost.input_tokens / cost.output_tokens: aggregate usage from scoring calls
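
For example, a downstream quality gate might consume this payload like so (the file path is hypothetical; point it at wherever your run writes the node output):

python
import json

# Hypothetical output path.
with open("out/geval-score/geval_metrics.json") as f:
    out = json.load(f)

for m in out["metrics"]:
    if m["error"] is not None:
        print(f"{m['name']}: skipped ({m['error']})")
        continue
    verdict = "pass" if m["result"][0]["passed"] else "fail"
    print(f"{m['name']}: {m['score']:.2f} -> {verdict}")

if out.get("cost"):
    print(f"total cost: ${out['cost']['cost']:.5f}")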

Usage

bash
OUTPUT_DIR=./out/geval-score
mkdir -p "$OUTPUT_DIR"

CLI: Estimate Cost

bash
nexagauge estimate geval \
  --input ./sample.json \
  --limit 5 \
  | tee "$OUTPUT_DIR/geval_estimate.txt"

estimate supports --input and --limit; it does not expose --output-dir, so save output into your output directory with tee.

CLI: Run Evaluation

bash
nexagauge run geval \
  --input ./sample.json \
  --output-dir "$OUTPUT_DIR" \
  --limit 5

If you want per-case report JSON files, run through eval:

bash
nexagauge run eval \
  --input ./sample.json \
  --output-dir "$OUTPUT_DIR" \
  --limit 5