GEval Score (geval)

Overview

geval is the scoring stage of nexa-gauge’s GEval branch. It applies the “LLM-as-a-judge with explicit evaluation steps” pattern from G-Eval (arXiv:2303.16634): evaluate generation quality against structured, metric-specific steps rather than relying only on lexical overlap metrics.

The paper's key idea is to improve alignment with human judgments by giving the judge explicit evaluation criteria plus concrete intermediate steps. nexa-gauge operationalizes this in two phases: geval_steps resolves the steps (passing through provided evaluation_steps, or generating them from criteria), then geval scores each metric against those resolved steps with the judge model.
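
To make the two phases concrete, here is a hypothetical pair of metric definitions. The exact configuration schema is not shown in this document, so the dict shape is an assumption; only the field names evaluation_steps, criteria, and item_fields come from this doc.

python
# Hypothetical geval.metrics entries; the exact schema is an assumption.
metrics = [
    {
        # Explicit steps: geval_steps passes these through unchanged.
        "name": "procedural_correctness",
        "item_fields": ["question", "generation"],
        "evaluation_steps": [
            "Identify the procedure the question asks about.",
            "Check that the generation lists its steps in the correct order.",
        ],
    },
    {
        # Criteria only: geval_steps generates (or cache-loads) the steps.
        "name": "concept_coverage",
        "item_fields": ["question", "generation"],
        "criteria": "Does the generation cover the key concepts the question asks about?",
    },
]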

This node is useful when you need rubric-driven answer-quality scoring across custom dimensions such as concept coverage, procedural correctness, reference alignment, and other task-specific checks. Each metric is scored independently, and each metric result carries machine-readable pass/fail status and reasoning.

geval is an answer-quality metric. It does not perform claim extraction, grounding support checks, or reference n-gram similarity. It consumes already-resolved GEval metric definitions and produces normalized metric outputs and token/cost accounting.

Use Case

Use geval when you want customizable, rubric-based evaluation of generated answers.

  • Evaluate domain-specific criteria not covered by generic metrics
  • Mix explicit evaluation_steps with criteria-generated steps
  • Run consistent grading for QA, RAG, assistant responses, and summarization
  • Add interpretable per-metric reasoning and pass/fail signals
  • Track score plus token/cost usage for evaluation governance

Node Overview

In nexa-gauge, geval is the scoring node after geval_steps.

  • Branch: scan -> geval_steps -> geval
  • scan normalizes input record fields into typed Inputs
  • geval_steps builds resolved_steps (the resolution rule is sketched after this list):
    • pass-through for metrics with provided evaluation_steps
    • generated/cache-loaded steps for metrics with only criteria
  • geval scores each resolved metric using Deepeval GEval
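
A minimal, self-contained sketch of that resolution rule; generate_steps is a stand-in for the judge-model generation (or cache load) the real node performs:

python
# Sketch of the geval_steps resolution rule; generate_steps is a stub.
def resolve_steps(metric: dict) -> list[str]:
    if metric.get("evaluation_steps"):
        # Pass-through: explicit steps win.
        return metric["evaluation_steps"]
    if metric.get("criteria"):
        # Derive steps from criteria (generated or cache-loaded in the real node).
        return generate_steps(metric["criteria"])
    return []

def generate_steps(criteria: str) -> list[str]:
    # Placeholder for the judge call; returns a single trivial step.
    return [f"Assess the generation against: {criteria}"]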

Per-metric behavior in geval (a compact sketch follows this list):

  • Validates the metric's required item_fields (drawn from question, generation, reference, context)
  • Skips the metric with an error if any required field is missing
  • Skips the metric with an error if its resolved steps are empty
  • Otherwise returns MetricResult with:
    • score
    • result[0].passed (score >= 0.5)
    • result[0].reasoning
    • result[0].tokens
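
Those guardrails as a compact sketch over dict-shaped records. The missing-fields message mirrors the example output below; judge_score stubs the actual Deepeval GEval call, and tokens is approximated by whitespace splitting:

python
METRIC_PASS_THRESHOLD = 0.5  # passed = score >= 0.5

def judge_score(steps: list[str], record: dict) -> tuple[float, str]:
    # Stub for the Deepeval GEval judge call.
    return 0.8, "Stub reasoning."

def score_metric(metric: dict, steps: list[str], record: dict) -> dict:
    missing = [f for f in metric["item_fields"] if not record.get(f)]
    if missing:
        return {"name": metric["name"], "category": "answer",
                "score": None, "result": None,
                "error": "Skipped GEval metric due to missing required "
                         f"record fields: {', '.join(missing)}."}
    if not steps:
        return {"name": metric["name"], "category": "answer",
                "score": None, "result": None,
                "error": "Skipped GEval metric due to empty resolved steps."}
    score, reasoning = judge_score(steps, record)
    return {"name": metric["name"], "category": "answer", "score": score,
            "result": [{"passed": score >= METRIC_PASS_THRESHOLD,
                        "reasoning": reasoning,
                        "tokens": len(reasoning.split())}],  # rough count
            "error": None}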

Execution Flow

Graph

scan -> geval_steps -> geval

Input

Given your sample input, the geval scoring node ultimately consumes:

  • generation text
  • question text
  • reference text (only if a metric’s item_fields includes reference)
  • context text (only if a metric’s item_fields includes context)
  • geval.metrics[*] indirectly, via geval_steps.resolved_steps

How your sample maps at runtime (the implied metric entries are sketched after this list):

  • rag_concept_coverage (item_fields: [question, generation]): scored
  • retrieval_pipeline_steps (item_fields: [question, generation]): scored
  • reference_alignment (item_fields: [generation, reference]): skipped with an error, because the sample input does not include reference
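
The implied metric entries, reconstructed from this mapping (the shape is an assumption; criteria are elided):

python
# Assumed reconstruction of the sample's geval.metrics.
sample_geval_metrics = [
    {"name": "rag_concept_coverage",
     "item_fields": ["question", "generation"]},   # scored: both fields present
    {"name": "retrieval_pipeline_steps",
     "item_fields": ["question", "generation"]},   # scored
    {"name": "reference_alignment",
     "item_fields": ["generation", "reference"]},  # skipped: sample has no reference
]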

Direct node signature in code:

run(resolved_artifacts, generation, question, reference, context)

So geval does not read raw criteria directly; it reads the resolved metric artifacts emitted by geval_steps.
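
An assumed typed view of that signature, with parameter roles inferred from the Input list above (the real annotations in geval/score.py may differ):

python
def run(
    resolved_artifacts,            # geval_steps output: resolved metric artifacts
    generation: str,               # answer text under evaluation
    question: str,                 # question text
    reference: str | None,         # consumed only if a metric's item_fields includes it
    context: str | None,           # likewise optional per item_fields
) -> "GevalMetrics":
    ...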

Output

For geval/score.py, the concrete output type is GevalMetrics in nexa_gauge_core/types.py.

  • metrics: list[MetricResult]
  • cost: CostEstimate | None

Note: RelevanceMetrics has a similar top-level shape (metrics + cost), but this node returns GevalMetrics.
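
For orientation, here is a reconstruction of those types from the fields this document lists; the authoritative definitions live in nexa_gauge_core/types.py and may differ in detail. MetricResultItem is a name invented here for the result[] entries.

python
from dataclasses import dataclass

@dataclass
class MetricResultItem:           # invented name for result[] entries
    passed: bool                  # score >= METRIC_PASS_THRESHOLD (0.5)
    reasoning: str                # judge explanation text
    tokens: int                   # token count of the reasoning text

@dataclass
class MetricResult:
    name: str                     # metric name from GEval config
    category: str                 # always "answer" for this node
    score: float | None           # None when skipped/errored
    result: list[MetricResultItem] | None
    error: str | None             # skip/failure reason

@dataclass
class CostEstimate:
    cost: float                   # summed USD cost of scoring calls
    input_tokens: float
    output_tokens: float

@dataclass
class GevalMetrics:
    metrics: list[MetricResult]
    cost: CostEstimate | None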

Example output for your sample input:

json
{
  "metrics": [
    {
      "name": "rag_concept_coverage",
      "category": "answer",
      "score": 0.83,
      "result": [
        {
          "passed": true,
          "reasoning": "The response explains RAG and contrasts it with fine-tuning, including update cadence and cost tradeoffs.",
          "tokens": 19
        }
      ],
      "error": null
    },
    {
      "name": "retrieval_pipeline_steps",
      "category": "answer",
      "score": 0.66,
      "result": [
        {
          "passed": true,
          "reasoning": "It covers retrieval and context injection, but caveats about hallucinations are only partially explicit.",
          "tokens": 18
        }
      ],
      "error": null
    },
    {
      "name": "reference_alignment",
      "category": "answer",
      "score": null,
      "result": null,
      "error": "Skipped GEval metric due to missing required record fields: reference."
    }
  ],
  "cost": {
    "cost": 0.00102,
    "input_tokens": 312.0,
    "output_tokens": 74.0
  }
}

Attribute meanings (a consumer sketch follows the list):

  • metrics: one MetricResult per resolved GEval metric
  • name: metric name from GEval config
  • category: always answer for this node
  • score: numeric metric score when evaluated; null when skipped/error
  • result: list payload for successful evaluations
  • result[].passed: boolean thresholded by METRIC_PASS_THRESHOLD (0.5)
  • result[].reasoning: judge explanation text
  • result[].tokens: token count of reasoning text
  • error: skip/failure reason for that metric
  • cost.cost: summed evaluation USD cost
  • cost.input_tokens / cost.output_tokens: aggregate usage from scoring calls
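
For example, a downstream quality gate might consume this payload like so (the file path is hypothetical; point it at wherever your run writes the node output):

python
import json

# Hypothetical output path.
with open("out/geval-score/geval_metrics.json") as f:
    out = json.load(f)

for m in out["metrics"]:
    if m["error"] is not None:
        print(f"{m['name']}: skipped ({m['error']})")
        continue
    verdict = "pass" if m["result"][0]["passed"] else "fail"
    print(f"{m['name']}: {m['score']:.2f} -> {verdict}")

if out.get("cost"):
    print(f"total cost: ${out['cost']['cost']:.5f}")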

Usage

bash
OUTPUT_DIR=./out/geval-score
mkdir -p "$OUTPUT_DIR"

CLI: Estimate Cost

bash
nexagauge estimate geval \
  --input ./sample.json \
  --limit 5 \
  | tee "$OUTPUT_DIR/geval_estimate.txt"

estimate supports --input and --limit; it does not expose --output-dir, so save output into your output directory with tee.

CLI: Run Evaluation

bash
nexagauge run geval \
  --input ./sample.json \
  --output-dir "$OUTPUT_DIR" \
  --limit 5

If you want per-case report JSON files, run through eval:

bash
nexagauge run eval \
  --input ./sample.json \
  --output-dir "$OUTPUT_DIR" \
  --limit 5