GEval Steps (geval_steps)

Overview

The geval_steps node implements the “auto-evaluation-steps” part of the G-Eval approach: it turns a high-level criterion into concrete, testable evaluation steps before output quality is scored.

In G-Eval (arXiv:2303.16634), the core idea is to improve LLM-as-a-judge reliability by combining task criteria, chain-of-thought-style intermediate evaluation steps, and structured scoring. nexa-gauge mirrors this by generating 2-3 measurable steps from the criteria when explicit evaluation_steps are not provided.

This gives two practical benefits for evaluation pipelines:

  • You can define concise criteria once, and let the system synthesize consistent step-level rubrics.
  • You can cache generated steps by signature (criteria + model + prompt/parser version), so repeated runs avoid extra LLM cost.

The node does not assign final metric scores. It prepares resolved steps that downstream geval scoring uses. Metrics that already include evaluation_steps bypass generation and are passed through as-is.
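The signature-based caching described above can be sketched as follows. This is an illustrative scheme, not nexa-gauge's actual implementation: the function name, field set, and hash truncation are assumptions, chosen only to match the shape of the signatures shown in the example output later in this page.

```python
import hashlib
import json

def step_signature(criteria: str, model: str, prompt_version: str) -> str:
    """Deterministic cache key for generated steps (illustrative sketch).

    Any change to the criteria text, the judge model, or the
    prompt/parser version yields a new signature, so a cached step
    list is never reused after one of those inputs changes.
    """
    # Canonical JSON (sorted keys) makes the hash input order-independent.
    payload = json.dumps(
        {"criteria": criteria, "model": model, "prompt_version": prompt_version},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:24]
```

Because the signature is a pure function of its inputs, repeated runs over the same criteria/model/prompt combination can look up previously generated steps instead of paying for another LLM call.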

Use Case

Use geval_steps when you want rubric-driven evaluation without manually writing steps for every metric.

  • Standardize judge instructions from short criteria text
  • Reduce prompt-writing overhead for many custom GEval metrics
  • Reuse cached step artifacts across repeated runs
  • Mix explicit step metrics and criteria-only metrics in one config
  • Keep scoring behavior consistent by resolving all metrics to step lists first

Node Overview

In nexa-gauge, geval_steps is a preprocessing node for the GEval branch.

  • Input source: inputs.geval.metrics from scanner-normalized record input
  • For each GEval metric:
    • If evaluation_steps exists and is non-empty: mark steps_source="provided"
    • Else if criteria exists: resolve from cache or generate via LLM
    • Else: skip metric (no criteria and no steps)
  • Output object: GevalStepsArtifacts (resolved steps + step-generation cost)

Important behavior:

  • Duplicate metric names get deterministic keys (name#1, name#2, ...).
  • Only criteria-based metrics trigger step generation.
  • Metrics with user-provided evaluation_steps do not call the step-generation prompt.
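The per-metric resolution rules above can be sketched in a few lines. This is a simplified stand-in, not the node's real code: generate_steps is a stub for the cache-lookup/LLM path, and the exact #n suffixing for duplicates is an assumption consistent with the behavior described above.

```python
def generate_steps(criteria: str) -> list[str]:
    # Stub standing in for the cache lookup / LLM step-generation call.
    return [f"Check: {criteria}"]

def resolve_metrics(metrics: list[dict]) -> list[dict]:
    """Illustrative sketch of the resolution rules listed above."""
    seen: dict[str, int] = {}
    resolved = []
    for m in metrics:
        name = m["name"]
        # Duplicate metric names get deterministic #n suffixes;
        # the first occurrence keeps the bare name.
        count = seen.get(name, 0)
        seen[name] = count + 1
        key = name if count == 0 else f"{name}#{count}"

        if m.get("evaluation_steps"):
            steps, source = m["evaluation_steps"], "provided"
        elif m.get("criteria"):
            steps, source = generate_steps(m["criteria"]), "generated"
        else:
            continue  # no criteria and no steps: metric is skipped
        resolved.append(
            {"key": key, "name": name, "steps": steps, "steps_source": source}
        )
    return resolved
```

Note that provided-step metrics never reach generate_steps, which is why they carry no cache signature in the output.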

Execution Flow

Graph
(execution-flow diagram not rendered here; the flow is described in the sections below)

Input

Using your sample input, these fields are used by the geval_steps branch:

json
  {
    "case_id": "gpt-rag-explanation-large",
    "question": "Explain retrieval-augmented generation (RAG) and when it should be used over fine-tuning.",
    "generation": "Retrieval-Augmented Generation (RAG) is a technique that combines ....",
    "geval": {
      "metrics": [
        {
          "name": "rag_concept_coverage",
          "item_fields": ["question", "generation"],
          "criteria": "The answer clearly explains what retrieval-augmented generation is and when to prefer it over fine-tuning."
        },
        {
          "name": "retrieval_pipeline_steps",
          "item_fields": ["question", "generation"],
          "evaluation_steps": [
            "Verify the answer describes embedding the query and retrieving semantically similar passages.",
            "Verify the answer explains that retrieved passages are injected into the model context before generation.",
            "Verify the answer does not claim RAG completely eliminates hallucinations."
          ]
        },
        {
          "name": "reference_alignment",
          "item_fields": ["generation", "reference"],
          "criteria": "The answer must align with the provided reference answer."
        }
      ]
    }
  }

  • generation: used as branch eligibility gate (has_generation must be true)
  • geval.metrics[*].name: used in output name and dedup-safe key
  • geval.metrics[*].item_fields: copied to resolved artifacts
  • geval.metrics[*].evaluation_steps: if present, used directly (no LLM generation)
  • geval.metrics[*].criteria: used only when evaluation_steps is empty

Field-by-field for your three sample metrics:

  • rag_concept_coverage: uses criteria -> generates (or cache-loads) steps
  • retrieval_pipeline_steps: uses provided evaluation_steps directly
  • reference_alignment: uses criteria -> generates (or cache-loads) steps

Not used by geval_steps for generation logic:

  • question, reference, text content themselves
  • These are used later by geval scoring, not by step resolution

Output

For this node, the concrete output type is GevalStepsArtifacts.

  • resolved_steps: list[GevalStepsResolved]
  • cost: CostEstimate

Example output:

json
{
  "resolved_steps": [
    {
      "key": "rag_concept_coverage",
      "name": "rag_concept_coverage",
      "item_fields": ["question", "generation"],
      "evaluation_steps": [
        {
          "id": "f3e87c1c2e45d214",
          "text": "Check whether the answer defines RAG as retrieval + generation over external knowledge.",
          "tokens": 17.0,
          "confidence": 1.0,
          "cached": false
        },
        {
          "id": "6b9b8692a6cf05fa",
          "text": "Check whether the answer explains when RAG is preferable to fine-tuning (freshness, source grounding, lower retraining cost).",
          "tokens": 22.0,
          "confidence": 1.0,
          "cached": false
        }
      ],
      "steps_source": "generated",
      "signature": "a3f9f8cb9b2a40cde15a734a"
    },
    {
      "key": "retrieval_pipeline_steps",
      "name": "retrieval_pipeline_steps",
      "item_fields": ["question", "generation"],
      "evaluation_steps": [
        {
          "id": "247f5bf2293f6051",
          "text": "Verify the answer describes embedding the query and retrieving semantically similar passages.",
          "tokens": 14.0,
          "confidence": 1.0,
          "cached": false
        },
        {
          "id": "38ed85d809f4a742",
          "text": "Verify the answer explains that retrieved passages are injected into the model context before generation.",
          "tokens": 16.0,
          "confidence": 1.0,
          "cached": false
        },
        {
          "id": "da6cb2619f1531c9",
          "text": "Verify the answer does not claim RAG completely eliminates hallucinations.",
          "tokens": 12.0,
          "confidence": 1.0,
          "cached": false
        }
      ],
      "steps_source": "provided",
      "signature": null
    },
    {
      "key": "reference_alignment",
      "name": "reference_alignment",
      "item_fields": ["generation", "reference"],
      "evaluation_steps": [
        {
          "id": "8cc05dca4d2a4f8c",
          "text": "Check if the generation preserves the key facts in the reference answer.",
          "tokens": 13.0,
          "confidence": 1.0,
          "cached": false
        },
        {
          "id": "d58c90b7035f0563",
          "text": "Penalize contradictions or missing critical details relative to the reference.",
          "tokens": 12.0,
          "confidence": 1.0,
          "cached": false
        }
      ],
      "steps_source": "generated",
      "signature": "9c4e1e93cc1a2b11b97d63f4"
    }
  ],
  "cost": {
    "cost": 0.00071,
    "input_tokens": 248.0,
    "output_tokens": 68.0
  }
}

Attribute meanings:

  • resolved_steps: one resolved artifact per GEval metric that was processed
  • key: unique metric key (adds #n suffix only when metric names repeat)
  • name: original metric name from input
  • item_fields: fields required later by geval scoring
  • evaluation_steps: finalized list of step Items for this metric
  • steps_source: provided, generated, or cache_used
  • signature: cache signature for criteria-generated metrics; null for provided-step metrics
  • cost.cost: USD spent in this node (typically only for generated steps)
  • cost.input_tokens / cost.output_tokens: token totals for step-generation calls
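A downstream consumer might summarize a GevalStepsArtifacts payload like this. Field names follow the example output above; the summarize function itself and its aggregation choices are illustrative, not part of nexa-gauge's API.

```python
def summarize(artifacts: dict) -> dict:
    """Aggregate a GevalStepsArtifacts-shaped dict (illustrative sketch)."""
    by_source: dict[str, int] = {}
    total_steps = 0
    for metric in artifacts["resolved_steps"]:
        source = metric["steps_source"]  # provided / generated / cache_used
        by_source[source] = by_source.get(source, 0) + 1
        total_steps += len(metric["evaluation_steps"])
    return {
        "metrics": len(artifacts["resolved_steps"]),
        "steps": total_steps,
        "by_source": by_source,
        "usd": artifacts["cost"]["cost"],
    }
```

A quick summary like this makes it easy to spot, for example, a run where every metric unexpectedly hit the LLM instead of the cache.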

Usage

bash
OUTPUT_DIR=./out/geval-steps
mkdir -p "$OUTPUT_DIR"

CLI: Estimate Cost

bash
nexagauge estimate geval_steps \
  --input ./sample.json \
  --limit 5 \
  | tee "$OUTPUT_DIR/geval_steps_estimate.txt"

The estimate subcommand accepts only --input and --limit; it does not write files itself, so to persist its output, redirect or tee it to a file under OUTPUT_DIR as shown above.

CLI: Run Evaluation

bash
nexagauge run geval_steps \
  --input ./sample.json \
  --output-dir "$OUTPUT_DIR" \
  --limit 5

If you want per-case report JSON files, run the terminal node:

bash
nexagauge run eval \
  --input ./sample.json \
  --output-dir "$OUTPUT_DIR" \
  --limit 5