Introduction

Overview

nexa-gauge is a graph-based evaluation system for LLM and LVLM application outputs. It replaces ad-hoc manual checks with a repeatable pipeline that can run on local or hosted datasets.

At a high level, nexa-gauge:

  • Normalizes raw records into a typed evaluation state.
  • Executes only the nodes required for the selected target.
  • Reuses prior node outputs through deterministic caching.
  • Produces a consistent per-case report for downstream tooling.

This architecture supports day-to-day prompt iteration, benchmark runs, and release gating with measurable quality and safety signals.
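The four stages above can be sketched as a minimal pipeline. All names here (`EvalState`, `normalize`, `run_case`) are illustrative assumptions, not nexa-gauge's actual API:

```python
# Illustrative sketch of the four pipeline stages; every name here is
# hypothetical and stands in for nexa-gauge's real internals.
from dataclasses import dataclass, field

@dataclass
class EvalState:
    question: str
    answer: str
    context: str
    results: dict = field(default_factory=dict)

def normalize(record: dict) -> EvalState:
    # Stage 1: normalize raw record fields into a typed evaluation state.
    return EvalState(
        question=record.get("question", ""),
        answer=record.get("answer", ""),
        context=record.get("context", ""),
    )

def run_case(record: dict, nodes: dict) -> dict:
    # Stages 2-4: execute only the selected nodes, collect their
    # outputs, and project them into a stable per-case report shape.
    state = normalize(record)
    for name, fn in nodes.items():
        state.results[name] = fn(state)
    return {"case": state.question, "metrics": state.results}

report = run_case(
    {"question": "Q?", "answer": "A.", "context": "C."},
    {"relevance": lambda s: 1.0},
)
```

Caching (stage 3) is omitted here for brevity; the section on execution and caching below covers it.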

Why LLM-As-A-Judge Is Necessary

Exact-match metrics are useful but limited for modern generative systems. In many real tasks, multiple answers can be valid, quality depends on context use, and failure modes are semantic rather than lexical.

LLM-as-a-judge provides scalable semantic evaluation by scoring outputs against explicit criteria. In nexa-gauge, this capability is combined with targeted metrics so teams can evaluate quality from multiple angles:

  • relevance for question-answer alignment.
  • grounding for support in provided context.
  • redteam for safety and risk behavior.
  • geval for rubric-based judgment.
  • reference for overlap with known reference answers.
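A rubric-based judge call can be sketched as follows. The prompt wording and the `call_llm` stub are assumptions for illustration, not nexa-gauge internals:

```python
# Hypothetical sketch of LLM-as-a-judge scoring against an explicit
# criterion; call_llm stands in for any text-in/text-out model client.
def build_judge_prompt(criterion: str, question: str,
                       answer: str, context: str) -> str:
    # Assemble a judging prompt with the criterion stated explicitly.
    return (
        f"Score the answer from 0 to 1 on: {criterion}.\n"
        f"Question: {question}\nContext: {context}\nAnswer: {answer}\n"
        "Reply with only the score."
    )

def judge(call_llm, criterion, question, answer, context) -> float:
    # Parse the model reply and clamp it into the [0, 1] score range.
    reply = call_llm(build_judge_prompt(criterion, question, answer, context))
    return max(0.0, min(1.0, float(reply.strip())))

# A stubbed client makes the scoring path testable without a model.
score = judge(lambda prompt: "0.8", "grounding", "Q?", "A.", "C.")
```

In practice the reply parsing would be more defensive, but the shape (criterion in, bounded score out) is what lets judge scores combine with the targeted metrics listed above.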

Execution Model And Caching

nexa-gauge provides two operational modes:

  • run executes the selected branch and returns final artifacts.
  • estimate computes the cost of eligible, not-yet-cached work before execution.

Both modes follow the same branch-planning logic, which makes cost estimates actionable before you run full evaluations.
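Because both modes share one branch plan, an estimate reduces to summing per-node cost over planned nodes whose outputs are not already cached. The cost table and cache interface below are hypothetical:

```python
# Sketch of shared-plan estimation: run would execute exactly the
# nodes that estimate prices. Costs and the cache set are illustrative.
def estimate_cost(planned_nodes, cache, node_cost):
    # Only uncached planned nodes contribute to the estimate.
    return sum(node_cost[n] for n in planned_nodes if n not in cache)

cost = estimate_cost(
    ["chunk", "claims", "grounding"],
    cache={"chunk"},                      # chunk output already cached
    node_cost={"chunk": 0.0, "claims": 0.02, "grounding": 0.05},
)
```

Since the plan is identical in both modes, the estimate stays actionable: whatever estimate prices is precisely what run will execute.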

Caching is route-aware and deterministic. Reuse occurs only when input content and routing semantics are unchanged. Changes to inputs, prompts, or model routing intentionally invalidate affected steps.
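Route-aware deterministic caching can be approximated by keying each node's output on a hash of its input content plus its model route. The key layout below is an assumption, not the actual cache format:

```python
# Hypothetical cache-key sketch: identical node, inputs, and route
# always hash to the same key; any change produces a new key and so
# invalidates reuse for the affected step.
import hashlib
import json

def cache_key(node: str, inputs: dict, model_route: str) -> str:
    payload = json.dumps(
        {"node": node, "inputs": inputs, "route": model_route},
        sort_keys=True,  # stable serialization keeps the hash deterministic
    )
    return hashlib.sha256(payload.encode()).hexdigest()

k1 = cache_key("claims", {"text": "The sky is blue."}, "route-a")
k2 = cache_key("claims", {"text": "The sky is blue."}, "route-a")
k3 = cache_key("claims", {"text": "The sky is blue."}, "route-b")
# k1 == k2: unchanged content and routing allow reuse.
# k3 differs: a routing change invalidates the cached step.
```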

Practical outcome:

  • Teams can estimate budget before execution.
  • Iterative runs avoid recomputing stable nodes.
  • Results remain reproducible under fixed inputs and model routes.

Architecture

Graph
(evaluation graph diagram not rendered)

Node Summary

Input And Orchestration

Node     Purpose
scan     Normalizes record fields and initializes case state.
eval     Aggregates metric branches into a unified result.
report   Projects final output into a stable report contract.

Utility Nodes

Node          Purpose
chunk         Splits generated text for downstream extraction.
claims        Extracts atomic claims from generated output.
dedup         Removes duplicate claims before scoring.
geval_steps   Resolves evaluation steps for GEval scoring.

Metric Nodes

Node        Purpose
relevance   Measures how directly claims answer the question.
grounding   Measures whether claims are supported by context.
redteam     Evaluates safety and policy risk using rubrics.
geval       Runs final rubric-driven LLM judging.
reference   Computes reference-based lexical metrics.
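Executing only the nodes a target needs amounts to walking a dependency graph backwards from that target. The edges below are a plausible reading of the node summaries above, not a confirmed graph:

```python
# Hypothetical dependency edges inferred from the node summaries;
# nexa-gauge's real graph may differ.
DEPS = {
    "scan": [],
    "chunk": ["scan"],
    "claims": ["chunk"],
    "dedup": ["claims"],
    "relevance": ["dedup"],
    "grounding": ["dedup"],
    "redteam": ["scan"],
    "geval_steps": ["scan"],
    "geval": ["geval_steps"],
    "reference": ["scan"],
}

def required_nodes(target: str, deps=DEPS) -> list:
    # Depth-first walk from the target; emits each node's dependencies
    # before the node itself, visiting every node at most once.
    order, seen = [], set()

    def visit(node):
        if node in seen:
            return
        seen.add(node)
        for dep in deps[node]:
            visit(dep)
        order.append(node)

    visit(target)
    return order

plan = required_nodes("grounding")
```

Under these assumed edges, targeting grounding plans only the scan → chunk → claims → dedup → grounding chain; unrelated branches like relevance or geval are never executed.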

Typical Workflow

```bash
# Estimate full evaluation cost for a dataset slice
nexagauge estimate eval --input sample.json --limit 100

# Run full evaluation and write per-case report files
nexagauge run eval --input sample.json --limit 100 --output-dir ./report
```

For iterative development, repeated runs on unchanged inputs and routing should show high cache reuse and lower incremental latency.