CLI Quick Start
`nexagauge` is the command-line entry point for the nexa-gauge evaluation pipeline. It runs a selected branch of the DAG against a dataset, estimates its cost before execution, and manages the on-disk cache.
Use the CLI in three distinct modes:
| Command | Use it when you want to |
|---|---|
| `estimate` | Preview uncached cost before making billable LLM calls. |
| `run` | Execute an evaluation branch and optionally write report files. |
| `cache delete` | Inspect or clear cached node outputs. |
1. Estimate Cost
`estimate` previews the likely spend for uncached work without executing billable LLM calls.
```shell
# Estimate cost for the eval branch
nexagauge estimate eval --input sample.json
```

```text
Estimate (target=eval, branch summary)
+-------------+--------------------+--------------------+---------+----------+-------------------+-----------------------+---------------+
| node_name   | model              | status             | cached  | uncached | uncached_eligible | uncached_eligible_pct | cost_estimate |
+-------------+--------------------+--------------------+---------+----------+-------------------+-----------------------+---------------+
| scan        | openai/gpt-4o-mini | skipped/ineligible | 0 / 17  | 17 / 17  | 17 / 17           | 100.0%                | -             |
| chunk       | openai/gpt-4o-mini | zero_cost          | 0 / 17  | 17 / 17  | 17 / 17           | 100.0%                | -             |
| claims      | openai/gpt-4o-mini | billable           | 0 / 17  | 17 / 17  | 17 / 17           | 100.0%                | $0.002040     |
| dedup       | openai/gpt-4o-mini | zero_cost          | 0 / 17  | 17 / 17  | 17 / 17           | 100.0%                | -             |
| geval_steps | openai/gpt-4o-mini | billable           | 0 / 17  | 17 / 17  | 6 / 17            | 35.3%                 | $0.000090     |
| relevance   | openai/gpt-4o-mini | billable           | 0 / 17  | 17 / 17  | 16 / 17           | 94.1%                 | $0.000444     |
| grounding   | openai/gpt-4o-mini | billable           | 0 / 17  | 17 / 17  | 3 / 17            | 17.6%                 | $0.000103     |
| redteam     | openai/gpt-4o-mini | billable           | 0 / 17  | 17 / 17  | 17 / 17           | 100.0%                | $0.004442     |
| geval       | openai/gpt-4o-mini | billable           | 0 / 17  | 17 / 17  | 6 / 17            | 35.3%                 | $0.001064     |
| reference   | openai/gpt-4o-mini | zero_cost          | 0 / 17  | 17 / 17  | 10 / 17           | 58.8%                 | -             |
| TOTAL       | -                  | -                  | -       | -        | -                 | -                     | $0.008183     |
+-------------+--------------------+--------------------+---------+----------+-------------------+-----------------------+---------------+
```

2. Run eval
`run eval` executes the full evaluation branch, writes report files when `--output-dir` is provided, and prints aggregate metric summaries.
```shell
# Run the eval branch and write per-case report JSON
nexagauge run eval --input sample.json --output-dir ./report --limit 10
```

The eval target prints a high-level node summary first:

```text
eval metrics summary across 10 case(s)
+-----------+---------+--------+--------+-----------+--------------+--------+
| node      | metrics | scored | errors | avg_score | median_score | passed |
+-----------+---------+--------+--------+-----------+--------------+--------+
| grounding | 3       | 3      | 0      | 0.8667    | 1.0000       | 3/3    |
| geval     | 5       | 5      | 0      | 0.9215    | 0.9984       | 5/5    |
| relevance | 10      | 10     | 0      | 0.8000    | 1.0000       | 8/10   |
| reference | 30      | 30     | 0      | 0.1910    | 0.1356       | -      |
| redteam   | 20      | 20     | 0      | 0.8000    | 1.0000       | 16/20  |
+-----------+---------+--------+--------+-----------+--------------+--------+
```

It then prints the per-metric breakdown:
```text
eval metric breakdown across 10 case(s)
+-----------+----------------------------+---------+--------+--------+-----------+--------------+--------+
| node      | metric                     | metrics | scored | errors | avg_score | median_score | passed |
+-----------+----------------------------+---------+--------+--------+-----------+--------------+--------+
| grounding | grounding                  | 3       | 3      | 0      | 0.8667    | 1.0000       | 3/3    |
| geval     | diabetes_criterion_1       | 1       | 1      | 0      | 0.7332    | 0.7332       | 1/1    |
| geval     | diabetes_criterion_2       | 1       | 1      | 0      | 1.0000    | 1.0000       | 1/1    |
| geval     | photosynthesis_criterion_1 | 1       | 1      | 0      | 1.0000    | 1.0000       | 1/1    |
| geval     | transformer_criterion_1    | 1       | 1      | 0      | 0.8758    | 0.8758       | 1/1    |
| geval     | transformer_criterion_2    | 1       | 1      | 0      | 0.9984    | 0.9984       | 1/1    |
| relevance | answer_relevancy           | 10      | 10     | 0      | 0.8000    | 1.0000       | 8/10   |
| reference | bleu                       | 6       | 6      | 0      | 0.0687    | 0.0132       | -      |
| reference | meteor                     | 6       | 6      | 0      | 0.2784    | 0.2511       | -      |
| reference | rouge1                     | 6       | 6      | 0      | 0.2636    | 0.2066       | -      |
| reference | rouge2                     | 6       | 6      | 0      | 0.1233    | 0.0546       | -      |
| reference | rougeL                     | 6       | 6      | 0      | 0.2208    | 0.1498       | -      |
+-----------+----------------------------+---------+--------+--------+-----------+--------------+--------+
Done. cases=10 succeeded=10 failed=0 executed_steps=120 cached_steps=0
```

Summary columns:
| Column | Meaning |
|---|---|
| `node` | Metric node that produced the scores. |
| `metric` | Specific metric name within a node; present only in the breakdown table. |
| `metrics` | Number of metric values included in the aggregate row. |
| `scored` | Number of metric values successfully scored. |
| `errors` | Number of metric values that failed during scoring. |
| `avg_score` | Average score for the row. |
| `median_score` | Median score for the row. |
| `passed` | Pass count for thresholded metrics; `-` means the metric does not emit pass/fail verdicts. |
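These aggregates are ordinary statistics over the per-case metric values. Below is a minimal sketch of the arithmetic, not nexa-gauge's actual implementation: the scores and the 0.5 pass threshold are hypothetical (real thresholds come from each node's configuration), though the sample values deliberately mirror the grounding row above. Reference-style metrics such as bleu and rougeL carry no threshold, which is why their passed column prints `-`.

```python
from statistics import mean, median

# Hypothetical metric values for one node; real values come from the report JSON.
scores = [1.0, 1.0, 0.6]
errors = 0                  # metric values that failed during scoring
THRESHOLD = 0.5             # assumed pass threshold; actual thresholds are node-specific

row = {
    "metrics": len(scores) + errors,  # values included in the aggregate row
    "scored": len(scores),            # values successfully scored
    "errors": errors,
    "avg_score": round(mean(scores), 4),
    "median_score": round(median(scores), 4),
    "passed": f"{sum(s >= THRESHOLD for s in scores)}/{len(scores)}",
}
print(row)  # avg 0.8667, median 1.0, passed 3/3, as in the grounding row
```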
Cache Effect After `run`
If you estimate again after a run, cached nodes are reflected in the estimate summary:
```shell
# Re-estimate after some cache has been populated
nexagauge estimate eval --input sample.json
```

```text
Estimate (target=eval, branch summary)
+-------------+--------------------+--------------------+---------+----------+-------------------+-----------------------+---------------+
| node_name   | model              | status             | cached  | uncached | uncached_eligible | uncached_eligible_pct | cost_estimate |
+-------------+--------------------+--------------------+---------+----------+-------------------+-----------------------+---------------+
| scan        | openai/gpt-4o-mini | skipped/ineligible | 10 / 17 | 7 / 17   | 7 / 17            | 41.2%                 | -             |
| chunk       | openai/gpt-4o-mini | zero_cost          | 10 / 17 | 7 / 17   | 7 / 17            | 41.2%                 | -             |
| claims      | openai/gpt-4o-mini | billable           | 10 / 17 | 7 / 17   | 7 / 17            | 41.2%                 | $0.000792     |
| dedup       | openai/gpt-4o-mini | zero_cost          | 10 / 17 | 7 / 17   | 7 / 17            | 41.2%                 | -             |
| geval_steps | openai/gpt-4o-mini | billable           | 10 / 17 | 7 / 17   | 3 / 17            | 17.6%                 | $0.000045     |
| relevance   | openai/gpt-4o-mini | billable           | 10 / 17 | 7 / 17   | 6 / 17            | 35.3%                 | $0.000168     |
| grounding   | openai/gpt-4o-mini | skipped/ineligible | 10 / 17 | 7 / 17   | 0 / 17            | 0.0%                  | -             |
| redteam     | openai/gpt-4o-mini | billable           | 10 / 17 | 7 / 17   | 7 / 17            | 41.2%                 | $0.001789     |
| geval       | openai/gpt-4o-mini | billable           | 10 / 17 | 7 / 17   | 3 / 17            | 17.6%                 | $0.000583     |
| reference   | openai/gpt-4o-mini | zero_cost          | 10 / 17 | 7 / 17   | 4 / 17            | 23.5%                 | -             |
| TOTAL       | -                  | -                  | -       | -        | -                 | -                     | $0.003376     |
+-------------+--------------------+--------------------+---------+----------+-------------------+-----------------------+---------------+
```

3. Cache Delete
`cache delete` removes cached node outputs. Use `--dry-run` first to preview what would be deleted.
```shell
# Preview cache cleanup first
nexagauge cache delete --dry-run
```

```text
Files: 126 Size: 227.2 KB
```

```shell
# Then perform the actual deletion
nexagauge cache delete
```

```text
Delete 126 files (227.2 KB)? [y/N]: y
Freed 227.2 KB (126 files)
```
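Finally, a note on reading the estimate tables earlier: the count columns appear to be plain ratios over the total case count (6 / 17 eligible prints as 35.3%), and the TOTAL cost is the sum of the billable rows. A minimal sketch of that arithmetic, using the claims row from the re-estimate and the billable costs from the first estimate; the variable names here are illustrative, not nexa-gauge internals:

```python
total = 17        # total cases in sample.json
cached = 10       # cases whose output for this node is already cached
eligible = 7      # uncached cases the node would actually bill for

uncached = total - cached
pct = round(100 * eligible / total, 1)  # the uncached_eligible_pct column
print(f"cached {cached} / {total}, uncached {uncached} / {total}, "
      f"eligible {eligible} / {total} ({pct}%)")

# Billable rows from the first estimate; they sum to the $0.008183 TOTAL row.
costs = [0.002040, 0.000090, 0.000444, 0.000103, 0.004442, 0.001064]
total_cost = round(sum(costs), 6)
print(f"total cost_estimate: ${total_cost:.6f}")
```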