# CLI Quick Start

`nexagauge` is the command-line entry point for the nexa-gauge evaluation pipeline. It runs a selected branch of the DAG against a dataset, estimates its cost before execution, and manages the on-disk cache.

Use the CLI in three distinct modes:

| Command | Use it when you want to |
| --- | --- |
| `estimate` | Preview uncached cost before making billable LLM calls. |
| `run` | Execute an evaluation branch and optionally write report files. |
| `cache delete` | Inspect or clear cached node outputs. |

## 1. Estimate Cost

`estimate` previews likely spend for uncached work without executing billable LLM calls.

```bash
# Estimate cost for the eval branch
nexagauge estimate eval --input sample.json
```

```text
Estimate (target=eval, branch summary)
+-------------+--------------------+--------------------+---------+----------+-------------------+-----------------------+---------------+
| node_name   | model              | status             | cached  | uncached | uncached_eligible | uncached_eligible_pct | cost_estimate |
+-------------+--------------------+--------------------+---------+----------+-------------------+-----------------------+---------------+
| scan        | openai/gpt-4o-mini | skipped/ineligible |  0 / 17 |  17 / 17 |           17 / 17 |                100.0% |             - |
| chunk       | openai/gpt-4o-mini | zero_cost          |  0 / 17 |  17 / 17 |           17 / 17 |                100.0% |             - |
| claims      | openai/gpt-4o-mini | billable           |  0 / 17 |  17 / 17 |           17 / 17 |                100.0% |     $0.002040 |
| dedup       | openai/gpt-4o-mini | zero_cost          |  0 / 17 |  17 / 17 |           17 / 17 |                100.0% |             - |
| geval_steps | openai/gpt-4o-mini | billable           |  0 / 17 |  17 / 17 |            6 / 17 |                 35.3% |     $0.000090 |
| relevance   | openai/gpt-4o-mini | billable           |  0 / 17 |  17 / 17 |           16 / 17 |                 94.1% |     $0.000444 |
| grounding   | openai/gpt-4o-mini | billable           |  0 / 17 |  17 / 17 |            3 / 17 |                 17.6% |     $0.000103 |
| redteam     | openai/gpt-4o-mini | billable           |  0 / 17 |  17 / 17 |           17 / 17 |                100.0% |     $0.004442 |
| geval       | openai/gpt-4o-mini | billable           |  0 / 17 |  17 / 17 |            6 / 17 |                 35.3% |     $0.001064 |
| reference   | openai/gpt-4o-mini | zero_cost          |  0 / 17 |  17 / 17 |           10 / 17 |                 58.8% |             - |
| TOTAL       | -                  | -                  |       - |        - |                 - |                     - |     $0.008183 |
+-------------+--------------------+--------------------+---------+----------+-------------------+-----------------------+---------------+
```
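The `TOTAL` row is just the sum of the per-node `cost_estimate` values. As a sanity check, you can sum that column yourself with plain text processing over a captured excerpt of the table — this is an illustrative snippet, not a nexagauge feature, and it makes no LLM calls:

```shell
# Sum the cost_estimate column (field 9 when splitting on '|'),
# skipping the TOTAL row itself. The rows below are an excerpt of
# the estimate table above, embedded here for illustration.
rows='| claims  | openai/gpt-4o-mini | billable | 0 / 17 | 17 / 17 | 17 / 17 | 100.0% | $0.002040 |
| redteam | openai/gpt-4o-mini | billable | 0 / 17 | 17 / 17 | 17 / 17 | 100.0% | $0.004442 |
| TOTAL   | -                  | -        | -      | -       | -       | -      | $0.006482 |'
total=$(printf '%s\n' "$rows" |
  awk -F'|' '$9 ~ /\$/ && $2 !~ /TOTAL/ { gsub(/[$ ]/, "", $9); sum += $9 }
             END { printf "%.6f", sum }')
echo "computed total: \$$total"   # → computed total: $0.006482
```

The same one-liner works on the full table if you redirect `nexagauge estimate` output to a file first.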

## 2. Run `eval`

`run eval` executes the full evaluation branch, writes report files when `--output-dir` is provided, and prints aggregate metric summaries.

```bash
# Run the eval branch and write per-case report JSON
nexagauge run eval --input sample.json --output-dir ./report --limit 10
```

The `eval` target prints a high-level node summary first:

```text
eval metrics summary across 10 case(s)
+-----------+---------+--------+--------+-----------+--------------+--------+
| node      | metrics | scored | errors | avg_score | median_score | passed |
+-----------+---------+--------+--------+-----------+--------------+--------+
| grounding |       3 |      3 |      0 |    0.8667 |       1.0000 |    3/3 |
| geval     |       5 |      5 |      0 |    0.9215 |       0.9984 |    5/5 |
| relevance |      10 |     10 |      0 |    0.8000 |       1.0000 |   8/10 |
| reference |      30 |     30 |      0 |    0.1910 |       0.1356 |      - |
| redteam   |      20 |     20 |      0 |    0.8000 |       1.0000 |  16/20 |
+-----------+---------+--------+--------+-----------+--------------+--------+
```

It then prints the per-metric breakdown:

```text
eval metric breakdown across 10 case(s)
+-----------+----------------------------+---------+--------+--------+-----------+--------------+--------+
| node      | metric                     | metrics | scored | errors | avg_score | median_score | passed |
+-----------+----------------------------+---------+--------+--------+-----------+--------------+--------+
| grounding | grounding                  |       3 |      3 |      0 |    0.8667 |       1.0000 |    3/3 |
| geval     | diabetes_criterion_1       |       1 |      1 |      0 |    0.7332 |       0.7332 |    1/1 |
| geval     | diabetes_criterion_2       |       1 |      1 |      0 |    1.0000 |       1.0000 |    1/1 |
| geval     | photosynthesis_criterion_1 |       1 |      1 |      0 |    1.0000 |       1.0000 |    1/1 |
| geval     | transformer_criterion_1    |       1 |      1 |      0 |    0.8758 |       0.8758 |    1/1 |
| geval     | transformer_criterion_2    |       1 |      1 |      0 |    0.9984 |       0.9984 |    1/1 |
| relevance | answer_relevancy           |      10 |     10 |      0 |    0.8000 |       1.0000 |   8/10 |
| reference | bleu                       |       6 |      6 |      0 |    0.0687 |       0.0132 |      - |
| reference | meteor                     |       6 |      6 |      0 |    0.2784 |       0.2511 |      - |
| reference | rouge1                     |       6 |      6 |      0 |    0.2636 |       0.2066 |      - |
| reference | rouge2                     |       6 |      6 |      0 |    0.1233 |       0.0546 |      - |
| reference | rougeL                     |       6 |      6 |      0 |    0.2208 |       0.1498 |      - |
| redteam   | bias                       |      10 |     10 |      0 |    0.8000 |       1.0000 |   8/10 |
| redteam   | toxicity                   |      10 |     10 |      0 |    0.8000 |       1.0000 |   8/10 |
+-----------+----------------------------+---------+--------+--------+-----------+--------------+--------+

Done.  cases=10  succeeded=10  failed=0  executed_steps=120  cached_steps=0
```

Summary columns:

| Column | Meaning |
| --- | --- |
| `node` | Metric node that produced the scores. |
| `metric` | Specific metric name within a node; only present in the breakdown table. |
| `metrics` | Number of metric values included in the aggregate row. |
| `scored` | Number of metric values successfully scored. |
| `errors` | Number of metric values that failed during scoring. |
| `avg_score` | Average score for the row. |
| `median_score` | Median score for the row. |
| `passed` | Pass count for thresholded metrics; `-` means the metric does not emit pass/fail verdicts. |
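The `passed` column reports raw counts like `8/10`. If you want a percentage when scanning these tables, a tiny helper does the conversion — a hypothetical convenience function, not part of nexagauge:

```shell
# Turn a passed count like "8/10" into a percentage.
# Hypothetical helper for reading the summary tables; not a nexagauge command.
pass_rate() { awk -F/ '{ printf "%.1f%%", 100 * $1 / $2 }' <<<"$1"; }

pass_rate 8/10    # relevance row → 80.0%
```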

### Cache Effect After `run`

If you run `estimate` again after a `run`, cached nodes are reflected in the estimate summary:

```bash
# Re-estimate after some cache has been populated
nexagauge estimate eval --input sample.json
```

```text
Estimate (target=eval, branch summary)
+-------------+--------------------+--------------------+---------+----------+-------------------+-----------------------+---------------+
| node_name   | model              | status             | cached  | uncached | uncached_eligible | uncached_eligible_pct | cost_estimate |
+-------------+--------------------+--------------------+---------+----------+-------------------+-----------------------+---------------+
| scan        | openai/gpt-4o-mini | skipped/ineligible | 10 / 17 |   7 / 17 |            7 / 17 |                 41.2% |             - |
| chunk       | openai/gpt-4o-mini | zero_cost          | 10 / 17 |   7 / 17 |            7 / 17 |                 41.2% |             - |
| claims      | openai/gpt-4o-mini | billable           | 10 / 17 |   7 / 17 |            7 / 17 |                 41.2% |     $0.000792 |
| dedup       | openai/gpt-4o-mini | zero_cost          | 10 / 17 |   7 / 17 |            7 / 17 |                 41.2% |             - |
| geval_steps | openai/gpt-4o-mini | billable           | 10 / 17 |   7 / 17 |            3 / 17 |                 17.6% |     $0.000045 |
| relevance   | openai/gpt-4o-mini | billable           | 10 / 17 |   7 / 17 |            6 / 17 |                 35.3% |     $0.000168 |
| grounding   | openai/gpt-4o-mini | skipped/ineligible | 10 / 17 |   7 / 17 |            0 / 17 |                  0.0% |             - |
| redteam     | openai/gpt-4o-mini | billable           | 10 / 17 |   7 / 17 |            7 / 17 |                 41.2% |     $0.001789 |
| geval       | openai/gpt-4o-mini | billable           | 10 / 17 |   7 / 17 |            3 / 17 |                 17.6% |     $0.000583 |
| reference   | openai/gpt-4o-mini | zero_cost          | 10 / 17 |   7 / 17 |            4 / 17 |                 23.5% |             - |
| TOTAL       | -                  | -                  |       - |        - |                 - |                     - |     $0.003376 |
+-------------+--------------------+--------------------+---------+----------+-------------------+-----------------------+---------------+
```
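Comparing the two `TOTAL` rows shows how much projected spend the cache removes. The difference is plain arithmetic on the figures shown above:

```shell
# Compare the first estimate's TOTAL with the re-estimate after caching.
before=0.008183   # TOTAL before any cache
after=0.003376    # TOTAL with 10/17 cases cached
saving=$(awk -v b="$before" -v a="$after" \
  'BEGIN { printf "saved $%.6f (%.1f%% less)", b - a, 100 * (b - a) / b }')
echo "$saving"    # → saved $0.004807 (58.7% less)
```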

## 3. Cache Delete

`cache delete` removes cached node outputs. Use `--dry-run` first to preview what would be deleted.

```bash
# Preview cache cleanup first
nexagauge cache delete --dry-run
```

```text
Files: 126   Size: 227.2 KB
Delete 126 files (227.2 KB)? [y/N]: y
Freed 227.2 KB (126 files)
```