# CLI Quick Start

`nexagauge` is the command-line entry point for the nexa-gauge evaluation pipeline. It runs a selected branch of the DAG against a dataset, estimates its cost before execution, and manages the on-disk cache.

Use the CLI in three distinct modes:

| Command | Use it when you want to |
| --- | --- |
| `estimate` | Preview uncached cost before making billable LLM calls. |
| `run` | Execute an evaluation branch and optionally write report files. |
| `cache delete` | Inspect or clear cached node outputs. |

## 1. Estimate Cost

`estimate` previews likely spend for uncached work without executing billable LLM calls.

```bash
# Estimate cost for the eval branch
nexagauge estimate eval --input sample.json
```

```text
Estimate (target=eval, branch summary)
+-------------+--------------------+--------------------+---------+----------+-------------------+-----------------------+---------------+
| node_name   | model              | status             | cached  | uncached | uncached_eligible | uncached_eligible_pct | cost_estimate |
+-------------+--------------------+--------------------+---------+----------+-------------------+-----------------------+---------------+
| scan        | openai/gpt-4o-mini | skipped/ineligible |  0 / 17 |  17 / 17 |           17 / 17 |                100.0% |             - |
| chunk       | openai/gpt-4o-mini | zero_cost          |  0 / 17 |  17 / 17 |           17 / 17 |                100.0% |             - |
| claims      | openai/gpt-4o-mini | billable           |  0 / 17 |  17 / 17 |           17 / 17 |                100.0% |     $0.002040 |
| dedup       | openai/gpt-4o-mini | zero_cost          |  0 / 17 |  17 / 17 |           17 / 17 |                100.0% |             - |
| geval_steps | openai/gpt-4o-mini | billable           |  0 / 17 |  17 / 17 |            6 / 17 |                 35.3% |     $0.000090 |
| relevance   | openai/gpt-4o-mini | billable           |  0 / 17 |  17 / 17 |           16 / 17 |                 94.1% |     $0.000444 |
| grounding   | openai/gpt-4o-mini | billable           |  0 / 17 |  17 / 17 |            3 / 17 |                 17.6% |     $0.000103 |
| redteam     | openai/gpt-4o-mini | billable           |  0 / 17 |  17 / 17 |           17 / 17 |                100.0% |     $0.004442 |
| geval       | openai/gpt-4o-mini | billable           |  0 / 17 |  17 / 17 |            6 / 17 |                 35.3% |     $0.001064 |
| reference   | openai/gpt-4o-mini | zero_cost          |  0 / 17 |  17 / 17 |           10 / 17 |                 58.8% |             - |
| TOTAL       | -                  | -                  |       - |        - |                 - |                     - |     $0.008183 |
+-------------+--------------------+--------------------+---------+----------+-------------------+-----------------------+---------------+
```
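The `TOTAL` row is just the sum of the per-node `cost_estimate` values. As a sanity check, you can sum that column yourself with plain text processing over a captured excerpt of the table — this is an illustrative snippet, not a nexagauge feature, and it makes no LLM calls:

```shell
# Sum the cost_estimate column (field 9 when splitting on '|'),
# skipping the TOTAL row itself. The rows below are an excerpt of
# the estimate table above, embedded here for illustration.
rows='| claims  | openai/gpt-4o-mini | billable | 0 / 17 | 17 / 17 | 17 / 17 | 100.0% | $0.002040 |
| redteam | openai/gpt-4o-mini | billable | 0 / 17 | 17 / 17 | 17 / 17 | 100.0% | $0.004442 |
| TOTAL   | -                  | -        | -      | -       | -       | -      | $0.006482 |'
total=$(printf '%s\n' "$rows" |
  awk -F'|' '$9 ~ /\$/ && $2 !~ /TOTAL/ { gsub(/[$ ]/, "", $9); sum += $9 }
             END { printf "%.6f", sum }')
echo "computed total: \$$total"   # → computed total: $0.006482
```

The same one-liner works on the full table if you redirect `nexagauge estimate` output to a file first.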

## 2. Run `eval`

`run eval` executes the full evaluation branch, writes report files when `--output-dir` is provided, and prints aggregate metric summaries.

```bash
# Run the eval branch and write per-case report JSON
nexagauge run eval --input sample.json --output-dir ./report --limit 10
```

The `eval` target prints a high-level node summary first:

```text
eval metrics summary across 10 case(s)
+-----------+---------+--------+--------+-----------+--------------+--------+
| node      | metrics | scored | errors | avg_score | median_score | passed |
+-----------+---------+--------+--------+-----------+--------------+--------+
| grounding |       3 |      3 |      0 |    0.8667 |       1.0000 |    3/3 |
| geval     |       5 |      5 |      0 |    0.9215 |       0.9984 |    5/5 |
| relevance |      10 |     10 |      0 |    0.8000 |       1.0000 |   8/10 |
| reference |      30 |     30 |      0 |    0.1910 |       0.1356 |      - |
| redteam   |      20 |     20 |      0 |    0.8000 |       1.0000 |  16/20 |
+-----------+---------+--------+--------+-----------+--------------+--------+
```

It then prints the per-metric breakdown:

```text
eval metric breakdown across 10 case(s)
+-----------+----------------------------+---------+--------+--------+-----------+--------------+--------+
| node      | metric                     | metrics | scored | errors | avg_score | median_score | passed |
+-----------+----------------------------+---------+--------+--------+-----------+--------------+--------+
| grounding | grounding                  |       3 |      3 |      0 |    0.8667 |       1.0000 |    3/3 |
| geval     | diabetes_criterion_1       |       1 |      1 |      0 |    0.7332 |       0.7332 |    1/1 |
| geval     | diabetes_criterion_2       |       1 |      1 |      0 |    1.0000 |       1.0000 |    1/1 |
| geval     | photosynthesis_criterion_1 |       1 |      1 |      0 |    1.0000 |       1.0000 |    1/1 |
| geval     | transformer_criterion_1    |       1 |      1 |      0 |    0.8758 |       0.8758 |    1/1 |
| geval     | transformer_criterion_2    |       1 |      1 |      0 |    0.9984 |       0.9984 |    1/1 |
| relevance | answer_relevancy           |      10 |     10 |      0 |    0.8000 |       1.0000 |   8/10 |
| reference | bleu                       |       6 |      6 |      0 |    0.0687 |       0.0132 |      - |
| reference | meteor                     |       6 |      6 |      0 |    0.2784 |       0.2511 |      - |
| reference | rouge1                     |       6 |      6 |      0 |    0.2636 |       0.2066 |      - |
| reference | rouge2                     |       6 |      6 |      0 |    0.1233 |       0.0546 |      - |
| reference | rougeL                     |       6 |      6 |      0 |    0.2208 |       0.1498 |      - |
| redteam   | bias                       |      10 |     10 |      0 |    0.8000 |       1.0000 |   8/10 |
| redteam   | toxicity                   |      10 |     10 |      0 |    0.8000 |       1.0000 |   8/10 |
+-----------+----------------------------+---------+--------+--------+-----------+--------------+--------+

Done.  cases=10  succeeded=10  failed=0  executed_steps=120  cached_steps=0
```

Summary columns:

| Column | Meaning |
| --- | --- |
| `node` | Metric node that produced the scores. |
| `metric` | Specific metric name within a node; only present in the breakdown table. |
| `metrics` | Number of metric values included in the aggregate row. |
| `scored` | Number of metric values successfully scored. |
| `errors` | Number of metric values that failed during scoring. |
| `avg_score` | Average score for the row. |
| `median_score` | Median score for the row. |
| `passed` | Pass count for thresholded metrics; `-` means the metric does not emit pass/fail verdicts. |
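The `passed` column reports raw counts like `8/10`. If you want a percentage when scanning these tables, a tiny helper does the conversion — a hypothetical convenience function, not part of nexagauge:

```shell
# Turn a passed count like "8/10" into a percentage.
# Hypothetical helper for reading the summary tables; not a nexagauge command.
pass_rate() { awk -F/ '{ printf "%.1f%%", 100 * $1 / $2 }' <<<"$1"; }

pass_rate 8/10    # relevance row → 80.0%
```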

### Cache Effect After `run`

If you run `estimate` again after a `run`, cached nodes are reflected in the estimate summary:

```bash
# Re-estimate after some cache has been populated
nexagauge estimate eval --input sample.json
```

```text
Estimate (target=eval, branch summary)
+-------------+--------------------+--------------------+---------+----------+-------------------+-----------------------+---------------+
| node_name   | model              | status             | cached  | uncached | uncached_eligible | uncached_eligible_pct | cost_estimate |
+-------------+--------------------+--------------------+---------+----------+-------------------+-----------------------+---------------+
| scan        | openai/gpt-4o-mini | skipped/ineligible | 10 / 17 |   7 / 17 |            7 / 17 |                 41.2% |             - |
| chunk       | openai/gpt-4o-mini | zero_cost          | 10 / 17 |   7 / 17 |            7 / 17 |                 41.2% |             - |
| claims      | openai/gpt-4o-mini | billable           | 10 / 17 |   7 / 17 |            7 / 17 |                 41.2% |     $0.000792 |
| dedup       | openai/gpt-4o-mini | zero_cost          | 10 / 17 |   7 / 17 |            7 / 17 |                 41.2% |             - |
| geval_steps | openai/gpt-4o-mini | billable           | 10 / 17 |   7 / 17 |            3 / 17 |                 17.6% |     $0.000045 |
| relevance   | openai/gpt-4o-mini | billable           | 10 / 17 |   7 / 17 |            6 / 17 |                 35.3% |     $0.000168 |
| grounding   | openai/gpt-4o-mini | skipped/ineligible | 10 / 17 |   7 / 17 |            0 / 17 |                  0.0% |             - |
| redteam     | openai/gpt-4o-mini | billable           | 10 / 17 |   7 / 17 |            7 / 17 |                 41.2% |     $0.001789 |
| geval       | openai/gpt-4o-mini | billable           | 10 / 17 |   7 / 17 |            3 / 17 |                 17.6% |     $0.000583 |
| reference   | openai/gpt-4o-mini | zero_cost          | 10 / 17 |   7 / 17 |            4 / 17 |                 23.5% |             - |
| TOTAL       | -                  | -                  |       - |        - |                 - |                     - |     $0.003376 |
+-------------+--------------------+--------------------+---------+----------+-------------------+-----------------------+---------------+
```
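Comparing the two `TOTAL` rows shows how much projected spend the cache removes. The difference is plain arithmetic on the figures shown above:

```shell
# Compare the first estimate's TOTAL with the re-estimate after caching.
before=0.008183   # TOTAL before any cache
after=0.003376    # TOTAL with 10/17 cases cached
saving=$(awk -v b="$before" -v a="$after" \
  'BEGIN { printf "saved $%.6f (%.1f%% less)", b - a, 100 * (b - a) / b }')
echo "$saving"    # → saved $0.004807 (58.7% less)
```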

## 3. Cache Delete

`cache delete` removes cached node outputs. Use `--dry-run` first to preview what would be deleted.

```bash
# Preview cache cleanup first
nexagauge cache delete --dry-run
```

```text
Files: 126   Size: 227.2 KB
Delete 126 files (227.2 KB)? [y/N]: y
Freed 227.2 KB (126 files)
```