Self-Hosted Endpoints (llama.cpp and OpenAI-Compatible APIs)

Overview

Use --host-model-url when your model is served by an OpenAI-compatible endpoint (local or remote).

This avoids extra environment-variable exports and automatically fills in nexa-gauge routing for all branch nodes.

bash
nexagauge run eval \
  --input sample.json \
  --host-model-url http://localhost:8080/v1 \
  --llm-concurrency 1 \
  --max-in-flight 1

1) Install llama.cpp

macOS

bash
brew install llama.cpp

Windows

powershell
winget install llama.cpp

If winget cannot find the package on your machine, install it from the official llama.cpp releases page instead.
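
To confirm the binaries are on your PATH, you can print the version. This assumes your llama.cpp build includes the standard --version flag; older builds may not expose it:

bash
# Should print version and build info, then exit
llama-server --version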

2) Run llama-server Locally

macOS / Linux

bash
llama-server -hf bartowski/Qwen2.5-7B-Instruct-GGUF -c 4096 --port 8080

Windows (PowerShell)

powershell
llama-server -hf bartowski/Qwen2.5-7B-Instruct-GGUF -c 4096 --port 8080

Server endpoint:

text
http://localhost:8080/v1
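
Before pointing nexa-gauge at the server, it can help to sanity-check the endpoint with curl. The checks below assume llama-server's standard OpenAI-compatible routes; the model name in the request body is illustrative, since a single-model llama-server generally ignores it:

bash
# List the loaded model(s); expect a JSON response with a "data" array
curl http://localhost:8080/v1/models

# Minimal chat-completion smoke test
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen2.5-7B-Instruct", "messages": [{"role": "user", "content": "Say hello."}]}'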

3) nexa-gauge Run Scripts

Grounding

bash
nexagauge run grounding \
  --input sample.json \
  --host-model-url http://localhost:8080/v1 \
  --llm-concurrency 1 \
  --max-in-flight 1 \
  --output-dir ./report-grounding

Relevance

bash
nexagauge run relevance \
  --input sample.json \
  --host-model-url http://localhost:8080/v1 \
  --llm-concurrency 1 \
  --max-in-flight 1 \
  --output-dir ./report-relevance

Full Eval

bash
nexagauge run eval \
  --input sample.json \
  --host-model-url http://localhost:8080/v1 \
  --llm-concurrency 1 \
  --max-in-flight 1 \
  --output-dir ./report

4) Recommended Local Arguments

Use this safe baseline first:

  • --llm-concurrency 1
  • --max-in-flight 1

Once runs are stable, scale these values up gradually, as in the example below.
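
For instance, a cautiously scaled-up run might look like this (the values of 2 are illustrative, not tuned recommendations; watch server load and error rates as you raise them):

bash
nexagauge run eval \
  --input sample.json \
  --host-model-url http://localhost:8080/v1 \
  --llm-concurrency 2 \
  --max-in-flight 2 \
  --output-dir ./report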

5) Important Notes

  • --host-model-url must point to an OpenAI-compatible base URL ending in /v1.
  • For localhost endpoints (localhost, 127.0.0.1, ::1), nexa-gauge automatically sets api_key=local, so no key flag is needed.
  • For remote endpoints, pass --host-model-api-key <key> when authentication is required (see the example below).
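
A run against an authenticated remote endpoint might look like the sketch below, where the URL and the MY_API_KEY variable are placeholders for your own values:

bash
nexagauge run eval \
  --input sample.json \
  --host-model-url https://models.example.com/v1 \
  --host-model-api-key "$MY_API_KEY" \
  --llm-concurrency 1 \
  --max-in-flight 1 \
  --output-dir ./report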