Self-Hosted Endpoints (llama.cpp and OpenAI-Compatible APIs)
Overview
Use `--host-model-url` when your model is served by an OpenAI-compatible endpoint (local or remote). This avoids extra environment-variable exports and auto-fills nexa-gauge routing for all branch nodes.
```bash
nexagauge run eval \
--input sample.json \
--host-model-url http://localhost:8080/v1 \
--llm-concurrency 1 \
--max-in-flight 1
```
1) Install llama.cpp
macOS
```bash
brew install llama.cpp
```
Windows
```powershell
winget install llama.cpp
```
If winget does not have the package on your machine, install from the official llama.cpp releases.
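Either way, you can confirm the binary is on your PATH by printing its build info (a quick sanity check, assuming your llama.cpp build supports the standard `--version` flag):

```bash
# Prints llama.cpp version and build details if the install succeeded.
llama-server --version
```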
2) Run llama-server Locally
macOS / Linux
```bash
llama-server -hf bartowski/Qwen2.5-7B-Instruct-GGUF -c 4096 --port 8080
```
Windows (PowerShell)
```powershell
llama-server -hf bartowski/Qwen2.5-7B-Instruct-GGUF -c 4096 --port 8080
```
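If you already have a GGUF file on disk, you can serve it directly with `-m` instead of pulling from Hugging Face. A minimal sketch; the path below is hypothetical, so adjust it to your model:

```bash
# Serve a local GGUF file with the same context size and port as above.
llama-server -m ./models/qwen2.5-7b-instruct-q4_k_m.gguf -c 4096 --port 8080
```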
Server endpoint:
```text
http://localhost:8080/v1
```
3) nexa-gauge Run Scripts
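Before launching the run scripts, it can help to confirm the endpoint answers OpenAI-style requests. A minimal sanity check with curl:

```bash
# Should return a JSON payload listing the models the server has loaded.
curl http://localhost:8080/v1/models
```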
Grounding
```bash
nexagauge run grounding \
--input sample.json \
--host-model-url http://localhost:8080/v1 \
--llm-concurrency 1 \
--max-in-flight 1 \
--output-dir ./report-grounding
```
Relevance
```bash
nexagauge run relevance \
--input sample.json \
--host-model-url http://localhost:8080/v1 \
--llm-concurrency 1 \
--max-in-flight 1 \
--output-dir ./report-relevance
```
Full Eval
```bash
nexagauge run eval \
--input sample.json \
--host-model-url http://localhost:8080/v1 \
--llm-concurrency 1 \
--max-in-flight 1 \
--output-dir ./report
```
4) Recommended Local Arguments
Use this safe baseline first:
- `--llm-concurrency 1`
- `--max-in-flight 1`
Then scale up slowly after stability checks.
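For example, a cautious next step might double both values. A sketch with illustrative numbers, not a tuned recommendation:

```bash
# Scale up only after the baseline run completes without errors.
nexagauge run eval \
--input sample.json \
--host-model-url http://localhost:8080/v1 \
--llm-concurrency 2 \
--max-in-flight 2 \
--output-dir ./report
```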
5) Important Notes
- `--host-model-url` must point to an OpenAI-compatible base URL ending in `/v1`.
- For localhost endpoints (`localhost`, `127.0.0.1`, `::1`), nexa-gauge auto-uses `api_key=local`.
- For remote endpoints, pass `--host-model-api-key <key>` when auth is required (see the sketch below).
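A remote run might look like the following. The URL is a placeholder; substitute your provider's base URL, and export your key into the environment variable first:

```bash
# Hypothetical remote endpoint that requires authentication.
nexagauge run eval \
--input sample.json \
--host-model-url https://models.example.com/v1 \
--host-model-api-key "$MY_API_KEY" \
--output-dir ./report
```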