Self-Hosted Endpoints (llama.cpp and OpenAI-Compatible APIs)

Overview

Use --host-model-url when your model is served by an OpenAI-compatible endpoint (local or remote).

This avoids extra environment-variable exports and automatically fills in nexa-gauge routing for all branch nodes.

bash
nexagauge run eval \
  --input sample.json \
  --host-model-url http://localhost:8080/v1 \
  --llm-concurrency 1 \
  --max-in-flight 1

1) Install llama.cpp

macOS

bash
brew install llama.cpp

Windows

powershell
winget install llama.cpp

If winget cannot find the package on your machine, install it from the official llama.cpp releases page instead.
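
To confirm the binaries are on your PATH, you can print the version. This assumes your llama.cpp build includes the standard --version flag; older builds may not expose it:

bash
# Should print version and build info, then exit
llama-server --version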

2) Run llama-server Locally

macOS / Linux

bash
llama-server -hf bartowski/Qwen2.5-7B-Instruct-GGUF -c 4096 --port 8080

Windows (PowerShell)

powershell
llama-server -hf bartowski/Qwen2.5-7B-Instruct-GGUF -c 4096 --port 8080

Server endpoint:

text
http://localhost:8080/v1
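
Before pointing nexa-gauge at the server, it can help to sanity-check the endpoint with curl. The checks below assume llama-server's standard OpenAI-compatible routes; the model name in the request body is illustrative, since a single-model llama-server generally ignores it:

bash
# List the loaded model(s); expect a JSON response with a "data" array
curl http://localhost:8080/v1/models

# Minimal chat-completion smoke test
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen2.5-7B-Instruct", "messages": [{"role": "user", "content": "Say hello."}]}'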

3) nexa-gauge Run Scripts

Grounding

bash
nexagauge run grounding \
  --input sample.json \
  --host-model-url http://localhost:8080/v1 \
  --llm-concurrency 1 \
  --max-in-flight 1 \
  --output-dir ./report-grounding

Relevance

bash
nexagauge run relevance \
  --input sample.json \
  --host-model-url http://localhost:8080/v1 \
  --llm-concurrency 1 \
  --max-in-flight 1 \
  --output-dir ./report-relevance

Full Eval

bash
nexagauge run eval \
  --input sample.json \
  --host-model-url http://localhost:8080/v1 \
  --llm-concurrency 1 \
  --max-in-flight 1 \
  --output-dir ./report

4) Recommended Local Arguments

Use this safe baseline first:

  • --llm-concurrency 1
  • --max-in-flight 1

Once runs are stable, scale these values up gradually, as in the example below.
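
For instance, a cautiously scaled-up run might look like this (the values of 2 are illustrative, not tuned recommendations; watch server load and error rates as you raise them):

bash
nexagauge run eval \
  --input sample.json \
  --host-model-url http://localhost:8080/v1 \
  --llm-concurrency 2 \
  --max-in-flight 2 \
  --output-dir ./report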

5) Important Notes

  • --host-model-url must point to an OpenAI-compatible base URL ending in /v1.
  • For localhost endpoints (localhost, 127.0.0.1, ::1), nexa-gauge automatically sets api_key=local, so no key flag is needed.
  • For remote endpoints, pass --host-model-api-key <key> when authentication is required (see the example below).
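
A run against an authenticated remote endpoint might look like the sketch below, where the URL and the MY_API_KEY variable are placeholders for your own values:

bash
nexagauge run eval \
  --input sample.json \
  --host-model-url https://models.example.com/v1 \
  --host-model-api-key "$MY_API_KEY" \
  --llm-concurrency 1 \
  --max-in-flight 1 \
  --output-dir ./report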