Client API Reference¶

The LlamaCppClient provides a comprehensive, type-safe Python client for all llama.cpp server endpoints. It supports both OpenAI-compatible APIs and native llama.cpp endpoints, with structured response dataclasses and optional SSE streaming.

Module: llamatelemetry.api.client

LlamaCppClient¶

Constructor¶

class LlamaCppClient:
    def __init__(
        self,
        base_url: str = "http://127.0.0.1:8080",
        api_key: Optional[str] = None,
        timeout: float = 600.0,
        verify_ssl: bool = True,
    )

Parameter	Type	Default	Description
`base_url`	`str`	`"http://127.0.0.1:8080"`	Server base URL
`api_key`	`Optional[str]`	`None`	API key for Bearer authentication
`timeout`	`float`	`600.0`	Request timeout in seconds
`verify_ssl`	`bool`	`True`	Verify SSL certificates

Sub-API Properties¶

Property	Type	Description
`client.chat`	`ChatCompletionsAPI`	OpenAI-compatible chat completions (`/v1/chat/completions`)
`client.embeddings`	`EmbeddingsClientAPI`	Embeddings API (`/v1/embeddings`)
`client.models`	`ModelsClientAPI`	Model management (`/v1/models`)
`client.slots`	`SlotsClientAPI`	Slot management (`/slots`)
`client.lora`	`LoraClientAPI`	LoRA adapter management (`/lora-adapters`)

from llamatelemetry.api.client import LlamaCppClient

client = LlamaCppClient("http://localhost:8080")

Health and Server Endpoints¶

health¶

def health(self) -> HealthStatus

Check server health. Returns a HealthStatus with status ("ok", "loading", or an error value), slots_idle, and slots_processing.

is_ready¶

def is_ready(self) -> bool

Returns True if the server status is "ok".

wait_until_ready¶

def wait_until_ready(self, timeout: float = 60.0, poll_interval: float = 1.0) -> bool

Block until the server is ready or the timeout elapses, polling every poll_interval seconds. Returns True if the server became ready within the timeout.
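The polling behavior described above can be pictured with a minimal sketch. `wait_for` is a hypothetical helper, not the library's implementation; it assumes that connection failures surface as `OSError` subclasses and should count as "not ready yet".

```python
import time
from typing import Callable


def wait_for(check: Callable[[], bool], timeout: float = 60.0,
             poll_interval: float = 1.0) -> bool:
    """Poll `check` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            if check():
                return True
        except OSError:
            # Assumption: a refused or dropped connection means "not ready yet".
            pass
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        time.sleep(min(poll_interval, remaining))
```

With a real client this could be called as `wait_for(client.is_ready, timeout=30.0)`.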

props¶

def props(self) -> Dict[str, Any]

Get server global properties from /props. Returns a dictionary with keys including default_generation_settings, total_slots, model_path, chat_template, modalities, and is_sleeping.

set_props¶

def set_props(self, **kwargs) -> Dict[str, Any]

Set server global properties. Requires the --props server flag.

metrics¶

def metrics(self) -> str

Get Prometheus-compatible metrics text from /metrics. Requires the --metrics server flag.
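Because metrics() returns plain text, callers usually need to parse it before use. The sketch below handles only simple unlabeled samples; `parse_prometheus` is a hypothetical helper, and the metric names in `SAMPLE` are illustrative rather than a guaranteed list of what the server exports.

```python
from typing import Dict


def parse_prometheus(text: str) -> Dict[str, float]:
    """Parse unlabeled Prometheus samples into {metric_name: value}."""
    samples: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        parts = line.split()
        if len(parts) < 2:
            continue
        try:
            samples[parts[0]] = float(parts[1])
        except ValueError:
            continue  # not a numeric sample; ignore it
    return samples


# Illustrative input; real output comes from client.metrics().
SAMPLE = """\
# HELP llamacpp:prompt_tokens_total Number of prompt tokens processed.
# TYPE llamacpp:prompt_tokens_total counter
llamacpp:prompt_tokens_total 128
llamacpp:tokens_predicted_total 512
"""
```

Usage would look like `parse_prometheus(client.metrics())`.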

Chat Completion¶

chat_completion (convenience)¶

def chat_completion(
    self,
    messages: List[Dict[str, Any]],
    **kwargs,
) -> Union[CompletionResponse, Iterator[Dict[str, Any]]]

Convenience wrapper that delegates to client.chat.completions.create().

Parameter	Type	Default	Description
`messages`	`List[Dict[str, Any]]`	required	Chat messages (OpenAI format)
`**kwargs`			All parameters supported by `ChatCompletionsAPI.create()`

response = client.chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is CUDA?"},
    ],
    max_tokens=200,
    temperature=0.7,
)
print(response.choices[0].message.content)

chat.completions.create¶

def create(
    self,
    messages: List[Dict[str, Any]],
    model: str = "gpt-codex-5.3",
    max_tokens: Optional[int] = None,
    temperature: float = 1.0,
    top_p: float = 1.0,
    n: int = 1,
    stream: bool = False,
    stop: Optional[Union[str, List[str]]] = None,
    presence_penalty: float = 0.0,
    frequency_penalty: float = 0.0,
    logit_bias: Optional[Dict[str, float]] = None,
    user: Optional[str] = None,
    response_format: Optional[Dict[str, Any]] = None,
    tools: Optional[List[Dict[str, Any]]] = None,
    tool_choice: Optional[Union[str, Dict[str, Any]]] = None,
    seed: Optional[int] = None,
    # llama.cpp-specific extensions
    mirostat: int = 0,
    mirostat_tau: float = 5.0,
    mirostat_eta: float = 0.1,
    grammar: Optional[str] = None,
    min_p: float = 0.05,
    top_k: int = 40,
    repeat_penalty: float = 1.1,
    **kwargs,
) -> Union[CompletionResponse, Iterator[Dict[str, Any]]]

Full OpenAI-compatible chat completion endpoint with llama.cpp extensions.

Parameter	Type	Default	Description
`messages`	`List[Dict]`	required	Chat messages with `role` and `content`
`model`	`str`	`"gpt-codex-5.3"`	Model identifier
`max_tokens`	`Optional[int]`	`None`	Maximum tokens to generate
`temperature`	`float`	`1.0`	Sampling temperature (0.0-2.0)
`top_p`	`float`	`1.0`	Nucleus sampling (1.0 = disabled)
`n`	`int`	`1`	Number of completions to generate
`stream`	`bool`	`False`	Enable SSE streaming
`stop`	`Optional[Union[str, List[str]]]`	`None`	Stop sequences
`presence_penalty`	`float`	`0.0`	Presence penalty
`frequency_penalty`	`float`	`0.0`	Frequency penalty
`logit_bias`	`Optional[Dict[str, float]]`	`None`	Token logit biases
`user`	`Optional[str]`	`None`	End-user identifier (OpenAI-compatible field)
`response_format`	`Optional[Dict]`	`None`	Response format (`json_object`, `json_schema`)
`tools`	`Optional[List[Dict]]`	`None`	Tool/function definitions for function calling
`tool_choice`	`Optional[Union[str, Dict]]`	`None`	Tool selection mode
`seed`	`Optional[int]`	`None`	RNG seed for reproducibility
`grammar`	`Optional[str]`	`None`	GBNF grammar for constrained generation (llama.cpp)
`min_p`	`float`	`0.05`	Min-p sampling (llama.cpp)
`top_k`	`int`	`40`	Top-k sampling (llama.cpp)
`repeat_penalty`	`float`	`1.1`	Repeat penalty (llama.cpp)
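`mirostat`	`int`	`0`	Mirostat mode: 0=off, 1=v1, 2=v2 (llama.cpp)
`mirostat_tau`	`float`	`5.0`	Mirostat target entropy (llama.cpp)
`mirostat_eta`	`float`	`0.1`	Mirostat learning rate (llama.cpp)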

Returns: a CompletionResponse, or an iterator of chunk dictionaries when stream=True.

# Standard chat
response = client.chat.completions.create(
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=100,
    temperature=0.7,
)
print(response.choices[0].message.content)

# Streaming
for chunk in client.chat.completions.create(
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True,
    max_tokens=500,
):
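    # Guard against chunks with no choices and deltas whose content is None.
    choices = chunk.get("choices") or [{}]
    delta = choices[0].get("delta") or {}
    print(delta.get("content") or "", end="", flush=True)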

# Structured output with JSON schema
response = client.chat.completions.create(
    messages=[{"role": "user", "content": "List 3 colors"}],
    response_format={"type": "json_schema", "json_schema": {
        "name": "colors", "schema": {"type": "object", "properties": {
            "colors": {"type": "array", "items": {"type": "string"}}
        }}
    }},
)
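When stream=True, create() yields chunk dictionaries rather than a CompletionResponse, so the full reply has to be assembled by the caller. The following is a minimal sketch that assumes OpenAI-style chunks (`choices[0].delta.content`); `collect_stream_text` is a hypothetical helper, not part of the client.

```python
from typing import Any, Dict, Iterable


def collect_stream_text(chunks: Iterable[Dict[str, Any]]) -> str:
    """Concatenate the content deltas of a streamed chat completion."""
    parts = []
    for chunk in chunks:
        choices = chunk.get("choices") or []
        if not choices:
            continue  # e.g. a trailing chunk that carries no choices
        delta = choices[0].get("delta") or {}
        content = delta.get("content")
        if content:  # skips None and empty strings
            parts.append(content)
    return "".join(parts)
```

It could be used as `text = collect_stream_text(client.chat.completions.create(messages=..., stream=True))`.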

Native Completion¶

complete¶

def complete(
    self,
    prompt: Union[str, List[Union[str, int]]],
    n_predict: int = -1,
    temperature: float = 0.8,
    top_k: int = 40,
    top_p: float = 0.95,
    min_p: float = 0.05,
    repeat_penalty: float = 1.1,
    repeat_last_n: int = 64,
    presence_penalty: float = 0.0,
    frequency_penalty: float = 0.0,
    mirostat: int = 0,
    mirostat_tau: float = 5.0,
    mirostat_eta: float = 0.1,
    grammar: Optional[str] = None,
    json_schema: Optional[Dict[str, Any]] = None,
    seed: int = -1,
    stop: Optional[List[str]] = None,
    stream: bool = False,
    cache_prompt: bool = True,
    n_probs: int = 0,
    samplers: Optional[List[str]] = None,
    dry_multiplier: float = 0.0,
    dry_base: float = 1.75,
    dry_allowed_length: int = 2,
    dry_penalty_last_n: int = -1,
    xtc_probability: float = 0.0,
    xtc_threshold: float = 0.1,
    dynatemp_range: float = 0.0,
    dynatemp_exponent: float = 1.0,
    typical_p: float = 1.0,
    id_slot: int = -1,
    return_tokens: bool = False,
    **kwargs,
) -> Union[CompletionResponse, Iterator[Dict[str, Any]]]

Native llama.cpp completion endpoint (/completion) with full access to all sampling parameters.

Parameter	Type	Default	Description
`prompt`	`Union[str, List]`	required	Input prompt (string or token array)
`n_predict`	`int`	`-1`	Max tokens to generate (-1 = unlimited)
`temperature`	`float`	`0.8`	Sampling temperature
`top_k`	`int`	`40`	Top-k sampling (0 = disabled)
`top_p`	`float`	`0.95`	Nucleus sampling (1.0 = disabled)
`min_p`	`float`	`0.05`	Min-p sampling (0.0 = disabled)
`repeat_penalty`	`float`	`1.1`	Repetition penalty
`grammar`	`Optional[str]`	`None`	GBNF grammar for constrained generation
`json_schema`	`Optional[Dict]`	`None`	JSON schema for structured output
`seed`	`int`	`-1`	RNG seed (-1 = random)
`stream`	`bool`	`False`	Enable streaming
`cache_prompt`	`bool`	`True`	Reuse KV cache from previous request
`n_probs`	`int`	`0`	Return top N token probabilities
`samplers`	`Optional[List[str]]`	`None`	Custom sampler order
`dry_multiplier`	`float`	`0.0`	DRY sampling multiplier (0 = disabled)
`xtc_probability`	`float`	`0.0`	XTC sampling probability (0 = disabled)
`dynatemp_range`	`float`	`0.0`	Dynamic temperature range (0 = disabled)
`typical_p`	`float`	`1.0`	Locally typical sampling (1.0 = disabled)
`id_slot`	`int`	`-1`	Specific slot ID (-1 = auto)
`return_tokens`	`bool`	`False`	Return raw token IDs

response = client.complete(
    prompt="The capital of France is",
    n_predict=50,
    temperature=0.7,
)
print(response.choices[0].text)
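The grammar parameter restricts output to strings that match a llama.cpp grammar. As a hedged illustration, the sketch below constrains the answer to "yes" or "no". `ask_yes_no` and the prompt wording are assumptions made for this example; `client` is any object exposing complete() as documented above.

```python
# A one-rule GBNF grammar: the output must be exactly "yes" or "no".
YES_NO_GRAMMAR = 'root ::= "yes" | "no"'


def ask_yes_no(client, question: str) -> bool:
    """Ask a question and force a yes/no answer via grammar-constrained decoding."""
    response = client.complete(
        prompt=f"Question: {question}\nAnswer (yes or no): ",
        grammar=YES_NO_GRAMMAR,
        n_predict=4,       # the answer needs only a few tokens
        temperature=0.0,   # low temperature for a repeatable answer
    )
    return response.choices[0].text.strip() == "yes"
```

The grammar guarantees the shape of the answer, not its correctness; the model still decides which branch to take.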

simple_completion¶

def simple_completion(self, prompt: str, **kwargs) -> Union[str, CompletionResponse, Iterator]

Convenience wrapper that returns the generated text string when possible, falling back to the full CompletionResponse object.

Embeddings¶

embeddings.create (OpenAI-compatible)¶

def create(
    self,
    input: Union[str, List[str]],
    model: str = "text-embedding-ada-002",
    encoding_format: str = "float",
    dimensions: Optional[int] = None,
) -> EmbeddingsResponse

Parameter	Type	Default	Description
`input`	`Union[str, List[str]]`	required	Text(s) to embed
`model`	`str`	`"text-embedding-ada-002"`	Model identifier
`encoding_format`	`str`	`"float"`	Output format (`float` or `base64`)
`dimensions`	`Optional[int]`	`None`	Embedding dimensions

Returns: EmbeddingsResponse with data (list of EmbeddingData), model, and usage.

embed (native)¶

def embed(
    self,
    content: Union[str, List[str]],
    embd_normalize: int = 2,
) -> List[List[float]]

Native embedding endpoint (/embedding). Returns raw embedding vectors. The embd_normalize argument selects the normalization: -1 = none, 0 = max absolute, 1 = taxicab (L1), 2 = Euclidean (L2).
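A common use of the returned vectors is similarity comparison. The sketch below computes cosine similarity in pure Python; `cosine_similarity` is a hypothetical helper. With embd_normalize=2 the vectors should already have unit length, in which case the result reduces to the dot product.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # similarity with a zero vector is undefined; report 0.0
    return dot / (norm_a * norm_b)
```

For example, `vecs = client.embed(["cat", "kitten"])` followed by `cosine_similarity(vecs[0], vecs[1])`.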

Tokenization¶

tokenize¶

def tokenize(
    self,
    content: str,
    add_special: bool = False,
    parse_special: bool = True,
    with_pieces: bool = False,
) -> TokenizeResponse

Parameter	Type	Default	Description
`content`	`str`	required	Text to tokenize
`add_special`	`bool`	`False`	Add BOS/EOS tokens
`parse_special`	`bool`	`True`	Parse special token syntax
`with_pieces`	`bool`	`False`	Return token string pieces alongside IDs

Returns: TokenizeResponse with a tokens list.

detokenize¶

def detokenize(self, tokens: List[int]) -> str

Convert token IDs back to text. Returns the decoded string.

tokens = client.tokenize("Hello, world!")
print(tokens.tokens)  # e.g. [15496, 11, 995, 0]; actual IDs depend on the model's tokenizer
text = client.detokenize(tokens.tokens)
print(text)  # "Hello, world!"
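tokenize() and detokenize() can be combined to keep a prompt within a token budget. This is a minimal sketch; `truncate_to_tokens` is a hypothetical helper, and it assumes the default with_pieces=False so that tokens is a plain list of integer IDs.

```python
def truncate_to_tokens(client, text: str, max_tokens: int) -> str:
    """Return `text` cut down to at most `max_tokens` tokens."""
    if max_tokens < 0:
        raise ValueError("max_tokens must be non-negative")
    ids = client.tokenize(text).tokens
    if len(ids) <= max_tokens:
        return text  # already within budget; avoid a lossy round trip
    # Keep the first max_tokens IDs and decode them back to text.
    return client.detokenize(ids[:max_tokens])
```

Truncating at a token boundary may split a word, so callers that need clean text might trim back to the last whitespace afterwards.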

Reranking¶

rerank¶

def rerank(
    self,
    query: str,
    documents: List[str],
    top_n: Optional[int] = None,
) -> RerankResponse

Rerank documents by relevance to a query. Requires a reranker model loaded with the --embedding and --pooling rank server flags.

Parameter	Type	Default	Description
`query`	`str`	required	Query string
`documents`	`List[str]`	required	Documents to rank
`top_n`	`Optional[int]`	`None`	Return only top N results

Returns: RerankResponse with results (list of RerankResult with index, relevance_score, document).

results = client.rerank(
    query="What is a panda?",
    documents=["A panda is a bear", "Hello world", "Pandas eat bamboo"],
    top_n=2,
)
for r in results.results:
    print(f"  [{r.index}] score={r.relevance_score:.3f}")
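Each result carries an index into the original documents list, so the ranked texts can be recovered without relying on the document field. The sketch below is illustrative; `ranked_documents` is a hypothetical helper that works on any objects with index and relevance_score attributes.

```python
from typing import Any, Iterable, List, Sequence, Tuple


def ranked_documents(results: Iterable[Any],
                     documents: Sequence[str]) -> List[Tuple[float, str]]:
    """Pair each score with its source document, highest score first."""
    pairs = [(r.relevance_score, documents[r.index]) for r in results]
    pairs.sort(key=lambda pair: pair[0], reverse=True)
    return pairs
```

For the example above this would be called as `ranked_documents(results.results, documents)`, where `documents` is the list passed to rerank().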

Code Infill¶

infill¶

def infill(
    self,
    input_prefix: str,
    input_suffix: str,
    input_extra: Optional[List[Dict[str, str]]] = None,
    prompt: Optional[str] = None,
    stream: bool = False,
    **kwargs,
) -> Union[CompletionResponse, Iterator[Dict[str, Any]]]

Fill-in-the-middle code completion. Requires a model that supports FIM tokens.

Parameter	Type	Default	Description
`input_prefix`	`str`	required	Code before the cursor
`input_suffix`	`str`	required	Code after the cursor
`input_extra`	`Optional[List[Dict]]`	`None`	Additional context files
`prompt`	`Optional[str]`	`None`	Text added after FIM_MID token
`stream`	`bool`	`False`	Enable streaming
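Editors typically know the cursor position rather than the prefix and suffix strings. As a sketch, the helper below splits a source string at a 0-based line and column; `split_at_cursor` is a hypothetical name and not part of the client.

```python
from typing import Tuple


def split_at_cursor(source: str, line: int, column: int) -> Tuple[str, str]:
    """Split `source` into (prefix, suffix) at a 0-based line and column."""
    lines = source.splitlines(keepends=True)
    if not 0 <= line < len(lines):
        raise ValueError("line out of range")
    # Clamp the column to the visible length of the line (excluding the newline).
    visible = len(lines[line].rstrip("\r\n"))
    column = max(0, min(column, visible))
    offset = sum(len(prev) for prev in lines[:line]) + column
    return source[:offset], source[offset:]
```

The two parts could then be passed on as `client.infill(input_prefix=prefix, input_suffix=suffix)`.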

Slot Management¶

slots.list¶

def list(self, fail_on_no_slot: bool = False) -> List[SlotInfo]

List server slots. Each SlotInfo has: id, is_processing, n_ctx, n_predict, params, prompt.

slots.save / slots.restore / slots.erase¶

def save(self, slot_id: int, filename: str) -> Dict[str, Any]
def restore(self, slot_id: int, filename: str) -> Dict[str, Any]
def erase(self, slot_id: int) -> Dict[str, Any]

Save, restore, or erase the KV cache for a specific slot.
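The slot methods can be combined, for instance to snapshot every slot that is not currently busy. The sketch below is illustrative: `save_idle_slots` and the filename pattern are assumptions. Upstream llama.cpp also expects a slot save path to be configured on the server (the `--slot-save-path` flag) before save and restore will work.

```python
from typing import Dict


def save_idle_slots(client, prefix: str = "slot") -> Dict[int, str]:
    """Save the KV cache of every idle slot; return {slot_id: filename}."""
    saved: Dict[int, str] = {}
    for slot in client.slots.list():
        if slot.is_processing:
            continue  # leave slots that are serving a request untouched
        filename = f"{prefix}-{slot.id}.bin"  # hypothetical naming scheme
        client.slots.save(slot.id, filename)
        saved[slot.id] = filename
    return saved
```

A later `client.slots.restore(slot_id, filename)` with the returned filenames would load the caches back.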

LoRA Adapter Management¶

lora.list¶

def list(self) -> List[LoraAdapter]

List loaded LoRA adapters. Each LoraAdapter has: id, path, scale.

lora.set_scales¶

def set_scales(self, adapters: List[Dict[str, Any]]) -> bool

Set LoRA adapter scales at runtime.

client.lora.set_scales([
    {"id": 0, "scale": 0.5},
    {"id": 1, "scale": 0.8},
])
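To activate a subset of adapters, it can help to build the payload from the adapter list so that every adapter gets an explicit scale. This is a sketch under that assumption; `scales_payload` is a hypothetical helper.

```python
from typing import Any, Dict, Iterable, List, Mapping


def scales_payload(adapters: Iterable[Any],
                   wanted: Mapping[int, float]) -> List[Dict[str, Any]]:
    """Build a set_scales() payload: requested scales, 0.0 for all other adapters."""
    return [
        {"id": adapter.id, "scale": float(wanted.get(adapter.id, 0.0))}
        for adapter in adapters
    ]
```

For example, `client.lora.set_scales(scales_payload(client.lora.list(), {0: 0.5}))` would enable adapter 0 at half strength and set the rest to zero.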

Response Dataclasses¶

CompletionResponse¶

Field	Type	Description
`id`	`str`	Response ID
`object`	`str`	Object type (`"text_completion"` or `"chat.completion"`)
`created`	`int`	Unix timestamp
`model`	`str`	Model identifier
`choices`	`List[Choice]`	List of completion choices
`usage`	`Optional[Usage]`	Token usage statistics
`timings`	`Optional[Timings]`	Performance timings
`system_fingerprint`	`Optional[str]`	System fingerprint

Choice¶

Field	Type	Description
`index`	`int`	Choice index
`message`	`Optional[Message]`	Chat message (for chat completions)
`text`	`Optional[str]`	Generated text (for native completions)
`finish_reason`	`Optional[str]`	Stop reason (`"stop"`, `"length"`, etc.)

Message¶

Field	Type	Description
`role`	`str`	Message role (`"assistant"`, `"user"`, `"system"`)
`content`	`str`	Message content
`tool_calls`	`Optional[List[Dict]]`	Tool call results
`reasoning_content`	`Optional[str]`	Reasoning content (if supported)
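When tools are supplied, tool_calls holds the calls the model requested. Assuming the entries follow the OpenAI format (a `function` object whose `arguments` field is a JSON-encoded string), they can be decoded as sketched below. `parse_tool_calls` is a hypothetical helper.

```python
import json
from typing import Any, Dict, List, Optional, Tuple


def parse_tool_calls(
    tool_calls: Optional[List[Dict[str, Any]]],
) -> List[Tuple[str, Dict[str, Any]]]:
    """Return (function_name, arguments) pairs from OpenAI-style tool calls."""
    parsed = []
    for call in tool_calls or []:
        function = call.get("function") or {}
        name = function.get("name", "")
        raw_args = function.get("arguments") or "{}"
        # Arguments normally arrive as a JSON string; accept a dict as well.
        args = json.loads(raw_args) if isinstance(raw_args, str) else raw_args
        parsed.append((name, args))
    return parsed
```

Typical usage would be `parse_tool_calls(response.choices[0].message.tool_calls)`. Malformed JSON in `arguments` raises `json.JSONDecodeError`, which callers may want to catch.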

Usage¶

Field	Type	Description
`prompt_tokens`	`int`	Input token count
`completion_tokens`	`int`	Output token count
`total_tokens`	`int`	Total token count

Timings¶

Field	Type	Description
`prompt_n`	`int`	Tokens evaluated
`prompt_ms`	`float`	Prompt evaluation time (ms)
`prompt_per_token_ms`	`float`	Time per prompt token (ms)
`prompt_per_second`	`float`	Prompt tokens per second
`predicted_n`	`int`	Tokens predicted
`predicted_ms`	`float`	Prediction time (ms)
`predicted_per_token_ms`	`float`	Time per predicted token (ms)
`predicted_per_second`	`float`	Prediction tokens per second
`cache_n`	`int`	Cached tokens reused
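The per-token and per-second fields appear to be derived from the corresponding token count and elapsed milliseconds. The sketch below shows that arithmetic; the helper names are hypothetical, and the relationship is an assumption based on the field names rather than a statement about how the server computes them.

```python
def tokens_per_second(n_tokens: int, elapsed_ms: float) -> float:
    """Throughput in tokens per second from a count and a duration in ms."""
    if elapsed_ms <= 0:
        return 0.0
    return n_tokens * 1000.0 / elapsed_ms


def ms_per_token(n_tokens: int, elapsed_ms: float) -> float:
    """Average latency per token in milliseconds."""
    if n_tokens <= 0:
        return 0.0
    return elapsed_ms / n_tokens
```

For instance, 50 predicted tokens in 2000 ms correspond to 25 tokens per second and 40 ms per token.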

Other Dataclasses¶

Class	Fields	Description
`EmbeddingsResponse`	`object`, `data`, `model`, `usage`	Embeddings API response
`EmbeddingData`	`index`, `embedding`, `object`	Single embedding vector
`RerankResponse`	`model`, `results`, `usage`	Rerank API response
`RerankResult`	`index`, `relevance_score`, `document`	Single rerank result
`TokenizeResponse`	`tokens`	Tokenization result
`HealthStatus`	`status`, `slots_idle`, `slots_processing`	Health check response
`SlotInfo`	`id`, `is_processing`, `n_ctx`, `n_predict`, `params`, `prompt`	Slot status
`LoraAdapter`	`id`, `path`, `scale`	LoRA adapter info
`StopType`	Enum: `NONE`, `EOS`, `LIMIT`, `WORD`	Completion stop types

Core API -- High-level InferenceEngine wraps this client
Telemetry API -- InstrumentedLlamaCppClient for auto-tracing
Server and Models -- Server lifecycle management