Server and Models API Reference¶

This page documents the server lifecycle management and model discovery systems in llamatelemetry. ServerManager handles finding, starting, and stopping the llama-server process. The models module provides smart model loading from registries, HuggingFace, and local paths.

Modules: llamatelemetry.server, llamatelemetry.models

ServerManager¶

Manages the llama-server process lifecycle: locating the binary, starting with appropriate parameters, health checking, and graceful shutdown. Supports context manager protocol.

Constructor¶

class ServerManager:
    def __init__(self, server_url: str = "http://127.0.0.1:8080")

Parameter	Type	Default	Description
`server_url`	`str`	`"http://127.0.0.1:8080"`	URL where the server will be accessible

Context Manager¶

with ServerManager() as manager:
    manager.start_server(model_path="/path/to/model.gguf", gpu_layers=99)
    # Server is running...
# Server is automatically stopped on exit

find_llama_server¶

def find_llama_server(self) -> Optional[Path]

Locate the llama-server executable. Searches in this order:

LLAMA_SERVER_PATH environment variable
Package bootstrap directory (llamatelemetry/binaries/)
LLAMA_CPP_DIR environment variable
Cache directory (~/.cache/llamatelemetry/)
Repository development directories
System PATH
Downloads a fresh binary bundle as last resort

Returns: Path to the executable, or None if not found.

start_server¶

def start_server(
    self,
    model_path: str,
    port: int = 8080,
    host: str = "127.0.0.1",
    gpu_layers: int = 99,
    ctx_size: int = 2048,
    n_parallel: int = 1,
    batch_size: int = 512,
    ubatch_size: int = 128,
    timeout: int = 60,
    verbose: bool = True,
    skip_gpu_check: bool = False,
    silent: bool = False,
    multi_gpu_config: Optional[Any] = None,
    nccl_config: Optional[Any] = None,
    enable_metrics: bool = False,
    enable_props: bool = False,
    enable_slots: bool = True,
    **kwargs,
) -> bool

Start llama-server with the specified configuration.

Parameter	Type	Default	Description
`model_path`	`str`	required	Path to GGUF model file
`port`	`int`	`8080`	Server port
`host`	`str`	`"127.0.0.1"`	Server bind address
`gpu_layers`	`int`	`99`	Number of layers to offload to GPU
`ctx_size`	`int`	`2048`	Context window size
`n_parallel`	`int`	`1`	Number of parallel sequences
`batch_size`	`int`	`512`	Logical maximum batch size
`ubatch_size`	`int`	`128`	Physical maximum batch size
`timeout`	`int`	`60`	Max seconds to wait for server startup
`verbose`	`bool`	`True`	Print status messages
`skip_gpu_check`	`bool`	`False`	Skip GPU compatibility check
`silent`	`bool`	`False`	Suppress all server output
`multi_gpu_config`	`Optional[MultiGPUConfig]`	`None`	Multi-GPU configuration object
`nccl_config`	`Optional[NCCLConfig]`	`None`	NCCL configuration for environment setup
`enable_metrics`	`bool`	`False`	Enable `/metrics` Prometheus endpoint
`enable_props`	`bool`	`False`	Enable `/props` endpoint
`enable_slots`	`bool`	`True`	Enable `/slots` endpoint
`**kwargs`			Additional server arguments (see below)

Additional kwargs are mapped to llama-server CLI flags: flash_attn, main_gpu, split_mode, tensor_split, no_mmap, mlock, threads, cache_type_k, cache_type_v, lora, rope_scaling, and 60+ more parameters.

Boolean toggle kwargs support true/false pairs: perf, escape, op_offload, kv_offload, repack, warmup, context_shift, cont_batching, webui.

Returns: True if the server started successfully.

Raises: FileNotFoundError, RuntimeError

manager = ServerManager(server_url="http://127.0.0.1:8080")
manager.start_server(
    model_path="/models/gemma-3-1b-it-Q4_K_M.gguf",
    gpu_layers=99,
    ctx_size=4096,
    flash_attn=True,
    enable_metrics=True,
)

stop_server¶

def stop_server(self, timeout: float = 10.0) -> bool

Stop the running llama-server. Sends SIGTERM first, then SIGKILL if the graceful shutdown times out.

Parameter	Type	Default	Description
`timeout`	`float`	`10.0`	Max seconds to wait for graceful shutdown

Returns: True if the server stopped successfully.

Health and Status Methods¶

Method	Signature	Description
`check_server_health(timeout=2.0)`	`-> bool`	Check if the server responds to `/health`
`get_health(timeout=2.0)`	`-> Dict[str, Any]`	Get detailed health JSON (returns `{"status": "loading"}` on 503, `{"status": "unavailable"}` on error)
`wait_ready(timeout=60.0, interval=1.0)`	`-> bool`	Block until the server is healthy or timeout
`get_metrics(timeout=5.0)`	`-> Optional[str]`	Fetch Prometheus metrics text from `/metrics`
`get_models(timeout=5.0)`	`-> Any`	Fetch loaded models from `/v1/models` (OpenAI-compatible)
`get_slots(timeout=5.0)`	`-> Any`	Fetch slot status from `/slots`
`get_props(timeout=5.0)`	`-> Any`	Fetch server properties from `/props`
`get_server_info()`	`-> Dict[str, Any]`	Get local server info: `running`, `url`, `process_id`, `executable`

manager = ServerManager()
manager.start_server("/path/to/model.gguf", enable_metrics=True)

# Check health
print(manager.check_server_health())  # True

# Get Prometheus metrics
metrics_text = manager.get_metrics()

# Get slot status
slots = manager.get_slots()

# Get loaded model info
models = manager.get_models()

restart_server¶

def restart_server(self, model_path: str, **kwargs) -> bool

Stop the current server and restart with new configuration. Accepts the same parameters as start_server.

start_from_preset¶

def start_from_preset(self, model_path: str, preset_name: Any = None, **kwargs) -> bool

Start llama-server using a Kaggle preset configuration.

Parameter	Type	Default	Description
`model_path`	`str`	required	Path to GGUF model
`preset_name`	`str` or `ServerPreset`	`AUTO`	Preset name (`"FAST"`, `"QUALITY"`, `"AUTO"`)
`**kwargs`			Override values for the preset

ModelInfo¶

GGUF model information extractor. Reads metadata from GGUF file headers to provide model information and recommend inference settings.

Constructor¶

class ModelInfo:
    def __init__(self, filepath: str)

Parameter	Type	Default	Description
`filepath`	`str`	required	Path to GGUF model file

Attributes¶

Attribute	Type	Description
`filepath`	`Path`	Path to the model file
`file_size_mb`	`float`	File size in megabytes
`architecture`	`Optional[str]`	Model architecture (e.g., `"llama"`, `"gemma"`)
`parameter_count`	`Optional[int]`	Estimated parameter count
`context_length`	`Optional[int]`	Maximum context length
`embedding_length`	`Optional[int]`	Embedding dimension
`quantization`	`Optional[str]`	Quantization type
`metadata`	`Dict[str, Any]`	Raw GGUF metadata dictionary

Methods¶

Method	Signature	Description
`from_file(filepath)`	`@classmethod -> ModelInfo`	Create from file path
`get_recommended_settings(vram_gb=8.0)`	`-> Dict[str, Any]`	Get recommended `gpu_layers`, `ctx_size`, `batch_size`, `ubatch_size` for given VRAM
`to_dict()`	`-> Dict[str, Any]`	Convert to serializable dictionary

from llamatelemetry.models import ModelInfo
info = ModelInfo.from_file("/models/gemma-3-1b-it-Q4_K_M.gguf")
print(f"Architecture: {info.architecture}")
print(f"Size: {info.file_size_mb:.1f} MB")
settings = info.get_recommended_settings(vram_gb=15.0)
print(f"Recommended GPU layers: {settings['gpu_layers']}")
print(f"Recommended context: {settings['ctx_size']}")

ModelManager¶

Manages a collection of models with scanning, filtering, and selection.

Constructor¶

class ModelManager:
    def __init__(self, directories: Optional[List[str]] = None)

Methods¶

Method	Signature	Description
`scan_directories(directories)`	`-> None`	Scan directories for `*.gguf` files recursively
`find_by_size(min_mb=0, max_mb=inf)`	`-> List[ModelInfo]`	Filter models by file size range
`find_by_architecture(architecture)`	`-> List[ModelInfo]`	Filter models by architecture name (case-insensitive)
`get_best_for_vram(vram_gb)`	`-> Optional[ModelInfo]`	Get the largest model that fits in the given VRAM (70% safety margin)

from llamatelemetry.models import ModelManager
manager = ModelManager(directories=["/kaggle/working/models"])
models = manager.find_by_size(max_mb=4000)  # Models under 4GB
best = manager.get_best_for_vram(vram_gb=15.0)
if best:
    print(f"Best model: {best.filepath.name} ({best.file_size_mb:.0f} MB)")

Module-Level Functions¶

load_model_smart¶

def load_model_smart(
    model_name_or_path: str,
    cache_dir: Optional[Path] = None,
    interactive: bool = True,
    force_download: bool = False,
) -> Path

Smart model loader with auto-download and confirmation. Handles three cases:

Local path exists -- returns the path directly
Registry name (e.g., "gemma-3-1b-Q4_K_M") -- downloads from HuggingFace with optional confirmation
HuggingFace syntax ("repo/id:filename.gguf") -- downloads directly

Parameter	Type	Default	Description
`model_name_or_path`	`str`	required	Model name, local path, or `"repo:file"` syntax
`cache_dir`	`Optional[Path]`	`None`	Cache directory (default: `llamatelemetry/models/`)
`interactive`	`bool`	`True`	Ask for download confirmation
`force_download`	`bool`	`False`	Re-download even if cached

Returns: Path to the model file, or None if the user cancelled.

Raises: ValueError, FileNotFoundError, ImportError, RuntimeError

from llamatelemetry.models import load_model_smart

# From registry
path = load_model_smart("gemma-3-1b-Q4_K_M")

# Local path
path = load_model_smart("/path/to/model.gguf")

# HuggingFace syntax
path = load_model_smart("unsloth/gemma-3-1b-it-GGUF:gemma-3-1b-it-Q4_K_M.gguf")

download_model¶

def download_model(
    repo_id: str,
    filename: str,
    output_dir: Optional[str] = None,
    show_progress: bool = True,
) -> str

Download a GGUF model directly from HuggingFace Hub. Requires huggingface_hub.

Parameter	Type	Default	Description
`repo_id`	`str`	required	HuggingFace repository ID
`filename`	`str`	required	Filename to download
`output_dir`	`Optional[str]`	`None`	Output directory (default: current directory)
`show_progress`	`bool`	`True`	Show download progress

Returns: Path to the downloaded model file.

list_models¶

def list_models(directories: Optional[List[str]] = None) -> List[Dict[str, Any]]

List all GGUF models in specified directories. Returns model information dictionaries sorted by file size (descending). Each dictionary contains: filepath, filename, file_size_mb, architecture, parameter_count, context_length, quantization, metadata.

list_registry_models¶

def list_registry_models() -> Dict[str, Dict[str, Any]]

List all models available in the built-in registry (30+ models). Returns a dictionary mapping model name to info dict with keys: repo, file, description, size_mb, min_vram_gb.

from llamatelemetry.models import list_registry_models
for name, info in list_registry_models().items():
    print(f"{name}: {info['size_mb']}MB - {info['description']}")

get_model_recommendations¶

def get_model_recommendations(vram_gb: float = 8.0) -> List[Dict[str, str]]

Get recommended models based on available VRAM. Returns a list of dictionaries with keys: name, repo, file, size_gb, description. Recommendations are tiered by VRAM: 24GB+, 12GB+, 6GB+, 3GB+, and under 3GB.

SmartModelDownloader¶

Smart model downloader with VRAM validation and quantization recommendations.

Constructor¶

class SmartModelDownloader:
    def __init__(
        self,
        vram_gb: Optional[float] = None,
        cache_dir: Optional[Path] = None,
        auto_recommend: bool = True,
    )

Parameter	Type	Default	Description
`vram_gb`	`Optional[float]`	`None`	Available VRAM in GB (auto-detected if `None`)
`cache_dir`	`Optional[Path]`	`None`	Model cache directory
`auto_recommend`	`bool`	`True`	Suggest alternatives if model too large

Methods¶

Method	Signature	Description
`validate_model(model_name)`	`-> Dict[str, Any]`	Check if a model fits in VRAM. Returns `fits`, `estimated_vram_gb`, `alternative_models`, `warning`
`download(model_name_or_path, force=False, warn_on_large=True, interactive=True)`	`-> Optional[Path]`	Download with VRAM validation and warnings
`get_recommendations(max_size_gb=None, min_quality="Q4_K_M")`	`-> List[Dict]`	Get top 10 recommended models for available VRAM

from llamatelemetry.models import SmartModelDownloader
downloader = SmartModelDownloader(vram_gb=15.0)

result = downloader.validate_model("gemma-3-12b-Q4_K_M")
if not result["fits"]:
    print(f"Try: {result['alternative_models']}")

path = downloader.download("gemma-3-4b-Q4_K_M")

Core API -- InferenceEngine uses these internally
GGUF API -- Detailed GGUF parsing and analysis
Client API -- Direct server communication

Server and Models API Reference¶

ServerManager¶

Constructor¶

Context Manager¶

find_llama_server¶

start_server¶

stop_server¶

Health and Status Methods¶

restart_server¶

start_from_preset¶

ModelInfo¶

Constructor¶

Attributes¶

Methods¶

ModelManager¶

Constructor¶

Methods¶

Module-Level Functions¶

load_model_smart¶

download_model¶

list_models¶

list_registry_models¶

get_model_recommendations¶

SmartModelDownloader¶

Constructor¶

Methods¶

Related Modules¶