Server Management¶
ServerManager handles the full lifecycle of the llama-server backend process: binary discovery, bootstrapping, startup with GPU-aware configuration, health monitoring, metrics collection, and graceful shutdown. It is used internally by InferenceEngine but is also available for direct control.
Overview¶
The server management layer provides:
- Binary discovery -- locates llama-server in standard paths, environment variables, or the bootstrap cache
- Auto-bootstrap -- downloads a pre-built release bundle (~961 MB) if no binary is found
- GPU-aware startup -- configures GPU layers, context size, batch parameters, and multi-GPU splits
- Health monitoring -- readiness polling, health checks, and llama.cpp metrics endpoint
- Lifecycle management -- start, stop, restart, and process supervision
Creating a ServerManager¶
from llamatelemetry.server import ServerManager
# Default -- targets http://127.0.0.1:8080
manager = ServerManager()
# Custom URL
manager = ServerManager(server_url="http://127.0.0.1:9000")
Finding the llama-server Binary¶
Before starting a server, the manager must locate the llama-server binary:
server_path = manager.find_llama_server()
if server_path:
    print(f"Found llama-server at: {server_path}")
else:
    print("Binary not found -- bootstrap will download it")
Search Order¶
The manager searches for the binary in this order:
1. LLAMA_SERVER_PATH environment variable (explicit path)
2. LLAMA_CPP_DIR environment variable (custom build directory)
3. The llamatelemetry bootstrap cache (~/.cache/llamatelemetry/)
4. Common system paths (/usr/local/bin, etc.)
If no binary is found and auto_start=True is passed to load_model(), the bootstrap system downloads the appropriate release bundle.
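The search order above can be sketched in plain Python. This is an illustration only, not the library's implementation: the function name find_binary, the assumption that LLAMA_CPP_DIR holds the binary either directly or under build/bin, and the recursive scan of the cache directory are all hypothetical.

```python
from pathlib import Path
from typing import List, Mapping, Optional, Sequence

BINARY_NAME = "llama-server"


def find_binary(
    env: Mapping[str, str],
    cache_dir: Path,
    system_dirs: Sequence[Path],
) -> Optional[Path]:
    """Return the first existing candidate, following the documented order."""
    candidates: List[Path] = []
    # 1. Explicit path from LLAMA_SERVER_PATH.
    if env.get("LLAMA_SERVER_PATH"):
        candidates.append(Path(env["LLAMA_SERVER_PATH"]))
    # 2. Custom build directory from LLAMA_CPP_DIR.
    #    Assumption: the binary sits directly in it or under build/bin.
    if env.get("LLAMA_CPP_DIR"):
        root = Path(env["LLAMA_CPP_DIR"])
        candidates += [root / BINARY_NAME, root / "build" / "bin" / BINARY_NAME]
    # 3. Bootstrap cache (assumed layout: binary somewhere below cache_dir).
    if cache_dir.is_dir():
        candidates += sorted(cache_dir.rglob(BINARY_NAME))
    # 4. Common system paths.
    candidates += [d / BINARY_NAME for d in system_dirs]
    for path in candidates:
        if path.is_file():
            return path
    return None
```

A call such as find_binary(os.environ, Path.home() / ".cache" / "llamatelemetry", [Path("/usr/local/bin")]) would mirror the documented defaults. Taking the environment and directories as arguments keeps the lookup easy to test.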
Starting a Server¶
manager.start_server(
    model_path="/path/to/model.gguf",
    port=8080,
    host="127.0.0.1",
    gpu_layers=99,
    ctx_size=2048,
    n_parallel=1,
    batch_size=512,
    ubatch_size=128,
    enable_metrics=True,
    enable_props=True,
    enable_slots=True,
)
Start Parameters Reference¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| model_path | str | required | Path to the GGUF model file |
| port | int | 8080 | HTTP port for the server |
| host | str | "127.0.0.1" | Bind address |
| gpu_layers | int | 99 | Number of layers offloaded to GPU |
| ctx_size | int | 2048 | Context window size in tokens |
| n_parallel | int | 1 | Number of parallel inference slots |
| batch_size | int | 512 | Logical batch size |
| ubatch_size | int | 128 | Physical micro-batch size |
| enable_metrics | bool | True | Enable the /metrics Prometheus endpoint |
| enable_props | bool | True | Enable the /props endpoint |
| enable_slots | bool | True | Enable the /slots endpoint |
| multi_gpu_config | MultiGPUConfig | None | Multi-GPU configuration |
| nccl_config | NCCLConfig | None | NCCL communication config |
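To make the parameters concrete, the sketch below maps them onto llama-server command-line flags. The function build_server_args is hypothetical, and the exact mapping ServerManager performs is an assumption; the flag names are those used by recent llama.cpp builds and can be checked with llama-server --help.

```python
from typing import List


def build_server_args(
    model_path: str,
    port: int = 8080,
    host: str = "127.0.0.1",
    gpu_layers: int = 99,
    ctx_size: int = 2048,
    n_parallel: int = 1,
    batch_size: int = 512,
    ubatch_size: int = 128,
    enable_metrics: bool = True,
    enable_props: bool = True,
    enable_slots: bool = True,
) -> List[str]:
    """Translate start parameters into llama-server flags (assumed mapping)."""
    args = [
        "--model", model_path,
        "--host", host,
        "--port", str(port),
        "--n-gpu-layers", str(gpu_layers),
        "--ctx-size", str(ctx_size),
        "--parallel", str(n_parallel),
        "--batch-size", str(batch_size),
        "--ubatch-size", str(ubatch_size),
    ]
    # The monitoring endpoints are opt-in switches without a value.
    if enable_metrics:
        args.append("--metrics")
    if enable_props:
        args.append("--props")
    if enable_slots:
        args.append("--slots")
    return args
```

The resulting list can be appended to the binary path and handed to subprocess.Popen, which avoids shell quoting issues with model paths that contain spaces.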
Multi-GPU Startup¶
Pass a MultiGPUConfig for tensor-parallel or layer-split inference:
from llamatelemetry.api.multigpu import MultiGPUConfig, SplitMode
multi_gpu = MultiGPUConfig(
    n_gpu_layers=-1,              # All layers on GPU
    split_mode=SplitMode.LAYER,   # Layer-based split
    tensor_split=[0.5, 0.5],      # Equal split across 2 GPUs
    flash_attention=True,
)
manager.start_server(
    model_path="/path/to/model.gguf",
    multi_gpu_config=multi_gpu,
)
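The tensor_split list gives each GPU's share of the model. As a rough sketch, the helper below validates such a list and renders it as the --split-mode and --tensor-split flags that llama-server accepts. The helper name tensor_split_args is hypothetical, and how MultiGPUConfig actually translates its fields is an assumption.

```python
from typing import List, Sequence

SPLIT_MODES = {"none", "layer", "row"}  # split modes accepted by llama.cpp


def tensor_split_args(
    proportions: Sequence[float], split_mode: str = "layer"
) -> List[str]:
    """Render per-GPU proportions as llama-server flags (assumed mapping)."""
    if split_mode not in SPLIT_MODES:
        raise ValueError(f"unknown split mode: {split_mode!r}")
    if not proportions or any(p < 0 for p in proportions):
        raise ValueError("tensor_split needs non-negative proportions")
    total = sum(proportions)
    if total == 0:
        raise ValueError("tensor_split proportions must not all be zero")
    # Normalise so the shares sum to 1, then join with commas.
    shares = ",".join(f"{p / total:.3g}" for p in proportions)
    return ["--split-mode", split_mode, "--tensor-split", shares]
```

Normalising first means [1, 1] and [0.5, 0.5] produce the same flags, and an uneven split such as [3, 1] gives the first GPU three quarters of the model.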
Waiting for Readiness¶
After starting the server, wait until it finishes loading the model and is ready for requests:
manager.start_server(model_path="/path/to/model.gguf")
# Block until the server responds to health checks
ready = manager.wait_ready(timeout=120)
if ready:
    print("Server is ready for inference")
else:
    print("Server failed to start within timeout")
The default timeout is generous enough for large models on Tesla T4 GPUs. wait_ready() polls the /health endpoint at regular intervals and returns True as soon as the server responds, or False if the timeout expires first.
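The polling behaviour can be approximated with a small loop. This is a sketch under assumptions, not the code behind wait_ready(): the names wait_until_ready and http_health_probe, the one-second interval, and the injectable sleep and clock parameters are all illustrative.

```python
import time
import urllib.request
from typing import Callable


def http_health_probe(url: str = "http://127.0.0.1:8080/health") -> Callable[[], bool]:
    """Build a probe that reports True when the URL answers with HTTP 200."""
    def probe() -> bool:
        # urlopen raises URLError/HTTPError (both OSError subclasses) on
        # refused connections and non-2xx replies such as 503.
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    return probe


def wait_until_ready(
    probe: Callable[[], bool],
    timeout: float = 120.0,
    interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll probe until it succeeds or timeout seconds have passed."""
    deadline = clock() + timeout
    while True:
        try:
            if probe():
                return True
        except OSError:
            pass  # server not accepting connections yet
        if clock() >= deadline:
            return False
        sleep(interval)
```

Calling wait_until_ready(http_health_probe(), timeout=120) blocks in the same way as the example above. Using a monotonic clock keeps the deadline correct even if the system time changes while the model loads.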
Health and Monitoring¶
Health Check¶
# Simple boolean check
is_healthy = manager.check_server_health()
print(f"Healthy: {is_healthy}")
# Detailed health response
health = manager.get_health()
print(health)
# Example: {"status": "ok", "slots_idle": 1, "slots_processing": 0}
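The detailed response can drive simple admission decisions. The helper below is a hypothetical sketch that assumes the dictionary has the shape shown in the example above; the slot counters may be missing in some server versions, so it falls back to the status field alone.

```python
from typing import Any, Mapping


def has_idle_slot(health: Mapping[str, Any]) -> bool:
    """Return True when the server is healthy and can take a request now."""
    if health.get("status") != "ok":
        return False
    idle = health.get("slots_idle")
    # Without slot counters, a healthy status is the best signal available.
    return True if idle is None else idle > 0
```

A caller could check has_idle_slot(manager.get_health()) before submitting work, and queue or back off when it returns False.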
Server Properties¶
props = manager.get_props()
print(f"Model: {props.get('default_generation_settings', {}).get('model')}")
print(f"Context size: {props.get('default_generation_settings', {}).get('n_ctx')}")
Slot Information¶
slots = manager.get_slots()
for slot in slots:
    print(f"Slot {slot['id']}: state={slot['state']}, task={slot.get('task_id')}")
Prometheus Metrics¶
metrics_text = manager.get_metrics()
print(metrics_text)
# Returns Prometheus-format text with llama.cpp internal metrics:
# - prompt_tokens_total
# - generation_tokens_total
# - prompt_seconds_total
# - generation_seconds_total
# - kv_cache_usage_ratio
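Because get_metrics() returns raw text, a small parser helps when individual values are needed. The sketch below handles the simple unlabelled "name value" lines of the Prometheus text format. The function name and the sample metric names in the usage note are illustrative, and series with labels that contain spaces are not supported.

```python
from typing import Dict


def parse_prometheus_text(text: str) -> Dict[str, float]:
    """Extract unlabelled samples from Prometheus exposition text."""
    values: Dict[str, float] = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and the "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        try:
            values[fields[0]] = float(fields[1])
        except ValueError:
            continue  # not a numeric sample; ignore it
    return values
```

With the parsed dictionary, a throughput figure can be derived by dividing a token counter by the matching seconds counter, for example generated tokens over generation seconds, provided both keys are present in the output of the server build in use.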
Model Information¶
Server Info¶
info = manager.get_server_info()
print(f"Version: {info.get('version')}")
print(f"Build: {info.get('build_info')}")
Stopping and Restarting¶
Graceful Shutdown¶
Call manager.stop_server() to shut the server down. This sends a termination signal to the llama-server process and waits for it to exit. If the process does not exit within a timeout, it is forcefully killed.
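The terminate, wait, then kill sequence is a common pattern with Python's subprocess module. The sketch below shows it for any Popen handle. The function name stop_process and the ten-second grace period are assumptions, not values taken from ServerManager.

```python
import subprocess


def stop_process(proc: subprocess.Popen, grace_seconds: float = 10.0) -> int:
    """Ask proc to exit, escalating to a hard kill after grace_seconds."""
    if proc.poll() is None:  # still running
        proc.terminate()  # SIGTERM on POSIX
        try:
            proc.wait(timeout=grace_seconds)
        except subprocess.TimeoutExpired:
            proc.kill()  # SIGKILL on POSIX
            proc.wait()  # reap the process so it does not linger as a zombie
    return proc.returncode
```

The grace period matters for a model server: it gives the process a chance to release GPU memory cleanly before it is killed.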
Restart¶
Restart is equivalent to stop_server() followed by start_server() with the same parameters. This is useful for reloading a model or recovering from a crashed server.
Using with Presets¶
ServerManager works well with the Kaggle preset system:
from llamatelemetry.kaggle.presets import get_preset_config, ServerPreset
preset = get_preset_config(ServerPreset.KAGGLE_DUAL_T4)
manager = ServerManager()
# Convert preset to start_server kwargs
manager.start_server(
    model_path="/path/to/model.gguf",
    **preset.to_server_kwargs(),
)
See the Kaggle Environment guide for more preset details.
Environment Variables¶
| Variable | Description |
|---|---|
| LLAMA_SERVER_PATH | Explicit path to the llama-server binary |
| LLAMA_CPP_DIR | Directory containing a custom llama.cpp build |
| LD_LIBRARY_PATH | Auto-populated by the bootstrap to include bundled CUDA libraries |
# Example: point to a custom build
export LLAMA_SERVER_PATH=/opt/llama.cpp/build/bin/llama-server
export LD_LIBRARY_PATH=/opt/llama.cpp/build/lib:$LD_LIBRARY_PATH
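When launching the binary from Python rather than a shell, the same library path adjustment can be applied to the child's environment only. This is a hypothetical helper, not the bootstrap's own code; it prepends a directory to LD_LIBRARY_PATH without duplicating it and without touching the caller's environment.

```python
import os
from typing import Dict, Mapping


def with_library_path(lib_dir: str, base_env: Mapping[str, str]) -> Dict[str, str]:
    """Return a copy of base_env with lib_dir first in LD_LIBRARY_PATH."""
    env = dict(base_env)  # copy so the caller's mapping is left unchanged
    existing = env.get("LD_LIBRARY_PATH", "")
    # Drop empty entries and any earlier occurrence of lib_dir.
    rest = [p for p in existing.split(os.pathsep) if p and p != lib_dir]
    env["LD_LIBRARY_PATH"] = os.pathsep.join([lib_dir] + rest)
    return env
```

The result can be passed as the env argument of subprocess.Popen, for example with_library_path("/opt/llama.cpp/build/lib", os.environ).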
Process Management Details¶
The ServerManager spawns llama-server as a subprocess. Key behaviors:
- stdout/stderr are captured and available for debugging
- The server process is terminated when the manager is garbage-collected
- If the Python process exits unexpectedly, the server may remain running -- use lsof -i :8080 to find and kill orphaned processes
- On Kaggle notebooks, processes are automatically cleaned up when the session ends
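Before starting a new server, it can help to check from Python whether something is already listening on the target port, for instance an orphaned llama-server from an earlier run. The function below is a small sketch using only the standard library; the name port_in_use is hypothetical.

```python
import socket


def port_in_use(host: str = "127.0.0.1", port: int = 8080, timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise.
        return sock.connect_ex((host, port)) == 0
```

A True result only shows that some process holds the port; lsof is still needed to identify which one.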
Best Practices¶
- Use InferenceEngine for most workflows -- it handles ServerManager internally.
- Call wait_ready() after start_server() to avoid race conditions.
- Enable metrics (enable_metrics=True) for production monitoring.
- Use presets on Kaggle to get optimal settings for T4 GPUs automatically.
- Set n_parallel > 1 only if you need concurrent inference slots.
- Check health periodically in long-running services to detect crashes.
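The periodic health check from the last item can be sketched as a small watchdog. Everything here is illustrative: the function name watch, the 30-second interval, and the threshold of three consecutive failures are assumptions, and the health and restart callables are supplied by the caller.

```python
import time
from typing import Callable


def watch(
    is_healthy: Callable[[], bool],
    restart: Callable[[], None],
    checks: int,
    interval: float = 30.0,
    max_failures: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run a fixed number of health checks and return how many restarts ran."""
    failures = 0
    restarts = 0
    for _ in range(checks):
        if is_healthy():
            failures = 0  # any success resets the streak
        else:
            failures += 1
            if failures >= max_failures:
                restart()
                restarts += 1
                failures = 0
        sleep(interval)
    return restarts
```

Here is_healthy could be manager.check_server_health, and restart could stop the server and start it again with the original arguments. Requiring several consecutive failures avoids restarting on a single slow response.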
Complete Example¶
from llamatelemetry.server import ServerManager
manager = ServerManager(server_url="http://127.0.0.1:8080")
# Locate the binary; a falsy result means none was found
server_path = manager.find_llama_server()
print(f"Using: {server_path}")
# Start with GPU offload and two parallel slots
manager.start_server(
    model_path="/models/gemma-3-1b-Q4_K_M.gguf",
    gpu_layers=99,
    ctx_size=4096,
    n_parallel=2,
    enable_metrics=True,
)
# Wait for readiness; stop the server and abort if it never comes up
if not manager.wait_ready(timeout=120):
    manager.stop_server()
    raise RuntimeError("llama-server did not become ready within 120 seconds")
# Monitor
print(f"Health: {manager.check_server_health()}")
print(f"Slots: {manager.get_slots()}")
print(f"Metrics:\n{manager.get_metrics()}")
# Cleanup
manager.stop_server()
Related¶
- Inference Engine -- high-level API that wraps ServerManager
- Kaggle Environment -- preset configurations
- API Client -- HTTP client for the running server
- Server and Models API Reference