
API Reference

Complete API documentation for llamatelemetry v0.1.0.


Core APIs

ServerManager

Manage the llama-server process lifecycle:

from llamatelemetry.server import ServerManager

server = ServerManager()
server.start_server(
    model_path="/path/to/model.gguf",
    gpu_layers=99,            # offload all model layers to the GPU
    tensor_split="1.0,0.0",   # place all weights on GPU 0
    flash_attn=1,             # enable flash attention
)
server.stop_server()
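
A minimal lifecycle sketch using only the calls shown above; the try/finally block is a suggested pattern, not part of the library, and ensures the server is stopped even if the request code raises:

server = ServerManager()
server.start_server(model_path="/path/to/model.gguf")
try:
    ...  # issue requests against the running server
finally:
    server.stop_server()  # always release the server process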

ServerManager API


LlamaCppClient

An OpenAI-compatible client for the running server:

from llamatelemetry.api.client import LlamaCppClient

client = LlamaCppClient(base_url="http://127.0.0.1:8080")

# Chat completion
response = client.chat.create(
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=50
)

# Streaming (delta.content may be None on the final chunk)
for chunk in client.chat.create(messages=[...], stream=True):
    print(chunk.choices[0].delta.content or "", end="")
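
Since the client is OpenAI-compatible, the reply text of a non-streaming response should live under choices; a hedged sketch, assuming the standard OpenAI response shape:

# Assumes the OpenAI-style shape: choices[0].message.content
reply = response.choices[0].message.content
print(reply)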

Client API


Telemetry Setup

Initialize OpenTelemetry tracing and metrics:

from llamatelemetry.telemetry import setup_telemetry

tracer, meter = setup_telemetry(
    service_name="my-service",
    otlp_endpoint="http://localhost:4317"
)

# Use tracer
with tracer.start_as_current_span("operation") as span:
    # Your code
    span.set_attribute("key", "value")
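
setup_telemetry also returns a meter; assuming it is a standard OpenTelemetry Meter from opentelemetry-api, metrics can be recorded alongside traces:

# Assumes `meter` follows the standard OpenTelemetry Meter interface.
request_counter = meter.create_counter(
    "llm.requests",
    description="Number of LLM requests issued",
)
request_counter.add(1, attributes={"service": "my-service"})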

Telemetry API


Graphistry Integration

Visualize collected trace spans as an interactive graph:

from llamatelemetry.graphistry import TracesGraphistry

g = TracesGraphistry(spans=collected_spans)
g.plot(
    render=True,
    point_title="span_name",
    point_color="duration_ms"
)
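
One way to obtain collected_spans is the OpenTelemetry SDK's in-memory exporter; whether TracesGraphistry accepts these ReadableSpan objects directly is an assumption to verify against the llamatelemetry docs:

from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter

exporter = InMemorySpanExporter()
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(exporter))

tracer = provider.get_tracer("llamatelemetry-demo")
with tracer.start_as_current_span("operation"):
    pass  # traced work goes here

collected_spans = exporter.get_finished_spans()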

Graphistry API


Multi-GPU Utilities

GPU configuration and monitoring:

from llamatelemetry.api.multigpu import (
    gpu_count,
    kaggle_t4_dual_config,
    get_gpu_info
)

# Get GPU count
print(f"GPUs: {gpu_count()}")

# Get optimized config
config = kaggle_t4_dual_config()

# Get GPU info
info = get_gpu_info(gpu_id=0)
print(f"VRAM: {info['memory_total_mb']} MB")

Multi-GPU API


API Categories

Inference: ServerManager, LlamaCppClient

Observability: setup_telemetry

Visualization: TracesGraphistry

Multi-GPU: gpu_count, kaggle_t4_dual_config, get_gpu_info


Quick Reference

Common Patterns

Basic Inference

server = ServerManager()
server.start_server(model_path=path)
client = LlamaCppClient()
response = client.chat.create(messages=[...])
server.stop_server()

With Telemetry

tracer, _ = setup_telemetry(service_name="llm")
server = ServerManager()
server.start_server(model_path=path)
client = LlamaCppClient()
with tracer.start_as_current_span("request"):
    response = client.chat.create(messages=[...])
server.stop_server()

Split-GPU with Visualization

server = ServerManager()
server.start_server(model_path=path, tensor_split="1.0,0.0")
# ... collect spans ...
g = TracesGraphistry(spans=spans)
g.plot(render=True)
