# API Reference

Complete API documentation for llamatelemetry v0.1.0.
## Core APIs

### ServerManager

Manage the llama-server lifecycle:

```python
from llamatelemetry.server import ServerManager

server = ServerManager()
server.start_server(
    model_path="/path/to/model.gguf",
    gpu_layers=99,
    tensor_split="1.0,0.0",
    flash_attn=1,
)

server.stop_server()
```
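Because `start_server()` launches an external llama-server process, it is worth guaranteeing shutdown even when inference code raises. A minimal sketch using only the calls shown above (whether `ServerManager` also supports a context-manager protocol is not documented here):

```python
from llamatelemetry.server import ServerManager

server = ServerManager()
server.start_server(model_path="/path/to/model.gguf", gpu_layers=99)
try:
    # ... run inference against the server ...
    pass
finally:
    # Always free the GPU memory and port, even on errors.
    server.stop_server()
```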
### LlamaCppClient

OpenAI-compatible client:

```python
from llamatelemetry.api.client import LlamaCppClient

client = LlamaCppClient(base_url="http://127.0.0.1:8080")

# Chat completion
response = client.chat.create(
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=50
)

# Streaming
for chunk in client.chat.create(messages=[...], stream=True):
    print(chunk.choices[0].delta.content, end="")
```
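To build the complete response from a stream, the chunks can be accumulated. A minimal sketch, assuming OpenAI-style streaming semantics where `delta.content` may be empty or `None` for role-only and final chunks:

```python
parts = []
for chunk in client.chat.create(
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
):
    delta = chunk.choices[0].delta.content
    if delta:  # skip chunks that carry no text
        parts.append(delta)

full_text = "".join(parts)
```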
### Telemetry Setup

Initialize OpenTelemetry:

```python
from llamatelemetry.telemetry import setup_telemetry

tracer, meter = setup_telemetry(
    service_name="my-service",
    otlp_endpoint="http://localhost:4317"
)

# Use the tracer
with tracer.start_as_current_span("operation") as span:
    # Your code
    span.set_attribute("key", "value")
```
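The returned `meter` can record custom metrics alongside traces. A minimal sketch, assuming `setup_telemetry()` returns standard OpenTelemetry `Tracer` and `Meter` objects; the metric name and attributes below are illustrative, not defined by llamatelemetry:

```python
# Create a counter once, then increment it per request.
request_counter = meter.create_counter(
    "llm.requests",
    unit="1",
    description="Number of chat completion requests",
)

request_counter.add(1, {"service": "my-service"})
```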
### Graphistry Integration

Graph visualization:

```python
from llamatelemetry.graphistry import TracesGraphistry

g = TracesGraphistry(spans=collected_spans)
g.plot(
    render=True,
    point_title="span_name",
    point_color="duration_ms"
)
```
### Multi-GPU Utilities

GPU configuration and monitoring:

```python
from llamatelemetry.api.multigpu import (
    gpu_count,
    kaggle_t4_dual_config,
    get_gpu_info
)

# Get GPU count
print(f"GPUs: {gpu_count()}")

# Get an optimized config
config = kaggle_t4_dual_config()

# Get GPU info
info = get_gpu_info(gpu_id=0)
print(f"VRAM: {info['memory_total_mb']} MB")
```
## API Categories

### Inference

- ServerManager - Server lifecycle management
- Client - OpenAI-compatible client
- Models - Model utilities

### Observability

- Telemetry - OpenTelemetry setup
- GPU Monitoring - PyNVML integration
- Metrics - Custom metrics

### Visualization

- Graphistry - Graph visualization
- GGUF Utilities - GGUF file parsing

### Multi-GPU

- Multi-GPU - GPU utilities
- Configuration - Optimized configs
## Quick Reference

### Common Patterns

#### Basic Inference

```python
server = ServerManager()
server.start_server(model_path=path)

client = LlamaCppClient()
response = client.chat.create(messages=[...])

server.stop_server()
```
#### With Telemetry

```python
tracer, _ = setup_telemetry(service_name="llm")

server = ServerManager()
server.start_server(model_path=path)

client = LlamaCppClient()
with tracer.start_as_current_span("request"):
    response = client.chat.create(messages=[...])

server.stop_server()
```
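Span attributes make the resulting traces easier to filter. A minimal sketch using the standard `span.set_attribute()` call shown earlier; the attribute names are illustrative, not a llamatelemetry convention:

```python
with tracer.start_as_current_span("request") as span:
    span.set_attribute("llm.max_tokens", 50)  # illustrative attribute name
    response = client.chat.create(
        messages=[{"role": "user", "content": "Hello"}],
        max_tokens=50,
    )
```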
#### Split-GPU with Visualization

```python
server = ServerManager()
server.start_server(model_path=path, tensor_split="1.0,0.0")

# ... collect spans ...

g = TracesGraphistry(spans=spans)
g.plot(render=True)
```
## API Documentation

- Server Manager - Complete server API
- Client API - Complete client API
- Telemetry - OpenTelemetry integration
- Graphistry - Visualization API
- Multi-GPU - GPU utilities
- GGUF - GGUF utilities
- Models - Model utilities