Observability Overview¶
llamatelemetry v0.1.0 introduces the Observability Trilogy: three notebooks (14-16) that together deliver production-grade observability for LLM inference workloads.
What is the Observability Trilogy?¶
The Observability Trilogy consists of three comprehensive notebooks that build upon each other to create a complete production observability stack:
Notebook 14: OpenTelemetry LLM Observability (45 min)¶
Full OpenTelemetry integration with semantic conventions for LLM inference.
Key Features:
- Complete OpenTelemetry setup (traces, metrics, logs)
- Semantic conventions for LLM operations
- Distributed context propagation
- OTLP export to popular backends (Jaeger, Grafana, DataDog)
- Span hierarchy and relationships
- Graph-based trace visualization with Graphistry
What You'll Learn:
- OpenTelemetry fundamentals
- LLM-specific semantic attributes
- Trace hierarchy design
- Exporter configuration (a minimal sketch follows this list)
- Graphistry visualization basics
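In Notebook 14, exporter wiring is handled by llamatelemetry's `setup_telemetry` (shown under Key Concepts below). For reference, a minimal sketch of the equivalent configuration using only the upstream `opentelemetry-sdk`, assuming an OTLP/gRPC collector on localhost:

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Tag all spans with a service name so backends (Jaeger, Grafana, DataDog) can group them
resource = Resource.create({"service.name": "llm-service"})

# Export spans in batches over OTLP/gRPC to a local collector
provider = TracerProvider(resource=resource)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317"))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("llm-service")
```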
Notebook 15: Real-time Performance Monitoring (30 min)¶
Live GPU monitoring with real-time Plotly dashboards showing llama.cpp metrics.
Key Features:
- llama.cpp /metrics endpoint integration
- PyNVML GPU monitoring (VRAM, temperature, power, utilization)
- Real-time Plotly FigureWidget dashboards
- Live metric updates (1-second intervals)
- Multi-panel visualization layout
- Background thread monitoring
What You'll Learn:
- llama.cpp metrics scraping
- PyNVML GPU monitoring
- Plotly FigureWidget for live updates
- Multi-threaded monitoring patterns (sketched after this list)
- Dashboard layout design
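The background-thread pattern itself needs nothing beyond the standard library. A minimal sketch (class and method names here are illustrative, not llamatelemetry APIs):

```python
import threading
import time

class MetricsPoller:
    """Poll a metrics source on a background thread at a fixed interval."""

    def __init__(self, poll_fn, interval=1.0):
        self.poll_fn = poll_fn      # callable returning the latest metrics
        self.interval = interval
        self.latest = None          # most recent reading, consumed by the dashboard
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        while not self._stop.is_set():
            self.latest = self.poll_fn()
            time.sleep(self.interval)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()
```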
Notebook 16: Production Observability Stack (45 min)¶
Complete production stack combining OpenTelemetry, GPU monitoring, Graphistry, and multi-dimensional Plotly visualizations.
Key Features:
- Full OpenTelemetry + GPU monitoring integration
- Advanced Graphistry trace visualization
- Comprehensive Plotly dashboards (2D timeline + 3D scatter)
- Multi-layer telemetry collection
- Production-ready patterns
- Real-world deployment examples
What You'll Learn:
- Production observability architecture
- Multi-layer telemetry integration (sketched after this list)
- Advanced visualization techniques
- Performance analysis workflows
- Deployment best practices
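As a taste of multi-layer integration, the sketch below attaches GPU readings to each inference span by combining the `setup_telemetry` and `GPUMonitor` APIs shown under Key Concepts; the `gpu.*` attribute names are illustrative:

```python
from llamatelemetry.telemetry import setup_telemetry
from llamatelemetry.telemetry.gpu_monitor import GPUMonitor

tracer, meter = setup_telemetry(
    service_name="llm-service",
    otlp_endpoint="http://localhost:4317",
)
monitor = GPUMonitor(gpu_id=0, interval=1.0)
monitor.start()

# One span per request, annotated with the GPU state observed at completion
with tracer.start_as_current_span("chat_completion") as span:
    # ... run inference here ...
    gpu = monitor.get_metrics()
    span.set_attribute("gpu.memory_used_mb", gpu["memory_used_mb"])
    span.set_attribute("gpu.temperature_c", gpu["temperature_c"])
```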
Comparison Matrix¶
| Feature | Notebook 14 | Notebook 15 | Notebook 16 |
|---|---|---|---|
| Focus | OpenTelemetry basics | Real-time monitoring | Complete production stack |
| Complexity | Intermediate | Intermediate-Advanced | Expert |
| Time | 45 min | 30 min | 45 min |
| OpenTelemetry | ✅ Full | ❌ | ✅ Full + Advanced |
| llama.cpp Metrics | ❌ | ✅ Full | ✅ Full |
| GPU Monitoring | ❌ | ✅ PyNVML | ✅ PyNVML |
| Graphistry | ✅ Basic | ❌ | ✅ Advanced |
| Plotly 2D | ✅ Static | ✅ Live Updates | ✅ Comprehensive |
| Plotly 3D | ❌ | ❌ | ✅ Model Internals |
| Live Dashboards | ❌ | ✅ FigureWidget | ✅ Multi-panel |
| Production Ready | ⚠️ Basics | ⚠️ Partial | ✅ Complete |
Key Concepts¶
1. OpenTelemetry Integration¶
llamatelemetry provides native OpenTelemetry support:
```python
from llamatelemetry.telemetry import setup_telemetry

# Initialize telemetry
tracer, meter = setup_telemetry(
    service_name="llm-service",
    otlp_endpoint="http://localhost:4317"
)

# Automatic instrumentation (`client` is an OpenAI-compatible client
# pointed at the llama-server instance, created elsewhere)
with tracer.start_as_current_span("chat_completion") as span:
    response = client.chat.create(
        messages=[{"role": "user", "content": "Hello"}],
        max_tokens=50
    )
    span.set_attribute("llm.response_tokens", response.usage.completion_tokens)
```
2. GPU-Aware Metrics¶
Real-time GPU monitoring with PyNVML:
```python
from llamatelemetry.telemetry.gpu_monitor import GPUMonitor

monitor = GPUMonitor(gpu_id=0, interval=1.0)
monitor.start()

# Get current metrics
metrics = monitor.get_metrics()
print(f"VRAM: {metrics['memory_used_mb']} MB")
print(f"Temperature: {metrics['temperature_c']}°C")
print(f"Power: {metrics['power_draw_w']} W")
```
3. llama.cpp Server Metrics¶
Integration with llama.cpp /metrics endpoint:
```python
import requests

response = requests.get("http://127.0.0.1:8080/metrics")
metrics = response.text

# Parse Prometheus-format metrics
for line in metrics.split('\n'):
    if line.startswith('llamacpp_'):
        print(line)
```
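If you want numeric values rather than raw lines, a small hypothetical helper can split unlabeled samples into name/value pairs. This assumes the plain Prometheus text exposition format; adjust the prefix to your server's actual metric names:

```python
def parse_prometheus(text):
    """Parse 'name value' samples from Prometheus text output into a dict."""
    values = {}
    for line in text.splitlines():
        # Skip comments (# HELP / # TYPE) and blank lines
        if not line or line.startswith('#'):
            continue
        name, _, value = line.partition(' ')
        try:
            values[name] = float(value)
        except ValueError:
            pass  # ignore samples we can't parse (e.g. labeled series)
    return values

stats = parse_prometheus(metrics)
print({k: v for k, v in stats.items() if k.startswith('llamacpp_')})
```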
4. Graph-Based Trace Visualization¶
Transform OpenTelemetry spans into interactive knowledge graphs:
```python
from llamatelemetry.graphistry import TracesGraphistry

# Collect spans (memory_exporter is an OpenTelemetry InMemorySpanExporter
# configured during telemetry setup)
spans = memory_exporter.get_finished_spans()

# Visualize on GPU 1
g = TracesGraphistry(spans=spans)
g.plot(
    render=True,
    point_title="span_name",
    point_color="duration_ms"
)
```
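`TracesGraphistry` does this conversion internally; conceptually, each span becomes a node and each parent link an edge. A hand-rolled sketch with `pandas` and the plain `graphistry` client, assuming `graphistry.register(...)` has already been called and `spans` is the list collected above:

```python
import pandas as pd
import graphistry  # assumes graphistry.register(...) was called earlier

# Each finished span becomes a node; parent links become edges
rows = [
    {
        "span_id": format(s.context.span_id, "016x"),
        "parent_id": format(s.parent.span_id, "016x") if s.parent else None,
        "span_name": s.name,
        "duration_ms": (s.end_time - s.start_time) / 1e6,  # ns -> ms
    }
    for s in spans
]
nodes = pd.DataFrame(rows)
edges = nodes.dropna(subset=["parent_id"])

g = graphistry.edges(edges, "parent_id", "span_id").nodes(nodes, "span_id")
g.plot()
```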
5. Real-Time Dashboards¶
Live Plotly dashboards with automatic updates:
```python
import plotly.graph_objects as go
from IPython.display import display

# Create live figure
fig = go.FigureWidget()

# Add empty traces; a background thread fills them in later
fig.add_trace(go.Scatter(name="VRAM"))
fig.add_trace(go.Scatter(name="Temperature"))

# Updates automatically from background thread
display(fig)
```
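The update side of the pattern: a background thread mutates the widget's traces inside `batch_update()` so each refresh triggers a single redraw. A sketch, assuming `monitor` is the `GPUMonitor` from the earlier example:

```python
import threading
import time
from collections import deque

history = deque(maxlen=120)  # keep the last two minutes at 1 s intervals

def refresh_loop():
    while True:
        m = monitor.get_metrics()
        history.append((time.time(), m["memory_used_mb"], m["temperature_c"]))
        t, vram, temp = zip(*history)
        with fig.batch_update():          # one redraw per refresh
            fig.data[0].x, fig.data[0].y = t, vram
            fig.data[1].x, fig.data[1].y = t, temp
        time.sleep(1.0)

threading.Thread(target=refresh_loop, daemon=True).start()
```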
Architecture¶
The Observability Trilogy uses llamatelemetry's split-GPU architecture:
```
┌─────────────────────────────────────────────────────────────┐
│                     Observability Stack                     │
├─────────────────────────┬───────────────────────────────────┤
│ GPU 0 (15GB)            │ GPU 1 (15GB)                      │
│                         │                                   │
│  ┌──────────────────┐   │   ┌───────────────────────────┐   │
│  │ llama-server     │   │   │ Graphistry                │   │
│  │ + /metrics       │   │   │ + Trace Graphs            │   │
│  │ + PyNVML         │   │   │ + RAPIDS cuGraph          │   │
│  └──────────────────┘   │   └───────────────────────────┘   │
│           │             │                 │                 │
│           ▼             │                 ▼                 │
│  ┌──────────────────┐   │   ┌───────────────────────────┐   │
│  │ OpenTelemetry    │───┼──▶│ Plotly Dashboards         │   │
│  │ + Traces         │   │   │ + Real-time 2D/3D         │   │
│  │ + Metrics        │   │   │ + Live Updates            │   │
│  │ + GPU Monitor    │   │   │ + Multi-panel Layout      │   │
│  └──────────────────┘   │   └───────────────────────────┘   │
└─────────────────────────┴───────────────────────────────────┘
```
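How the split is enforced is not shown in the diagram; one plausible arrangement is to launch llama-server against GPU 0 and pin the notebook's own CUDA work (RAPIDS, Graphistry GPU analytics) to GPU 1 via `CUDA_VISIBLE_DEVICES`. A sketch (flags and paths are illustrative):

```python
import os

# Pin this notebook's CUDA work to GPU 1.
# Must run before any CUDA library initializes in this process.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

# llama-server is launched separately against GPU 0, e.g. (illustrative):
#   CUDA_VISIBLE_DEVICES=0 ./llama-server -m model.gguf --metrics
```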
Semantic Conventions¶
llamatelemetry follows OpenTelemetry semantic conventions for LLM operations:
Span Attributes¶
```python
import json

span.set_attributes({
    # LLM Operation
    "llm.operation": "chat_completion",
    "llm.model": "gemma-3-1b-it-Q4_K_M",
    "llm.provider": "llama.cpp",

    # Request
    "llm.request.max_tokens": 100,
    "llm.request.temperature": 0.7,
    "llm.request.messages": json.dumps(messages),

    # Response
    "llm.response.completion_tokens": response.usage.completion_tokens,
    "llm.response.prompt_tokens": response.usage.prompt_tokens,
    "llm.response.total_tokens": response.usage.total_tokens,

    # Performance
    "llm.latency_ms": duration_ms,
    "llm.tokens_per_second": tokens_per_sec,
})
```
Metrics¶
```python
# Token throughput
tokens_per_second_gauge.set(tokens_per_sec)

# GPU metrics
gpu_memory_gauge.set(memory_used_mb)
gpu_temperature_gauge.set(temperature_c)
gpu_power_gauge.set(power_draw_w)

# Request latency
request_latency_histogram.record(duration_ms)
```
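A sketch of how such instruments might be created with the standard OpenTelemetry metrics API, using an observable gauge so GPU values are pulled at each export interval (instrument names are illustrative, and `monitor` is the `GPUMonitor` from earlier):

```python
from opentelemetry import metrics
from opentelemetry.metrics import CallbackOptions, Observation

meter = metrics.get_meter("llm-service")

# Histogram: record one latency sample per request
request_latency_histogram = meter.create_histogram(
    "llm.request.latency", unit="ms", description="End-to-end request latency"
)

# Observable gauge: the SDK invokes the callback at each export interval
def observe_gpu_memory(options: CallbackOptions):
    m = monitor.get_metrics()
    yield Observation(m["memory_used_mb"])

meter.create_observable_gauge(
    "gpu.memory.used", callbacks=[observe_gpu_memory], unit="MB"
)
```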
Recommended Learning Path¶
Follow this path to master LLM observability:
Path 1: OpenTelemetry Foundations (45 min)¶
Learn:
- OpenTelemetry basics
- Trace and span creation
- Semantic conventions
- Graphistry visualization
Path 2: Real-Time Monitoring (30 min)¶
Learn:
- llama.cpp metrics
- GPU monitoring with PyNVML
- Live Plotly dashboards
- Background monitoring threads
Path 3: Production Stack (45 min)¶
Learn:
- Complete observability architecture
- Multi-layer telemetry
- Advanced visualizations
- Production deployment
Complete Observability Path (2 hours)¶
Outcome: Build a complete production observability stack for LLM inference
Use Cases¶
Development & Testing¶
- Notebook 14: Quick observability setup for debugging
- Notebook 15: Real-time performance monitoring during development
- Notebook 16: Comprehensive testing with full telemetry
Production Deployment¶
- OpenTelemetry: Distributed tracing across services
- GPU Monitoring: Resource utilization tracking
- Real-time Dashboards: Live performance visibility
- Graphistry: Trace analysis and debugging
Performance Optimization¶
- llama.cpp metrics: Identify bottlenecks
- GPU monitoring: Optimize VRAM usage
- Trace analysis: Find slow operations
- Token throughput: Measure improvements (see the sketch below)
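Throughput measurement amounts to timing the request and dividing by completion tokens. A sketch, assuming `client` is the same OpenAI-compatible client used in the earlier examples:

```python
import time

start = time.perf_counter()
response = client.chat.create(
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=50,
)
elapsed = time.perf_counter() - start

tokens_per_sec = response.usage.completion_tokens / elapsed
print(f"{tokens_per_sec:.1f} tokens/s over {elapsed:.2f}s")
```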
Next Steps¶
Ready to build observable LLM systems?
- Tutorial 14: OpenTelemetry - Start here
- Tutorial 15: Real-time Monitoring - Add live monitoring
- Tutorial 16: Production Stack - Complete the stack
Or explore specific topics:
- OpenTelemetry Integration - Deep dive into OTel
- Real-time Monitoring - GPU monitoring details
- Production Stack - Deployment patterns
- Traces & Spans - Trace design
- Metrics & GPU Monitoring - Metrics collection