Observability Overview¶
llamatelemetry v0.1.0 introduces the Observability Trilogy: three notebooks (14-16) that together deliver production-grade observability for LLM inference workloads.
What is the Observability Trilogy?¶
The Observability Trilogy consists of three comprehensive notebooks that build upon each other to create a complete production observability stack:
Notebook 14: OpenTelemetry LLM Observability (45 min)¶
Full OpenTelemetry integration with semantic conventions for LLM inference.
Key Features:
- Complete OpenTelemetry setup (traces, metrics, logs)
- Semantic conventions for LLM operations
- Distributed context propagation
- OTLP export to popular backends (Jaeger, Grafana, DataDog)
- Span hierarchy and relationships
- Graph-based trace visualization with Graphistry
What You'll Learn:
- OpenTelemetry fundamentals
- LLM-specific semantic attributes
- Trace hierarchy design
- Exporter configuration (a minimal sketch follows this list)
- Graphistry visualization basics
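In Notebook 14, exporter wiring is handled by llamatelemetry's `setup_telemetry` (shown under Key Concepts below). For reference, a minimal sketch of the equivalent configuration using only the upstream `opentelemetry-sdk`, assuming an OTLP/gRPC collector on localhost:

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Tag all spans with a service name so backends (Jaeger, Grafana, DataDog) can group them
resource = Resource.create({"service.name": "llm-service"})

# Export spans in batches over OTLP/gRPC to a local collector
provider = TracerProvider(resource=resource)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317"))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("llm-service")
```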
Notebook 15: Real-time Performance Monitoring (30 min)¶
Live GPU monitoring with real-time Plotly dashboards showing llama.cpp metrics.
Key Features:
- llama.cpp /metrics endpoint integration
- PyNVML GPU monitoring (VRAM, temperature, power, utilization)
- Real-time Plotly FigureWidget dashboards
- Live metric updates (1-second intervals)
- Multi-panel visualization layout
- Background thread monitoring
What You'll Learn:
- llama.cpp metrics scraping
- PyNVML GPU monitoring
- Plotly FigureWidget for live updates
- Multi-threaded monitoring patterns (sketched after this list)
- Dashboard layout design
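The background-thread pattern itself needs nothing beyond the standard library. A minimal sketch (class and method names here are illustrative, not llamatelemetry APIs):

```python
import threading
import time

class MetricsPoller:
    """Poll a metrics source on a background thread at a fixed interval."""

    def __init__(self, poll_fn, interval=1.0):
        self.poll_fn = poll_fn      # callable returning the latest metrics
        self.interval = interval
        self.latest = None          # most recent reading, consumed by the dashboard
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        while not self._stop.is_set():
            self.latest = self.poll_fn()
            time.sleep(self.interval)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()
```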
Notebook 16: Production Observability Stack (45 min)¶
Complete production stack combining OpenTelemetry, GPU monitoring, Graphistry, and multi-dimensional Plotly visualizations.
Key Features:
- Full OpenTelemetry + GPU monitoring integration
- Advanced Graphistry trace visualization
- Comprehensive Plotly dashboards (2D timeline + 3D scatter)
- Multi-layer telemetry collection
- Production-ready patterns
- Real-world deployment examples
What You'll Learn:
- Production observability architecture
- Multi-layer telemetry integration (sketched after this list)
- Advanced visualization techniques
- Performance analysis workflows
- Deployment best practices
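As a taste of multi-layer integration, the sketch below attaches GPU readings to each inference span by combining the `setup_telemetry` and `GPUMonitor` APIs shown under Key Concepts; the `gpu.*` attribute names are illustrative:

```python
from llamatelemetry.telemetry import setup_telemetry
from llamatelemetry.telemetry.gpu_monitor import GPUMonitor

tracer, meter = setup_telemetry(
    service_name="llm-service",
    otlp_endpoint="http://localhost:4317",
)
monitor = GPUMonitor(gpu_id=0, interval=1.0)
monitor.start()

# One span per request, annotated with the GPU state observed at completion
with tracer.start_as_current_span("chat_completion") as span:
    # ... run inference here ...
    gpu = monitor.get_metrics()
    span.set_attribute("gpu.memory_used_mb", gpu["memory_used_mb"])
    span.set_attribute("gpu.temperature_c", gpu["temperature_c"])
```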
Comparison Matrix¶
| Feature | Notebook 14 | Notebook 15 | Notebook 16 |
|---|---|---|---|
| Focus | OpenTelemetry basics | Real-time monitoring | Complete production stack |
| Complexity | Intermediate | Intermediate-Advanced | Expert |
| Time | 45 min | 30 min | 45 min |
| OpenTelemetry | ✅ Full | ❌ | ✅ Full + Advanced |
| llama.cpp Metrics | ❌ | ✅ Full | ✅ Full |
| GPU Monitoring | ❌ | ✅ PyNVML | ✅ PyNVML |
| Graphistry | ✅ Basic | ❌ | ✅ Advanced |
| Plotly 2D | ✅ Static | ✅ Live Updates | ✅ Comprehensive |
| Plotly 3D | ❌ | ❌ | ✅ Model Internals |
| Live Dashboards | ❌ | ✅ FigureWidget | ✅ Multi-panel |
| Production Ready | ⚠️ Basics | ⚠️ Partial | ✅ Complete |
Key Concepts¶
1. OpenTelemetry Integration¶
llamatelemetry provides native OpenTelemetry support:
```python
from llamatelemetry.telemetry import setup_telemetry

# Initialize telemetry
tracer, meter = setup_telemetry(
    service_name="llm-service",
    otlp_endpoint="http://localhost:4317"
)

# Automatic instrumentation (`client` is an OpenAI-compatible client
# pointed at the llama-server instance, created elsewhere)
with tracer.start_as_current_span("chat_completion") as span:
    response = client.chat.create(
        messages=[{"role": "user", "content": "Hello"}],
        max_tokens=50
    )
    span.set_attribute("llm.response_tokens", response.usage.completion_tokens)
```
2. GPU-Aware Metrics¶
Real-time GPU monitoring with PyNVML:
```python
from llamatelemetry.telemetry.gpu_monitor import GPUMonitor

monitor = GPUMonitor(gpu_id=0, interval=1.0)
monitor.start()

# Get current metrics
metrics = monitor.get_metrics()
print(f"VRAM: {metrics['memory_used_mb']} MB")
print(f"Temperature: {metrics['temperature_c']}°C")
print(f"Power: {metrics['power_draw_w']} W")
```
3. llama.cpp Server Metrics¶
Integration with llama.cpp /metrics endpoint:
```python
import requests

response = requests.get("http://127.0.0.1:8080/metrics")
metrics = response.text

# Parse Prometheus-format metrics
for line in metrics.split('\n'):
    if line.startswith('llamacpp_'):
        print(line)
```
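If you want numeric values rather than raw lines, a small hypothetical helper can split unlabeled samples into name/value pairs. This assumes the plain Prometheus text exposition format; adjust the prefix to your server's actual metric names:

```python
def parse_prometheus(text):
    """Parse 'name value' samples from Prometheus text output into a dict."""
    values = {}
    for line in text.splitlines():
        # Skip comments (# HELP / # TYPE) and blank lines
        if not line or line.startswith('#'):
            continue
        name, _, value = line.partition(' ')
        try:
            values[name] = float(value)
        except ValueError:
            pass  # ignore samples we can't parse (e.g. labeled series)
    return values

stats = parse_prometheus(metrics)
print({k: v for k, v in stats.items() if k.startswith('llamacpp_')})
```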
4. Graph-Based Trace Visualization¶
Transform OpenTelemetry spans into interactive knowledge graphs:
```python
from llamatelemetry.graphistry import TracesGraphistry

# Collect spans (memory_exporter is an OpenTelemetry InMemorySpanExporter
# configured during telemetry setup)
spans = memory_exporter.get_finished_spans()

# Visualize on GPU 1
g = TracesGraphistry(spans=spans)
g.plot(
    render=True,
    point_title="span_name",
    point_color="duration_ms"
)
```
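`TracesGraphistry` does this conversion internally; conceptually, each span becomes a node and each parent link an edge. A hand-rolled sketch with `pandas` and the plain `graphistry` client, assuming `graphistry.register(...)` has already been called and `spans` is the list collected above:

```python
import pandas as pd
import graphistry  # assumes graphistry.register(...) was called earlier

# Each finished span becomes a node; parent links become edges
rows = [
    {
        "span_id": format(s.context.span_id, "016x"),
        "parent_id": format(s.parent.span_id, "016x") if s.parent else None,
        "span_name": s.name,
        "duration_ms": (s.end_time - s.start_time) / 1e6,  # ns -> ms
    }
    for s in spans
]
nodes = pd.DataFrame(rows)
edges = nodes.dropna(subset=["parent_id"])

g = graphistry.edges(edges, "parent_id", "span_id").nodes(nodes, "span_id")
g.plot()
```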
5. Real-Time Dashboards¶
Live Plotly dashboards with automatic updates:
```python
import plotly.graph_objects as go
from IPython.display import display

# Create live figure
fig = go.FigureWidget()

# Add empty traces; a background thread fills them in later
fig.add_trace(go.Scatter(name="VRAM"))
fig.add_trace(go.Scatter(name="Temperature"))

# Updates automatically from background thread
display(fig)
```
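The update side of the pattern: a background thread mutates the widget's traces inside `batch_update()` so each refresh triggers a single redraw. A sketch, assuming `monitor` is the `GPUMonitor` from the earlier example:

```python
import threading
import time
from collections import deque

history = deque(maxlen=120)  # keep the last two minutes at 1 s intervals

def refresh_loop():
    while True:
        m = monitor.get_metrics()
        history.append((time.time(), m["memory_used_mb"], m["temperature_c"]))
        t, vram, temp = zip(*history)
        with fig.batch_update():          # one redraw per refresh
            fig.data[0].x, fig.data[0].y = t, vram
            fig.data[1].x, fig.data[1].y = t, temp
        time.sleep(1.0)

threading.Thread(target=refresh_loop, daemon=True).start()
```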
Architecture¶
The Observability Trilogy uses llamatelemetry's split-GPU architecture:
```
┌─────────────────────────────────────────────────────────────┐
│                     Observability Stack                     │
├─────────────────────────┬───────────────────────────────────┤
│ GPU 0 (15GB)            │ GPU 1 (15GB)                      │
│                         │                                   │
│  ┌──────────────────┐   │   ┌───────────────────────────┐   │
│  │ llama-server     │   │   │ Graphistry                │   │
│  │ + /metrics       │   │   │ + Trace Graphs            │   │
│  │ + PyNVML         │   │   │ + RAPIDS cuGraph          │   │
│  └──────────────────┘   │   └───────────────────────────┘   │
│           │             │                 │                 │
│           ▼             │                 ▼                 │
│  ┌──────────────────┐   │   ┌───────────────────────────┐   │
│  │ OpenTelemetry    │───┼──▶│ Plotly Dashboards         │   │
│  │ + Traces         │   │   │ + Real-time 2D/3D         │   │
│  │ + Metrics        │   │   │ + Live Updates            │   │
│  │ + GPU Monitor    │   │   │ + Multi-panel Layout      │   │
│  └──────────────────┘   │   └───────────────────────────┘   │
└─────────────────────────┴───────────────────────────────────┘
```
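How the split is enforced is not shown in the diagram; one plausible arrangement is to launch llama-server against GPU 0 and pin the notebook's own CUDA work (RAPIDS, Graphistry GPU analytics) to GPU 1 via `CUDA_VISIBLE_DEVICES`. A sketch (flags and paths are illustrative):

```python
import os

# Pin this notebook's CUDA work to GPU 1.
# Must run before any CUDA library initializes in this process.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

# llama-server is launched separately against GPU 0, e.g. (illustrative):
#   CUDA_VISIBLE_DEVICES=0 ./llama-server -m model.gguf --metrics
```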
Semantic Conventions¶
llamatelemetry follows OpenTelemetry semantic conventions for LLM operations:
Span Attributes¶
```python
import json

span.set_attributes({
    # LLM Operation
    "llm.operation": "chat_completion",
    "llm.model": "gemma-3-1b-it-Q4_K_M",
    "llm.provider": "llama.cpp",

    # Request
    "llm.request.max_tokens": 100,
    "llm.request.temperature": 0.7,
    "llm.request.messages": json.dumps(messages),

    # Response
    "llm.response.completion_tokens": response.usage.completion_tokens,
    "llm.response.prompt_tokens": response.usage.prompt_tokens,
    "llm.response.total_tokens": response.usage.total_tokens,

    # Performance
    "llm.latency_ms": duration_ms,
    "llm.tokens_per_second": tokens_per_sec,
})
```
Metrics¶
```python
# Token throughput
tokens_per_second_gauge.set(tokens_per_sec)

# GPU metrics
gpu_memory_gauge.set(memory_used_mb)
gpu_temperature_gauge.set(temperature_c)
gpu_power_gauge.set(power_draw_w)

# Request latency
request_latency_histogram.record(duration_ms)
```
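A sketch of how such instruments might be created with the standard OpenTelemetry metrics API, using an observable gauge so GPU values are pulled at each export interval (instrument names are illustrative, and `monitor` is the `GPUMonitor` from earlier):

```python
from opentelemetry import metrics
from opentelemetry.metrics import CallbackOptions, Observation

meter = metrics.get_meter("llm-service")

# Histogram: record one latency sample per request
request_latency_histogram = meter.create_histogram(
    "llm.request.latency", unit="ms", description="End-to-end request latency"
)

# Observable gauge: the SDK invokes the callback at each export interval
def observe_gpu_memory(options: CallbackOptions):
    m = monitor.get_metrics()
    yield Observation(m["memory_used_mb"])

meter.create_observable_gauge(
    "gpu.memory.used", callbacks=[observe_gpu_memory], unit="MB"
)
```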
Recommended Learning Path¶
Follow this path to master LLM observability:
Path 1: OpenTelemetry Foundations (45 min)¶
Learn:
- OpenTelemetry basics
- Trace and span creation
- Semantic conventions
- Graphistry visualization
Path 2: Real-Time Monitoring (30 min)¶
Learn:
- llama.cpp metrics
- GPU monitoring with PyNVML
- Live Plotly dashboards
- Background monitoring threads
Path 3: Production Stack (45 min)¶
Learn:
- Complete observability architecture
- Multi-layer telemetry
- Advanced visualizations
- Production deployment
Complete Observability Path (2 hours)¶
Outcome: Build a complete production observability stack for LLM inference
Use Cases¶
Development & Testing¶
- Notebook 14: Quick observability setup for debugging
- Notebook 15: Real-time performance monitoring during development
- Notebook 16: Comprehensive testing with full telemetry
Production Deployment¶
- OpenTelemetry: Distributed tracing across services
- GPU Monitoring: Resource utilization tracking
- Real-time Dashboards: Live performance visibility
- Graphistry: Trace analysis and debugging
Performance Optimization¶
- llama.cpp metrics: Identify bottlenecks
- GPU monitoring: Optimize VRAM usage
- Trace analysis: Find slow operations
- Token throughput: Measure improvements (see the sketch below)
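Throughput measurement amounts to timing the request and dividing by completion tokens. A sketch, assuming `client` is the same OpenAI-compatible client used in the earlier examples:

```python
import time

start = time.perf_counter()
response = client.chat.create(
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=50,
)
elapsed = time.perf_counter() - start

tokens_per_sec = response.usage.completion_tokens / elapsed
print(f"{tokens_per_sec:.1f} tokens/s over {elapsed:.2f}s")
```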
Next Steps¶
Ready to build observable LLM systems?
- Tutorial 14: OpenTelemetry - Start here
- Tutorial 15: Real-time Monitoring - Add live monitoring
- Tutorial 16: Production Stack - Complete the stack
Or explore specific topics:
- OpenTelemetry Integration - Deep dive into OTel
- Real-time Monitoring - GPU monitoring details
- Production Stack - Deployment patterns
- Traces & Spans - Trace design
- Metrics & GPU Monitoring - Metrics collection