Observability Overview

llamatelemetry v0.1.0 introduces the Observability Trilogy: three notebooks (14-16) that together deliver production-grade observability for LLM inference workloads.


What is the Observability Trilogy?

The Observability Trilogy consists of three comprehensive notebooks that build upon each other to create a complete production observability stack:

Notebook 14: OpenTelemetry LLM Observability (45 min)

Full OpenTelemetry integration with semantic conventions for LLM inference.

Key Features:

  • Complete OpenTelemetry setup (traces, metrics, logs)
  • Semantic conventions for LLM operations
  • Distributed context propagation
  • OTLP export to popular backends (Jaeger, Grafana, DataDog)
  • Span hierarchy and relationships
  • Graph-based trace visualization with Graphistry

What You'll Learn:

  • OpenTelemetry fundamentals
  • LLM-specific semantic attributes
  • Trace hierarchy design
  • Exporter configuration
  • Graphistry visualization basics

Go to Tutorial 14


Notebook 15: Real-time Performance Monitoring (30 min)

Live GPU monitoring with real-time Plotly dashboards showing llama.cpp metrics.

Key Features:

  • llama.cpp /metrics endpoint integration
  • PyNVML GPU monitoring (VRAM, temperature, power, utilization)
  • Real-time Plotly FigureWidget dashboards
  • Live metric updates (1-second intervals)
  • Multi-panel visualization layout
  • Background thread monitoring

What You'll Learn:

  • llama.cpp metrics scraping
  • PyNVML GPU monitoring
  • Plotly FigureWidget for live updates
  • Multi-threaded monitoring patterns
  • Dashboard layout design

Go to Tutorial 15


Notebook 16: Production Observability Stack (45 min)

Complete production stack combining OpenTelemetry, GPU monitoring, Graphistry, and multi-dimensional Plotly visualizations.

Key Features:

  • Full OpenTelemetry + GPU monitoring integration
  • Advanced Graphistry trace visualization
  • Comprehensive Plotly dashboards (2D timeline + 3D scatter)
  • Multi-layer telemetry collection
  • Production-ready patterns
  • Real-world deployment examples

What You'll Learn:

  • Production observability architecture
  • Multi-layer telemetry integration
  • Advanced visualization techniques
  • Performance analysis workflows
  • Deployment best practices

Go to Tutorial 16


Comparison Matrix

Feature           | Notebook 14          | Notebook 15           | Notebook 16
------------------|----------------------|-----------------------|--------------------------
Focus             | OpenTelemetry basics | Real-time monitoring  | Complete production stack
Complexity        | Intermediate         | Intermediate-Advanced | Expert
Time              | 45 min               | 30 min                | 45 min
OpenTelemetry     | ✅ Full              | —                     | ✅ Full + Advanced
llama.cpp Metrics | —                    | ✅ Full               | ✅ Full
GPU Monitoring    | —                    | ✅ PyNVML             | ✅ PyNVML
Graphistry        | ✅ Basic             | —                     | ✅ Advanced
Plotly 2D         | ✅ Static            | ✅ Live Updates       | ✅ Comprehensive
Plotly 3D         | —                    | —                     | ✅ Model Internals
Live Dashboards   | —                    | ✅ FigureWidget       | ✅ Multi-panel
Production Ready  | ⚠️ Basics            | ⚠️ Partial            | ✅ Complete

Key Concepts

1. OpenTelemetry Integration

llamatelemetry provides native OpenTelemetry support:

from llamatelemetry.telemetry import setup_telemetry

# Initialize telemetry
tracer, meter = setup_telemetry(
    service_name="llm-service",
    otlp_endpoint="http://localhost:4317"
)

# Manually instrument a chat completion with a span
with tracer.start_as_current_span("chat_completion") as span:
    response = client.chat.create(
        messages=[{"role": "user", "content": "Hello"}],
        max_tokens=50
    )
    span.set_attribute("llm.response_tokens", response.usage.completion_tokens)

2. GPU-Aware Metrics

Real-time GPU monitoring with PyNVML:

from llamatelemetry.telemetry.gpu_monitor import GPUMonitor

monitor = GPUMonitor(gpu_id=0, interval=1.0)
monitor.start()

# Get current metrics
metrics = monitor.get_metrics()
print(f"VRAM: {metrics['memory_used_mb']} MB")
print(f"Temperature: {metrics['temperature_c']}°C")
print(f"Power: {metrics['power_draw_w']} W")

3. llama.cpp Server Metrics

Integration with llama.cpp /metrics endpoint:

import requests

response = requests.get("http://127.0.0.1:8080/metrics")
metrics = response.text

# Parse Prometheus-format metrics
for line in metrics.split('\n'):
    if line.startswith('llamacpp_'):
        print(line)
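
The Prometheus exposition format is line-oriented ("name{labels} value"), so a small parser suffices for dashboards. A minimal sketch that ignores labels, comments, and timestamps:

def parse_prometheus(text):
    """Parse Prometheus exposition text into {metric_name: value}."""
    values = {}
    for line in text.splitlines():
        if not line or line.startswith('#'):
            continue  # skip blanks and HELP/TYPE comments
        name, _, value = line.rpartition(' ')
        name = name.split('{')[0]  # drop any {label="..."} block
        try:
            values[name] = float(value)
        except ValueError:
            pass
    return values

llama_metrics = {k: v for k, v in parse_prometheus(metrics).items()
                 if k.startswith('llamacpp_')}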

4. Graph-Based Trace Visualization

Transform OpenTelemetry spans into interactive knowledge graphs:

from llamatelemetry.graphistry import TracesGraphistry

# Collect spans from an in-memory exporter configured at setup
spans = memory_exporter.get_finished_spans()

# Visualize on GPU 1
g = TracesGraphistry(spans=spans)
g.plot(
    render=True,
    point_title="span_name",
    point_color="duration_ms"
)

5. Real-Time Dashboards

Live Plotly dashboards with automatic updates:

from IPython.display import display
import plotly.graph_objects as go

# Create a live figure (FigureWidget requires a Jupyter environment)
fig = go.FigureWidget()

# Add empty traces; data is appended as it arrives
fig.add_trace(go.Scatter(x=[], y=[], name="VRAM"))
fig.add_trace(go.Scatter(x=[], y=[], name="Temperature"))

# Updates automatically from a background thread (see below)
display(fig)
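
The background update itself is plain threading: a worker polls the monitor and pushes new points into the widget. A sketch, reusing fig and the GPUMonitor instance from the earlier examples (batch_update groups both trace changes into one redraw):

import threading
import time

def update_loop(fig, monitor, n_points=300):
    xs, vram, temp = [], [], []
    while True:
        m = monitor.get_metrics()
        xs.append(time.time())
        vram.append(m['memory_used_mb'])
        temp.append(m['temperature_c'])
        with fig.batch_update():  # one redraw for both traces
            fig.data[0].x, fig.data[0].y = xs[-n_points:], vram[-n_points:]
            fig.data[1].x, fig.data[1].y = xs[-n_points:], temp[-n_points:]
        time.sleep(1.0)

threading.Thread(target=update_loop, args=(fig, monitor), daemon=True).start()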

Architecture

The Observability Trilogy uses llamatelemetry's split-GPU architecture:

┌─────────────────────────────────────────────────────────────┐
│                     Observability Stack                     │
├─────────────────────────┬───────────────────────────────────┤
│      GPU 0 (15GB)       │           GPU 1 (15GB)            │
│                         │                                   │
│  ┌──────────────────┐   │   ┌───────────────────────────┐   │
│  │  llama-server    │   │   │  Graphistry               │   │
│  │  + /metrics      │   │   │  + Trace Graphs           │   │
│  │  + PyNVML        │   │   │  + RAPIDS cuGraph         │   │
│  └──────────────────┘   │   └───────────────────────────┘   │
│           │             │                 │                 │
│           ▼             │                 ▼                 │
│  ┌──────────────────┐   │   ┌───────────────────────────┐   │
│  │ OpenTelemetry    │───┼──▶│  Plotly Dashboards        │   │
│  │ + Traces         │   │   │  + Real-time 2D/3D        │   │
│  │ + Metrics        │   │   │  + Live Updates           │   │
│  │ + GPU Monitor    │   │   │  + Multi-panel Layout     │   │
│  └──────────────────┘   │   └───────────────────────────┘   │
└─────────────────────────┴───────────────────────────────────┘
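
One way to realize this split (a sketch; llamatelemetry may assign devices differently, and the model path and flags here are illustrative) is to pin each process to its GPU via CUDA_VISIBLE_DEVICES:

import os
import subprocess

# Pin llama-server (and its /metrics endpoint) to GPU 0
server_env = {**os.environ, "CUDA_VISIBLE_DEVICES": "0"}
subprocess.Popen(
    ["llama-server", "-m", "model.gguf", "--port", "8080", "--metrics"],
    env=server_env,
)

# Expose only GPU 1 to this process, before importing RAPIDS/Graphistry
os.environ["CUDA_VISIBLE_DEVICES"] = "1"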

Semantic Conventions

llamatelemetry follows OpenTelemetry semantic conventions for LLM operations:

Span Attributes

import json  # needed for llm.request.messages below

span.set_attributes({
    # LLM Operation
    "llm.operation": "chat_completion",
    "llm.model": "gemma-3-1b-it-Q4_K_M",
    "llm.provider": "llama.cpp",

    # Request
    "llm.request.max_tokens": 100,
    "llm.request.temperature": 0.7,
    "llm.request.messages": json.dumps(messages),

    # Response
    "llm.response.completion_tokens": response.usage.completion_tokens,
    "llm.response.prompt_tokens": response.usage.prompt_tokens,
    "llm.response.total_tokens": response.usage.total_tokens,

    # Performance
    "llm.latency_ms": duration_ms,
    "llm.tokens_per_second": tokens_per_sec,
})
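
The performance attributes above come from a simple wall-clock measurement around the request, for example:

import time

start = time.perf_counter()
response = client.chat.create(
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=100,
)
duration_ms = (time.perf_counter() - start) * 1000
tokens_per_sec = response.usage.completion_tokens / (duration_ms / 1000)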

Metrics

# Token throughput
tokens_per_second_gauge.set(tokens_per_sec)

# GPU metrics
gpu_memory_gauge.set(memory_used_mb)
gpu_temperature_gauge.set(temperature_c)
gpu_power_gauge.set(power_draw_w)

# Request latency
request_latency_histogram.record(duration_ms)
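
These instruments come from the meter returned by setup_telemetry. A sketch of their creation with the standard OpenTelemetry metrics API (metric names and units here are illustrative, not llamatelemetry's exact conventions):

# Gauges for point-in-time values (create_gauge needs opentelemetry-sdk >= 1.23)
tokens_per_second_gauge = meter.create_gauge("llm.tokens_per_second", unit="1/s")
gpu_memory_gauge = meter.create_gauge("gpu.memory.used", unit="MB")
gpu_temperature_gauge = meter.create_gauge("gpu.temperature", unit="Cel")
gpu_power_gauge = meter.create_gauge("gpu.power.draw", unit="W")

# Histogram for request latency distributions
request_latency_histogram = meter.create_histogram("llm.request.duration", unit="ms")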

Learning Paths

Follow these paths to master LLM observability:

Path 1: OpenTelemetry Foundations (45 min)

Tutorial 14: OpenTelemetry LLM Observability

Learn:

  • OpenTelemetry basics
  • Trace and span creation
  • Semantic conventions
  • Graphistry visualization

Path 2: Real-Time Monitoring (30 min)

Tutorial 15: Real-time Performance Monitoring

Learn:

  • llama.cpp metrics
  • GPU monitoring with PyNVML
  • Live Plotly dashboards
  • Background monitoring threads

Path 3: Production Stack (45 min)

Tutorial 16: Production Observability Stack

Learn:

  • Complete observability architecture
  • Multi-layer telemetry
  • Advanced visualizations
  • Production deployment

Complete Observability Path (2 hours)

Tutorial 01 → Tutorial 03 → Tutorial 14 → Tutorial 15 → Tutorial 16

Outcome: Build a complete production observability stack for LLM inference


Use Cases

Development & Testing

  • Notebook 14: Quick observability setup for debugging
  • Notebook 15: Real-time performance monitoring during development
  • Notebook 16: Comprehensive testing with full telemetry

Production Deployment

  • OpenTelemetry: Distributed tracing across services
  • GPU Monitoring: Resource utilization tracking
  • Real-time Dashboards: Live performance visibility
  • Graphistry: Trace analysis and debugging

Performance Optimization

  • llama.cpp metrics: Identify bottlenecks
  • GPU monitoring: Optimize VRAM usage
  • Trace analysis: Find slow operations
  • Token throughput: Measure improvements

Next Steps

Ready to build observable LLM systems?

  1. Tutorial 14: OpenTelemetry - Start here
  2. Tutorial 15: Real-time Monitoring - Add live monitoring
  3. Tutorial 16: Production Stack - Complete the stack

Or explore specific topics: