
llamatelemetry v0.1.0

CUDA-first OpenTelemetry Python SDK for LLM inference observability and explainability

llamatelemetry combines high-performance GGUF inference, multi-GPU execution, production-grade observability, and interactive visualization into a complete LLM observability solution, optimized for Kaggle's dual Tesla T4 GPUs.


What You Get

llamatelemetry provides production-ready observability for LLM inference workloads:


  • LLM request tracing with semantic attributes and distributed context propagation (sketch after this list)
  • GPU-aware metrics (latency, tokens/sec, VRAM usage, temperature, power draw)
  • Split-GPU workflow (GPU 0: inference, GPU 1: analytics/visualization)
  • Graph-based trace visualization with Graphistry interactive dashboards
  • Real-time performance monitoring with live Plotly dashboards
  • Production observability stack with multi-layer telemetry collection
  • 16 comprehensive tutorials covering foundation to production workflows
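
The context propagation mentioned above uses OpenTelemetry's standard W3C trace-context mechanism; a minimal sketch (plain OTel API, independent of llamatelemetry):

from opentelemetry.propagate import inject, extract

headers = {}            # e.g. outgoing HTTP request headers
inject(headers)         # writes a W3C 'traceparent' entry from the current span context
ctx = extract(headers)  # receiving side: rebuild the remote context from the headers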

Key Features

CUDA Inference

  • llama.cpp GGUF inference with 29 quantization types
  • NCCL-aware multi-GPU execution (dual T4 tensor parallelism; sketch below)
  • FlashAttention v2, KV-cache optimization
  • Continuous batching for concurrent requests
  • 1B-70B parameter model support
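
As a sketch of the dual-GPU execution listed above, the same ServerManager API shown in the Quick Start below can split a model across both T4s; the 0.5,0.5 ratio and model path are illustrative, not tuned values:

from llamatelemetry.server import ServerManager

server = ServerManager()
server.start_server(
    model_path="/kaggle/working/models/model.gguf",  # any local GGUF file
    gpu_layers=99,            # offload all layers to GPU
    tensor_split="0.5,0.5",   # even split across GPU 0 and GPU 1
    flash_attn=1,
)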

LLM Observability

  • OpenTelemetry traces, metrics, and logs
  • GPU-native resource detection
  • OTLP export (gRPC/HTTP) to Grafana, Jaeger, DataDog
  • llama.cpp /metrics endpoint integration
  • PyNVML GPU monitoring (sketch below)
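
The PyNVML monitoring above reduces to a handful of NVML calls; a minimal standalone sketch using pynvml directly:

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)          # GPU 0

mem = pynvml.nvmlDeviceGetMemoryInfo(handle)           # bytes
temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
power = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000  # milliwatts -> watts
util = pynvml.nvmlDeviceGetUtilizationRates(handle)

print(f"VRAM {mem.used/1e9:.1f}/{mem.total/1e9:.1f} GB, "
      f"{temp}°C, {power:.0f} W, {util.gpu}% util")
pynvml.nvmlShutdown()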

Visualization & Analytics

  • RAPIDS cuGraph + Graphistry integration
  • Interactive trace graphs (2D network visualization)
  • Real-time Plotly dashboards (2D/3D; sketch below)
  • Knowledge graph extraction
  • Neural network architecture visualization
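
For the real-time dashboards, the basic pattern is a Plotly FigureWidget updated in place in a Jupyter/Kaggle notebook; a bare-bones sketch (record_throughput is a hypothetical stand-in for a real telemetry feed):

import plotly.graph_objects as go

fig = go.FigureWidget(data=[go.Scatter(x=[], y=[], mode="lines", name="tokens/sec")])
fig.update_layout(xaxis_title="request #", yaxis_title="tokens/sec")
fig  # displaying the widget in a notebook cell renders the live chart

def record_throughput(i: int, tokens_per_sec: float) -> None:
    # Append a point; the widget re-renders in place
    with fig.batch_update():
        fig.data[0].x = fig.data[0].x + (i,)
        fig.data[0].y = fig.data[0].y + (tokens_per_sec,)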

Quick Start Example

# Install
!pip install -q --no-cache-dir --force-reinstall \
    git+https://github.com/llamatelemetry/llamatelemetry.git@v0.1.0

# Download model
from huggingface_hub import hf_hub_download
model_path = hf_hub_download(
    repo_id="unsloth/gemma-3-1b-it-GGUF",
    filename="gemma-3-1b-it-Q4_K_M.gguf",
    local_dir="/kaggle/working/models",
)

# Start server
from llamatelemetry.server import ServerManager
server = ServerManager()
server.start_server(
    model_path=model_path,
    gpu_layers=99,
    tensor_split="1.0,0.0",  # GPU 0 only
    flash_attn=1,
)

# Run inference
from llamatelemetry.api.client import LlamaCppClient
client = LlamaCppClient(base_url="http://127.0.0.1:8080")
response = client.chat.create(
    messages=[{"role": "user", "content": "What is CUDA?"}],
    max_tokens=80,
)
print(response.choices[0].message.content)

Telemetry Setup Example

from llamatelemetry.telemetry import setup_telemetry
from llamatelemetry.api.client import LlamaCppClient

# Initialize telemetry
tracer, meter = setup_telemetry(
    service_name="llm-service",
    otlp_endpoint="http://localhost:4317"
)

# Client automatically instruments requests
client = LlamaCppClient(base_url="http://127.0.0.1:8080")

with tracer.start_as_current_span("chat_completion") as span:
    response = client.chat.create(
        messages=[{"role": "user", "content": "Explain quantum computing"}],
        max_tokens=200
    )
    span.set_attribute("llm.response_tokens", response.usage.completion_tokens)
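
For reference, setup_telemetry() is assumed to wrap roughly the following standard OpenTelemetry SDK wiring; this sketch uses plain OTel and is not the library's exact internals:

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

resource = Resource.create({"service.name": "llm-service"})
provider = TracerProvider(resource=resource)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317"))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("llm-service")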

Split-GPU Workflow Example

# GPU 0: LLM inference
server = ServerManager()
server.start_server(
    model_path=model_path,
    gpu_layers=99,
    tensor_split="1.0,0.0",  # GPU 0 only
    flash_attn=1,
)

# GPU 1: Graph visualization
import cudf        # RAPIDS GPU dataframes, used by the analytics stack
import graphistry
graphistry.register(api=3, protocol="https", server="hub.graphistry.com")

# Visualize traces on GPU 1
from llamatelemetry.graphistry import TracesGraphistry
g = TracesGraphistry(spans=collected_spans)  # spans gathered from the instrumented client
g.plot(render=True)
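
One common way to keep the analytics stack pinned to GPU 1 is to restrict device visibility before any CUDA library initializes; this is a general CUDA technique, not necessarily what llamatelemetry does internally:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1"  # must be set before importing CUDA libraries

import cudf  # RAPIDS now sees only GPU 1 (exposed to it as device 0)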

Platform Requirements

  • Platform: Kaggle dual Tesla T4 (30GB VRAM total)
  • Python: 3.11+
  • CUDA: 12.5
  • Model Range: 1B-70B parameters
  • Recommended: 1B-5B parameters (Q4_K_M quantization)

16 Comprehensive Tutorials

llamatelemetry includes 16 production-ready Jupyter notebooks organized by learning path:

Foundation (Beginner) - 65 minutes

  1. 01-quickstart (10 min) - Basic inference setup
  2. 02-llama-server-setup (15 min) - Server configuration
  3. 03-multi-gpu-inference (20 min) - Dual GPU tensor parallelism
  4. 04-gguf-quantization (20 min) - 29 quantization types

Integration (Intermediate) - 60 minutes

  1. 05-unsloth-integration (30 min) - Fine-tuning to deployment
  2. 06-split-gpu-graphistry (30 min) - Concurrent LLM + analytics

Advanced Applications - 65 minutes

  1. 07-knowledge-graph-extraction (35 min) - LLM-powered graphs
  2. 08-document-network-analysis (30 min) - Document similarity

Optimization & Production - 120 minutes

  1. 09-large-models-kaggle (35 min) - 70B models on dual T4
  2. 10-complete-workflow (45 min) - End-to-end pipeline
  3. 11-gguf-neural-network-visualization (40 min) - 929-node architecture viz

Deep Dive - 55 minutes

  1. 12-gguf-attention-mechanism-explorer (25 min) - Q-K-V decomposition
  2. 13-gguf-token-embedding-visualizer (30 min) - 3D UMAP embedding space

Observability Trilogy ⭐ NEW - 120 minutes

  1. 14-opentelemetry-llm-observability (45 min) - Full OpenTelemetry integration
  2. 15-real-time-performance-monitoring (30 min) - Live Plotly dashboards
  3. 16-production-observability-stack (45 min) - Complete production stack

Explore All Tutorials


Learning Paths

Choose your learning path based on your goals:

Path 1: Quick Start (1 hour)

01 → 02 → 03

Outcome: Deploy and run LLM inference on Kaggle T4

Path 2: Full Foundation (3 hours)

01 → 02 → 03 → 04 → 05 → 06 → 10

Outcome: Deploy production-ready LLM systems

Path 3: Production Observability (2.5 hours)

01 → 03 → 14 → 15 → 16

Outcome: Build complete production observability stack

Path 4: Graph Analytics (2.5 hours)

01 → 03 → 06 → 07 → 08 → 11

Outcome: Build LLM-powered graph analytics applications

Path 5: Large Model Specialist (2 hours)

01 → 03 → 04 → 09

Outcome: Run 70B models on Kaggle dual T4


Architecture Overview

llamatelemetry uses a split-GPU architecture to maximize hardware utilization:

┌─────────────────────────────────────────────────────────┐
│                     Kaggle Dual T4                      │
├─────────────────────────┬───────────────────────────────┤
│      GPU 0 (15GB)       │         GPU 1 (15GB)          │
│                         │                               │
│  ┌──────────────────┐   │  ┌──────────────────────────┐ │
│  │  llama-server    │   │  │ RAPIDS cuGraph           │ │
│  │  + FlashAttn     │   │  │ + cuDF                   │ │
│  │  + GGUF Model    │   │  │ + Graphistry             │ │
│  └──────────────────┘   │  └──────────────────────────┘ │
│           │             │               │               │
│           ▼             │               ▼               │
│  ┌──────────────────┐   │  ┌──────────────────────────┐ │
│  │ OpenTelemetry    │───┼─▶│ Graph Visualization      │ │
│  │ Traces + Metrics │   │  │ + Interactive Dashboards │ │
│  └──────────────────┘   │  └──────────────────────────┘ │
└─────────────────────────┴───────────────────────────────┘
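
The trace-to-graph arrow in the diagram amounts to flattening finished spans into a parent-to-child edge list. A sketch of hypothetical glue code (assumes OpenTelemetry ReadableSpan objects; the packaged TracesGraphistry class does this for you):

import pandas as pd

def spans_to_edges(spans) -> pd.DataFrame:
    """Build a parent -> child edge list from finished OpenTelemetry spans."""
    rows = []
    for s in spans:           # s: opentelemetry.sdk.trace.ReadableSpan
        if s.parent is None:  # root span: no incoming edge
            continue
        rows.append({
            "src": format(s.parent.span_id, "016x"),
            "dst": format(s.get_span_context().span_id, "016x"),
            "name": s.name,
        })
    return pd.DataFrame(rows)

# g = graphistry.edges(spans_to_edges(collected_spans), "src", "dst").plot()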

Learn More About Architecture


Performance

Optimized for small GGUF models (1B-5B parameters) on Kaggle dual T4:

Model          Quantization   VRAM (GPU 0)   Tokens/sec   Context
Gemma 3-1B     Q4_K_M         ~1.2 GB        ~85          8192
Gemma 3-4B     Q4_K_M         ~2.5 GB        ~45          8192
Llama 3.2-3B   Q4_K_M         ~2.0 GB        ~60          8192
Phi 3.5-Mini   Q4_K_M         ~2.8 GB        ~50          8192
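
The VRAM column is roughly predictable: Q4_K_M averages about 4.85 bits per weight, so the weights alone cost params × 4.85 / 8 bytes, with KV cache and CUDA buffers on top. A quick estimate (the 4.85 figure is an approximation):

def q4_k_m_weight_gb(params_billion: float, bits_per_weight: float = 4.85) -> float:
    """Approximate weight-only footprint of a Q4_K_M GGUF, in GB."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

print(f"1B: ~{q4_k_m_weight_gb(1.0):.1f} GB")  # ~0.6 GB; ~1.2 GB observed with KV cache
print(f"4B: ~{q4_k_m_weight_gb(4.0):.1f} GB")  # ~2.4 GB; ~2.5 GB observed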

View Full Benchmarks


Getting Started

Ready to dive in? Work through the Quick Start Example above, then pick a learning path from the tutorials.

What's New in v0.1.0

Observability Trilogy ⭐ NEW

Three new notebooks (14-16) deliver production-grade observability:

  • Notebook 14: Full OpenTelemetry integration with semantic conventions
  • Notebook 15: Real-time GPU monitoring with live Plotly dashboards
  • Notebook 16: Complete production stack with Graphistry + multi-layer telemetry

Key Features

  • OpenTelemetry traces with distributed context propagation
  • GPU-aware metrics (VRAM, temperature, power, utilization; sketch below)
  • llama.cpp server metrics integration
  • Graph-based trace visualization
  • Real-time performance dashboards
  • Multi-layer telemetry collection
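
For example, the GPU-aware metrics above can be modeled as OpenTelemetry observable gauges fed by PyNVML; a sketch pairing the two (the metric name and attributes are illustrative):

import pynvml
from opentelemetry import metrics
from opentelemetry.metrics import CallbackOptions, Observation

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

def read_temperature(options: CallbackOptions):
    # Invoked by the SDK on every metric collection cycle
    temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
    yield Observation(temp, {"gpu.index": 0})

meter = metrics.get_meter("llm-service")
meter.create_observable_gauge("gpu.temperature", callbacks=[read_temperature], unit="Cel")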

Explore Observability Features


System Requirements

  • Platform: Kaggle Notebooks
  • GPUs: 2× Tesla T4 (15GB VRAM each)
  • CUDA: 12.x (pre-installed)
  • Python: 3.11+
  • Internet: Required for initial setup

Minimum

  • GPUs: 1× Tesla T4 (15GB VRAM)
  • CUDA: 12.0+
  • Python: 3.10+

Community & Support


Acknowledgments

llamatelemetry builds on these excellent projects: llama.cpp, OpenTelemetry, RAPIDS cuGraph, Graphistry, Plotly, and Unsloth.


License

MIT License - See LICENSE for details.


Ready to get started? Follow the Quick Start Guide or jump into Tutorial 01: Quick Start.