# llamatelemetry v0.1.0
CUDA-first OpenTelemetry Python SDK for LLM inference observability and explainability
llamatelemetry combines high-performance GGUF inference, multi-GPU execution, production-grade observability, and interactive visualization into a complete LLM observability solution optimized for Kaggle's dual Tesla T4 GPUs.
## What You Get

llamatelemetry provides production-ready observability for LLM inference workloads:

- LLM request tracing with semantic attributes and distributed context propagation
- GPU-aware metrics (latency, tokens/sec, VRAM usage, temperature, power draw)
- Split-GPU workflow (GPU 0: inference; GPU 1: analytics/visualization)
- Graph-based trace visualization with Graphistry interactive dashboards
- Real-time performance monitoring with live Plotly dashboards
- Production observability stack with multi-layer telemetry collection
- 16 comprehensive tutorials covering foundation-to-production workflows

## Quick Links

- **Quick Start**: Get running in 5 minutes on Kaggle dual T4
- **Installation**: Install llamatelemetry and dependencies
- **Observability Trilogy** (NEW): Production observability stack (notebooks 14-16)
- **16 Tutorials**: Comprehensive tutorials (~8 hours total)
- **API Reference**: Complete API documentation
- **Performance**: Benchmarks and optimization
## Key Features
### CUDA Inference

- llama.cpp GGUF inference with 29 quantization types
- NCCL-aware multi-GPU execution (dual T4 tensor parallelism; see the sketch after this list)
- FlashAttention v2 and KV-cache optimization
- Continuous batching of concurrent requests
- Support for 1B-70B parameter models
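For dual-GPU tensor parallelism, the same `ServerManager` call from the quick start below accepts a two-way `tensor_split`. A minimal sketch, assuming the Gemma model downloaded in the quick start; the 50/50 ratio is illustrative and can be tuned per model:

```python
# Split the model across both T4s via tensor parallelism.
# Sketch only: the model path and 50/50 split are illustrative.
from llamatelemetry.server import ServerManager

server = ServerManager()
server.start_server(
    model_path="/kaggle/working/models/gemma-3-1b-it-Q4_K_M.gguf",
    gpu_layers=99,           # offload all layers to GPU
    tensor_split="0.5,0.5",  # place half the weights on each GPU
    flash_attn=1,            # enable FlashAttention v2 kernels
)
```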
### LLM Observability

- OpenTelemetry traces, metrics, and logs
- GPU-native resource detection
- OTLP export (gRPC/HTTP) to Grafana, Jaeger, or Datadog
- llama.cpp `/metrics` endpoint integration
- PyNVML GPU monitoring (see the sketch after this list)
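The GPU-aware metrics above are read through NVML. As a rough sketch of what that sampling looks like, using the standard `pynvml` bindings directly rather than llamatelemetry's internal collector:

```python
# Sample GPU 0 (the inference GPU) with the standard pynvml bindings.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
power_mw = pynvml.nvmlDeviceGetPowerUsage(handle)  # reported in milliwatts
util = pynvml.nvmlDeviceGetUtilizationRates(handle)

print(f"VRAM: {mem.used / 1024**3:.2f} / {mem.total / 1024**3:.2f} GiB")
print(f"Temp: {temp} C, power: {power_mw / 1000:.1f} W, util: {util.gpu}%")

pynvml.nvmlShutdown()
```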
### Visualization & Analytics

- RAPIDS cuGraph + Graphistry integration (see the sketch after this list)
- Interactive trace graphs (2D network visualization)
- Real-time Plotly dashboards (2D/3D)
- Knowledge graph extraction
- Neural network architecture visualization
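As a flavor of the Graphistry integration, here is a hedged sketch that plots a tiny hand-built span graph with the standard PyGraphistry API. The edge list is hypothetical; in the tutorials, edges are derived from collected OpenTelemetry spans (parent span → child span):

```python
# Plot a toy span graph on the GPU with cuDF + PyGraphistry.
import cudf
import graphistry

# Credentials omitted; see the Graphistry docs for register() options.
graphistry.register(api=3, protocol="https", server="hub.graphistry.com")

# Hypothetical parent -> child span edges.
edges = cudf.DataFrame({
    "src": ["root", "root", "chat_completion"],
    "dst": ["chat_completion", "gpu_sampler", "llama_server_request"],
})

graphistry.bind(source="src", destination="dst").edges(edges).plot()
```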
## Quick Start Example

**Basic inference:**

```python
# Install
!pip install -q --no-cache-dir --force-reinstall \
    git+https://github.com/llamatelemetry/llamatelemetry.git@v0.1.0

# Download a small GGUF model
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="unsloth/gemma-3-1b-it-GGUF",
    filename="gemma-3-1b-it-Q4_K_M.gguf",
    local_dir="/kaggle/working/models",
)

# Start the llama-server
from llamatelemetry.server import ServerManager

server = ServerManager()
server.start_server(
    model_path=model_path,
    gpu_layers=99,
    tensor_split="1.0,0.0",  # GPU 0 only
    flash_attn=1,
)

# Run inference
from llamatelemetry.api.client import LlamaCppClient

client = LlamaCppClient(base_url="http://127.0.0.1:8080")
response = client.chat.create(
    messages=[{"role": "user", "content": "What is CUDA?"}],
    max_tokens=80,
)
print(response.choices[0].message.content)
```
**With OpenTelemetry instrumentation:**

```python
from llamatelemetry.telemetry import setup_telemetry
from llamatelemetry.api.client import LlamaCppClient

# Initialize telemetry
tracer, meter = setup_telemetry(
    service_name="llm-service",
    otlp_endpoint="http://localhost:4317",
)

# The client automatically instruments requests
client = LlamaCppClient(base_url="http://127.0.0.1:8080")

with tracer.start_as_current_span("chat_completion") as span:
    response = client.chat.create(
        messages=[{"role": "user", "content": "Explain quantum computing"}],
        max_tokens=200,
    )
    span.set_attribute("llm.response_tokens", response.usage.completion_tokens)
```
**Split-GPU workflow (inference on GPU 0, analytics on GPU 1):**

```python
# GPU 0: LLM inference
server = ServerManager()
server.start_server(
    model_path=model_path,
    gpu_layers=99,
    tensor_split="1.0,0.0",  # GPU 0 only
    flash_attn=1,
)

# GPU 1: graph visualization
import cudf
import graphistry

graphistry.register(api=3, protocol="https", server="hub.graphistry.com")

# Visualize traces on GPU 1 (collected_spans: spans gathered by the telemetry pipeline)
from llamatelemetry.graphistry import TracesGraphistry

g = TracesGraphistry(spans=collected_spans)
g.plot(render=True)
```
## Platform Requirements

- Platform: Kaggle dual Tesla T4 (30GB VRAM total)
- Python: 3.11+
- CUDA: 12.5
- Model range: 1B-70B parameters
- Recommended: 1B-5B parameters with Q4_K_M quantization (see the sizing sketch below)
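As a back-of-the-envelope sizing check: Q4_K_M stores weights at roughly 4.8 bits each (an approximation; exact GGUF sizes vary by architecture), plus headroom for the KV cache and CUDA context:

```python
# Rough VRAM estimate for a Q4_K_M model. The ~4.8 bits/weight figure and
# the flat overhead term are approximations, not exact GGUF accounting.
def estimate_vram_gb(n_params_billion: float, bits_per_weight: float = 4.8) -> float:
    weights_gb = n_params_billion * 1e9 * bits_per_weight / 8 / 1e9
    overhead_gb = 0.5  # KV cache at modest context + CUDA context, very rough
    return weights_gb + overhead_gb

for size in (1, 3, 5):
    print(f"{size}B @ Q4_K_M: ~{estimate_vram_gb(size):.1f} GB")
# 1B -> ~1.1 GB, in line with the ~1.2 GB measured for Gemma 3-1B
# in the Performance table below
```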
## 16 Comprehensive Tutorials
llamatelemetry includes 16 production-ready Jupyter notebooks organized by learning path:
### Foundation (Beginner) - 65 minutes
- 01-quickstart (10 min) - Basic inference setup
- 02-llama-server-setup (15 min) - Server configuration
- 03-multi-gpu-inference (20 min) - Dual GPU tensor parallelism
- 04-gguf-quantization (20 min) - 29 quantization types
### Integration (Intermediate) - 60 minutes
- 05-unsloth-integration (30 min) - Fine-tuning to deployment
- 06-split-gpu-graphistry (30 min) - Concurrent LLM + analytics
### Advanced Applications - 65 minutes
- 07-knowledge-graph-extraction (35 min) - LLM-powered graphs
- 08-document-network-analysis (30 min) - Document similarity
### Optimization & Production - 120 minutes
- 09-large-models-kaggle (35 min) - 70B models on dual T4
- 10-complete-workflow (45 min) - End-to-end pipeline
- 11-gguf-neural-network-visualization (40 min) - 929-node architecture viz
### Deep Dive - 55 minutes
- 12-gguf-attention-mechanism-explorer (25 min) - Q-K-V decomposition
- 13-gguf-token-embedding-visualizer (30 min) - 3D UMAP embedding space
### Observability Trilogy ⭐ NEW - 120 minutes
- 14-opentelemetry-llm-observability (45 min) - Full OpenTelemetry integration
- 15-real-time-performance-monitoring (30 min) - Live Plotly dashboards
- 16-production-observability-stack (45 min) - Complete production stack
## Learning Paths

Choose a path based on your goals:

### Path 1: Quick Start (1 hour)

Outcome: Deploy and run LLM inference on Kaggle T4

### Path 2: Full Foundation (3 hours)

Outcome: Deploy production-ready LLM systems

### Path 3: Observability Focus ⭐ RECOMMENDED (2.5 hours)

Outcome: Build a complete production observability stack

### Path 4: Graph Analytics (2.5 hours)

Outcome: Build LLM-powered graph analytics applications

### Path 5: Large Model Specialist (2 hours)

Outcome: Run 70B models on Kaggle dual T4
## Architecture Overview

llamatelemetry uses a split-GPU architecture to maximize hardware utilization:

```
┌─────────────────────────────────────────────────────────────┐
│                       Kaggle Dual T4                         │
├─────────────────────────┬───────────────────────────────────┤
│      GPU 0 (15GB)       │           GPU 1 (15GB)            │
│                         │                                   │
│  ┌──────────────────┐   │   ┌───────────────────────────┐   │
│  │   llama-server   │   │   │      RAPIDS cuGraph       │   │
│  │   + FlashAttn    │   │   │          + cuDF           │   │
│  │   + GGUF Model   │   │   │       + Graphistry        │   │
│  └──────────────────┘   │   └───────────────────────────┘   │
│           │             │                 │                 │
│           ▼             │                 ▼                 │
│  ┌──────────────────┐   │   ┌───────────────────────────┐   │
│  │  OpenTelemetry   │───┼──▶│    Graph Visualization    │   │
│  │ Traces + Metrics │   │   │ + Interactive Dashboards  │   │
│  └──────────────────┘   │   └───────────────────────────┘   │
└─────────────────────────┴───────────────────────────────────┘
```
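In practice the split is enforced per process: llama-server is confined to GPU 0 via `tensor_split`, while the analytics process is pointed at GPU 1. A sketch of the analytics side, using the standard `CUDA_VISIBLE_DEVICES` mechanism rather than any llamatelemetry-specific API:

```python
# Pin the analytics stack to GPU 1 before any CUDA library initializes;
# CUDA_VISIBLE_DEVICES must be set before importing cudf, which creates
# its CUDA context when first used.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1"  # analytics GPU only

import cudf  # allocates on what it sees as device 0 (physical GPU 1)

# Toy example: aggregate span latencies on the analytics GPU.
df = cudf.DataFrame({"span_id": [1, 2, 3], "latency_ms": [12.5, 48.0, 9.7]})
print(df["latency_ms"].mean())
```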
## Performance
Optimized for small GGUF models (1B-5B parameters) on Kaggle dual T4:
| Model | Quantization | VRAM (GPU 0) | Tokens/sec | Context |
|---|---|---|---|---|
| Gemma 3-1B | Q4_K_M | ~1.2GB | ~85 | 8192 |
| Gemma 3-4B | Q4_K_M | ~2.5GB | ~45 | 8192 |
| Llama 3.2-3B | Q4_K_M | ~2.0GB | ~60 | 8192 |
| Phi 3.5-Mini | Q4_K_M | ~2.8GB | ~50 | 8192 |
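Figures like the tokens/sec column can be sanity-checked with a client-side timing loop. A minimal sketch using the `LlamaCppClient` from the quick start; server-side counters (e.g. the `/metrics` endpoint) are more precise because they exclude network and client overhead:

```python
# Rough client-side throughput measurement. Network and client overhead
# are included, so expect slightly lower numbers than server-side counters.
import time
from llamatelemetry.api.client import LlamaCppClient

client = LlamaCppClient(base_url="http://127.0.0.1:8080")

start = time.perf_counter()
response = client.chat.create(
    messages=[{"role": "user", "content": "Summarize CUDA in one paragraph."}],
    max_tokens=128,
)
elapsed = time.perf_counter() - start

generated = response.usage.completion_tokens
print(f"{generated} tokens in {elapsed:.2f}s -> {generated / elapsed:.1f} tok/s")
```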
## Getting Started

Ready to dive in? Start with our comprehensive guides:

- **Installation**: Install llamatelemetry in your environment
- **Tutorials**: Learn with 16 hands-on notebooks
- **Architecture**: Understand the split-GPU design
- **API Docs**: Complete Python API reference
## What's New in v0.1.0

### Observability Trilogy ⭐ NEW

Three new notebooks (14-16) deliver production-grade observability:
- Notebook 14: Full OpenTelemetry integration with semantic conventions
- Notebook 15: Real-time GPU monitoring with live Plotly dashboards
- Notebook 16: Complete production stack with Graphistry + multi-layer telemetry
### Key Features
- OpenTelemetry traces with distributed context propagation
- GPU-aware metrics (VRAM, temperature, power, utilization)
- llama.cpp server metrics integration
- Graph-based trace visualization
- Real-time performance dashboards
- Multi-layer telemetry collection
Explore Observability Features
## System Requirements

### Recommended (Kaggle)
- Platform: Kaggle Notebooks
- GPUs: 2× Tesla T4 (15GB VRAM each)
- CUDA: 12.x (pre-installed)
- Python: 3.11+
- Internet: Required for initial setup
### Minimum
- GPUs: 1× Tesla T4 (15GB VRAM)
- CUDA: 12.0+
- Python: 3.10+
## Community & Support
- GitHub: llamatelemetry/llamatelemetry
- Issues: Report bugs and feature requests
- Discussions: Ask questions and share ideas
- Changelog: CHANGELOG.md
## Acknowledgments
llamatelemetry builds on these excellent projects:
- llama.cpp - High-performance GGUF inference
- OpenTelemetry - Observability framework
- Graphistry - GPU-accelerated graph visualization
- RAPIDS - GPU-accelerated data science
- Unsloth - Fast LLM fine-tuning
## License
MIT License - See LICENSE for details.
Ready to get started? Follow the Quick Start Guide or jump into Tutorial 01: Quick Start.