# llamatelemetry v0.1.0
CUDA-first OpenTelemetry Python SDK for LLM inference observability and explainability
llamatelemetry combines high-performance GGUF inference, multi-GPU execution, production-grade observability, and interactive visualization into a complete LLM observability solution optimized for Kaggle's dual Tesla T4 GPUs.
## What You Get

llamatelemetry provides production-ready observability for LLM inference workloads:

- LLM request tracing with semantic attributes and distributed context propagation
- GPU-aware metrics (latency, tokens/sec, VRAM usage, temperature, power draw)
- Split-GPU workflow (GPU 0: inference; GPU 1: analytics/visualization)
- Graph-based trace visualization with Graphistry interactive dashboards
- Real-time performance monitoring with live Plotly dashboards
- Production observability stack with multi-layer telemetry collection
- 16 comprehensive tutorials covering foundation-to-production workflows

## Quick Links

- **Quick Start**: Get running in 5 minutes on Kaggle dual T4
- **Installation**: Install llamatelemetry and dependencies
- **Observability Trilogy** (NEW): Production observability stack (notebooks 14-16)
- **16 Tutorials**: Comprehensive tutorials (~8 hours total)
- **API Reference**: Complete API documentation
- **Performance**: Benchmarks and optimization
## Key Features
### CUDA Inference

- llama.cpp GGUF inference with 29 quantization types
- NCCL-aware multi-GPU execution (dual T4 tensor parallelism; see the sketch after this list)
- FlashAttention v2 and KV-cache optimization
- Continuous batching of concurrent requests
- Support for 1B-70B parameter models
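For dual-GPU tensor parallelism, the same `ServerManager` call from the quick start below accepts a two-way `tensor_split`. A minimal sketch, assuming the Gemma model downloaded in the quick start; the 50/50 ratio is illustrative and can be tuned per model:

```python
# Split the model across both T4s via tensor parallelism.
# Sketch only: the model path and 50/50 split are illustrative.
from llamatelemetry.server import ServerManager

server = ServerManager()
server.start_server(
    model_path="/kaggle/working/models/gemma-3-1b-it-Q4_K_M.gguf",
    gpu_layers=99,           # offload all layers to GPU
    tensor_split="0.5,0.5",  # place half the weights on each GPU
    flash_attn=1,            # enable FlashAttention v2 kernels
)
```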
### LLM Observability

- OpenTelemetry traces, metrics, and logs
- GPU-native resource detection
- OTLP export (gRPC/HTTP) to Grafana, Jaeger, or Datadog
- llama.cpp `/metrics` endpoint integration
- PyNVML GPU monitoring (see the sketch after this list)
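The GPU-aware metrics above are read through NVML. As a rough sketch of what that sampling looks like, using the standard `pynvml` bindings directly rather than llamatelemetry's internal collector:

```python
# Sample GPU 0 (the inference GPU) with the standard pynvml bindings.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
power_mw = pynvml.nvmlDeviceGetPowerUsage(handle)  # reported in milliwatts
util = pynvml.nvmlDeviceGetUtilizationRates(handle)

print(f"VRAM: {mem.used / 1024**3:.2f} / {mem.total / 1024**3:.2f} GiB")
print(f"Temp: {temp} C, power: {power_mw / 1000:.1f} W, util: {util.gpu}%")

pynvml.nvmlShutdown()
```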
### Visualization & Analytics

- RAPIDS cuGraph + Graphistry integration (see the sketch after this list)
- Interactive trace graphs (2D network visualization)
- Real-time Plotly dashboards (2D/3D)
- Knowledge graph extraction
- Neural network architecture visualization
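As a flavor of the Graphistry integration, here is a hedged sketch that plots a tiny hand-built span graph with the standard PyGraphistry API. The edge list is hypothetical; in the tutorials, edges are derived from collected OpenTelemetry spans (parent span → child span):

```python
# Plot a toy span graph on the GPU with cuDF + PyGraphistry.
import cudf
import graphistry

# Credentials omitted; see the Graphistry docs for register() options.
graphistry.register(api=3, protocol="https", server="hub.graphistry.com")

# Hypothetical parent -> child span edges.
edges = cudf.DataFrame({
    "src": ["root", "root", "chat_completion"],
    "dst": ["chat_completion", "gpu_sampler", "llama_server_request"],
})

graphistry.bind(source="src", destination="dst").edges(edges).plot()
```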
## Quick Start Example

**Basic inference:**

```python
# Install
!pip install -q --no-cache-dir --force-reinstall \
    git+https://github.com/llamatelemetry/llamatelemetry.git@v0.1.0

# Download a small GGUF model
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="unsloth/gemma-3-1b-it-GGUF",
    filename="gemma-3-1b-it-Q4_K_M.gguf",
    local_dir="/kaggle/working/models",
)

# Start the llama-server
from llamatelemetry.server import ServerManager

server = ServerManager()
server.start_server(
    model_path=model_path,
    gpu_layers=99,
    tensor_split="1.0,0.0",  # GPU 0 only
    flash_attn=1,
)

# Run inference
from llamatelemetry.api.client import LlamaCppClient

client = LlamaCppClient(base_url="http://127.0.0.1:8080")
response = client.chat.create(
    messages=[{"role": "user", "content": "What is CUDA?"}],
    max_tokens=80,
)
print(response.choices[0].message.content)
```
**With OpenTelemetry instrumentation:**

```python
from llamatelemetry.telemetry import setup_telemetry
from llamatelemetry.api.client import LlamaCppClient

# Initialize telemetry
tracer, meter = setup_telemetry(
    service_name="llm-service",
    otlp_endpoint="http://localhost:4317",
)

# The client automatically instruments requests
client = LlamaCppClient(base_url="http://127.0.0.1:8080")

with tracer.start_as_current_span("chat_completion") as span:
    response = client.chat.create(
        messages=[{"role": "user", "content": "Explain quantum computing"}],
        max_tokens=200,
    )
    span.set_attribute("llm.response_tokens", response.usage.completion_tokens)
```
**Split-GPU workflow (inference on GPU 0, analytics on GPU 1):**

```python
# GPU 0: LLM inference
server = ServerManager()
server.start_server(
    model_path=model_path,
    gpu_layers=99,
    tensor_split="1.0,0.0",  # GPU 0 only
    flash_attn=1,
)

# GPU 1: graph visualization
import cudf
import graphistry

graphistry.register(api=3, protocol="https", server="hub.graphistry.com")

# Visualize traces on GPU 1 (collected_spans: spans gathered by the telemetry pipeline)
from llamatelemetry.graphistry import TracesGraphistry

g = TracesGraphistry(spans=collected_spans)
g.plot(render=True)
```
## Platform Requirements

- Platform: Kaggle dual Tesla T4 (30GB VRAM total)
- Python: 3.11+
- CUDA: 12.5
- Model range: 1B-70B parameters
- Recommended: 1B-5B parameters with Q4_K_M quantization (see the sizing sketch below)
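As a back-of-the-envelope sizing check: Q4_K_M stores weights at roughly 4.8 bits each (an approximation; exact GGUF sizes vary by architecture), plus headroom for the KV cache and CUDA context:

```python
# Rough VRAM estimate for a Q4_K_M model. The ~4.8 bits/weight figure and
# the flat overhead term are approximations, not exact GGUF accounting.
def estimate_vram_gb(n_params_billion: float, bits_per_weight: float = 4.8) -> float:
    weights_gb = n_params_billion * 1e9 * bits_per_weight / 8 / 1e9
    overhead_gb = 0.5  # KV cache at modest context + CUDA context, very rough
    return weights_gb + overhead_gb

for size in (1, 3, 5):
    print(f"{size}B @ Q4_K_M: ~{estimate_vram_gb(size):.1f} GB")
# 1B -> ~1.1 GB, in line with the ~1.2 GB measured for Gemma 3-1B
# in the Performance table below
```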
## 16 Comprehensive Tutorials
llamatelemetry includes 16 production-ready Jupyter notebooks organized by learning path:
### Foundation (Beginner) - 65 minutes
- 01-quickstart (10 min) - Basic inference setup
- 02-llama-server-setup (15 min) - Server configuration
- 03-multi-gpu-inference (20 min) - Dual GPU tensor parallelism
- 04-gguf-quantization (20 min) - 29 quantization types
### Integration (Intermediate) - 60 minutes
- 05-unsloth-integration (30 min) - Fine-tuning to deployment
- 06-split-gpu-graphistry (30 min) - Concurrent LLM + analytics
### Advanced Applications - 65 minutes
- 07-knowledge-graph-extraction (35 min) - LLM-powered graphs
- 08-document-network-analysis (30 min) - Document similarity
### Optimization & Production - 120 minutes
- 09-large-models-kaggle (35 min) - 70B models on dual T4
- 10-complete-workflow (45 min) - End-to-end pipeline
- 11-gguf-neural-network-visualization (40 min) - 929-node architecture viz
### Deep Dive - 55 minutes
- 12-gguf-attention-mechanism-explorer (25 min) - Q-K-V decomposition
- 13-gguf-token-embedding-visualizer (30 min) - 3D UMAP embedding space
### Observability Trilogy ⭐ NEW - 120 minutes
- 14-opentelemetry-llm-observability (45 min) - Full OpenTelemetry integration
- 15-real-time-performance-monitoring (30 min) - Live Plotly dashboards
- 16-production-observability-stack (45 min) - Complete production stack
## Learning Paths

Choose a path based on your goals:

### Path 1: Quick Start (1 hour)

Outcome: Deploy and run LLM inference on Kaggle T4

### Path 2: Full Foundation (3 hours)

Outcome: Deploy production-ready LLM systems

### Path 3: Observability Focus ⭐ RECOMMENDED (2.5 hours)

Outcome: Build a complete production observability stack

### Path 4: Graph Analytics (2.5 hours)

Outcome: Build LLM-powered graph analytics applications

### Path 5: Large Model Specialist (2 hours)

Outcome: Run 70B models on Kaggle dual T4
## Architecture Overview

llamatelemetry uses a split-GPU architecture to maximize hardware utilization:

```
┌─────────────────────────────────────────────────────────────┐
│                       Kaggle Dual T4                         │
├─────────────────────────┬───────────────────────────────────┤
│      GPU 0 (15GB)       │           GPU 1 (15GB)            │
│                         │                                   │
│  ┌──────────────────┐   │   ┌───────────────────────────┐   │
│  │   llama-server   │   │   │      RAPIDS cuGraph       │   │
│  │   + FlashAttn    │   │   │          + cuDF           │   │
│  │   + GGUF Model   │   │   │       + Graphistry        │   │
│  └──────────────────┘   │   └───────────────────────────┘   │
│           │             │                 │                 │
│           ▼             │                 ▼                 │
│  ┌──────────────────┐   │   ┌───────────────────────────┐   │
│  │  OpenTelemetry   │───┼──▶│    Graph Visualization    │   │
│  │ Traces + Metrics │   │   │ + Interactive Dashboards  │   │
│  └──────────────────┘   │   └───────────────────────────┘   │
└─────────────────────────┴───────────────────────────────────┘
```
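In practice the split is enforced per process: llama-server is confined to GPU 0 via `tensor_split`, while the analytics process is pointed at GPU 1. A sketch of the analytics side, using the standard `CUDA_VISIBLE_DEVICES` mechanism rather than any llamatelemetry-specific API:

```python
# Pin the analytics stack to GPU 1 before any CUDA library initializes;
# CUDA_VISIBLE_DEVICES must be set before importing cudf, which creates
# its CUDA context when first used.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1"  # analytics GPU only

import cudf  # allocates on what it sees as device 0 (physical GPU 1)

# Toy example: aggregate span latencies on the analytics GPU.
df = cudf.DataFrame({"span_id": [1, 2, 3], "latency_ms": [12.5, 48.0, 9.7]})
print(df["latency_ms"].mean())
```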
## Performance
Optimized for small GGUF models (1B-5B parameters) on Kaggle dual T4:
| Model | Quantization | VRAM (GPU 0) | Tokens/sec | Context |
|---|---|---|---|---|
| Gemma 3-1B | Q4_K_M | ~1.2GB | ~85 | 8192 |
| Gemma 3-4B | Q4_K_M | ~2.5GB | ~45 | 8192 |
| Llama 3.2-3B | Q4_K_M | ~2.0GB | ~60 | 8192 |
| Phi 3.5-Mini | Q4_K_M | ~2.8GB | ~50 | 8192 |
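Figures like the tokens/sec column can be sanity-checked with a client-side timing loop. A minimal sketch using the `LlamaCppClient` from the quick start; server-side counters (e.g. the `/metrics` endpoint) are more precise because they exclude network and client overhead:

```python
# Rough client-side throughput measurement. Network and client overhead
# are included, so expect slightly lower numbers than server-side counters.
import time
from llamatelemetry.api.client import LlamaCppClient

client = LlamaCppClient(base_url="http://127.0.0.1:8080")

start = time.perf_counter()
response = client.chat.create(
    messages=[{"role": "user", "content": "Summarize CUDA in one paragraph."}],
    max_tokens=128,
)
elapsed = time.perf_counter() - start

generated = response.usage.completion_tokens
print(f"{generated} tokens in {elapsed:.2f}s -> {generated / elapsed:.1f} tok/s")
```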
## Getting Started

Ready to dive in? Start with our comprehensive guides:

- **Installation**: Install llamatelemetry in your environment
- **Tutorials**: Learn with 16 hands-on notebooks
- **Architecture**: Understand the split-GPU design
- **API Docs**: Complete Python API reference
## What's New in v0.1.0

### Observability Trilogy ⭐ NEW

Three new notebooks (14-16) deliver production-grade observability:
- Notebook 14: Full OpenTelemetry integration with semantic conventions
- Notebook 15: Real-time GPU monitoring with live Plotly dashboards
- Notebook 16: Complete production stack with Graphistry + multi-layer telemetry
### Key Features
- OpenTelemetry traces with distributed context propagation
- GPU-aware metrics (VRAM, temperature, power, utilization)
- llama.cpp server metrics integration
- Graph-based trace visualization
- Real-time performance dashboards
- Multi-layer telemetry collection
Explore Observability Features
## System Requirements

### Recommended (Kaggle)
- Platform: Kaggle Notebooks
- GPUs: 2× Tesla T4 (15GB VRAM each)
- CUDA: 12.x (pre-installed)
- Python: 3.11+
- Internet: Required for initial setup
### Minimum
- GPUs: 1× Tesla T4 (15GB VRAM)
- CUDA: 12.0+
- Python: 3.10+
## Community & Support
- GitHub: llamatelemetry/llamatelemetry
- Issues: Report bugs and feature requests
- Discussions: Ask questions and share ideas
- Changelog: CHANGELOG.md
## Acknowledgments
llamatelemetry builds on these excellent projects:
- llama.cpp - High-performance GGUF inference
- OpenTelemetry - Observability framework
- Graphistry - GPU-accelerated graph visualization
- RAPIDS - GPU-accelerated data science
- Unsloth - Fast LLM fine-tuning
## License
MIT License - See LICENSE for details.
Ready to get started? Follow the Quick Start Guide or jump into Tutorial 01: Quick Start.