Tutorial Notebooks¶
llamatelemetry includes 16 comprehensive Jupyter notebooks, spanning foundational concepts through production-ready observability workflows. Total time: 5.5 hours.
Quick Navigation¶
- Foundation: Beginner-friendly tutorials (65 minutes). Notebooks 01-04
- Integration: Intermediate integration tutorials (60 minutes). Notebooks 05-06
- Advanced: Advanced applications (65 minutes). Notebooks 07-08
- Production: Optimization & production (120 minutes). Notebooks 09-11
- Deep Dive: Model internals (80 minutes). Notebooks 12-13
- Observability ⭐ NEW: Production observability (120 minutes). Notebooks 14-16
Foundation (Beginner)¶
Total time: 65 minutes | Difficulty: Beginner
Perfect starting point for llamatelemetry. Learn the basics of GGUF inference, multi-GPU setup, and quantization.
01: Quick Start (10 min)¶
Basic inference setup with llamatelemetry on Kaggle dual T4.
- Install llamatelemetry v0.1.0
- Download GGUF model from HuggingFace
- Start llama-server with split-GPU configuration
- Run chat completions and streaming
- Monitor GPU memory usage
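The launch step above can be sketched as assembling the llama-server command line with split-GPU flags. This is a minimal illustration using standard llama.cpp flags; the model path and port are placeholder assumptions, not values from the notebook.

```python
# Hypothetical sketch: building a llama-server launch command for Kaggle's
# dual T4 setup. Model path and port are illustrative placeholders.
import shlex

def build_server_cmd(model_path, n_gpu_layers=99, tensor_split="0.5,0.5", port=8080):
    """Assemble the llama-server argument list (standard llama.cpp flags)."""
    return [
        "llama-server",
        "--model", model_path,
        "--n-gpu-layers", str(n_gpu_layers),  # offload all layers to GPU
        "--tensor-split", tensor_split,       # share weights across both T4s
        "--host", "127.0.0.1",
        "--port", str(port),
    ]

cmd = build_server_cmd("models/llama-3-8b-Q4_K_M.gguf")
print(shlex.join(cmd))
```

The resulting command could then be launched with `subprocess.Popen(cmd)` and polled until the server's health endpoint responds.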
02: Server Setup (15 min)¶
Advanced server configuration and optimization.
- ServerManager API deep dive
- GPU layer allocation strategies
- Context size optimization
- FlashAttention configuration
- Batch processing settings
03: Multi-GPU Inference (20 min)¶
Dual GPU tensor parallelism and split-GPU workflows.
- Tensor split configuration (`tensor_split`)
- Split-GPU architecture (GPU 0: LLM, GPU 1: Analytics)
- VRAM distribution across GPUs
- Performance comparison (single vs dual GPU)
- Best practices for Kaggle T4×2
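The VRAM-distribution idea can be illustrated with simple arithmetic: the `tensor_split` ratios determine what fraction of the model's weights each GPU holds. The model size below is an assumed example, not a measured figure.

```python
# Illustrative sketch: how tensor_split ratios divide model weights across
# GPUs. The 8 GB model size is an assumption for demonstration only.
def vram_per_gpu(model_gb, tensor_split):
    """Split a model's weight footprint proportionally to tensor_split."""
    total = sum(tensor_split)
    return [model_gb * share / total for share in tensor_split]

# An 8 GB quantized model split evenly across two T4s:
shares = vram_per_gpu(8.0, [0.5, 0.5])
print(shares)  # → [4.0, 4.0]
```

An uneven split such as `[0.6, 0.4]` is useful when GPU 1 must also hold analytics workloads, as in the split-GPU architecture above.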
04: GGUF Quantization (20 min)¶
Comprehensive guide to GGUF quantization types and selection.
- 29 quantization types overview
- K-Quants vs I-Quants
- Quality vs size tradeoffs
- VRAM estimation formulas
- Model selection guide
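A VRAM estimate of the kind covered in the notebook can be sketched as parameters × bytes-per-weight × overhead. The bits-per-weight figures below are approximate published values for llama.cpp quant types, and the 1.2× overhead factor (KV cache, activations) is an assumption for illustration.

```python
# Rough VRAM estimator for quantized GGUF models. Bits-per-weight values are
# approximate; the 1.2x overhead factor is an illustrative assumption.
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q5_K_M": 5.69, "Q8_0": 8.5, "F16": 16.0}

def estimate_vram_gb(n_params_billions, quant, overhead=1.2):
    """Estimate VRAM in GB: params (billions) x bytes/weight x overhead."""
    bytes_per_weight = BITS_PER_WEIGHT[quant] / 8
    return n_params_billions * bytes_per_weight * overhead

print(round(estimate_vram_gb(7, "Q4_K_M"), 2))   # ~5 GB for a 7B model
print(round(estimate_vram_gb(13, "Q4_K_M"), 2))  # a 13B model still fits on dual T4s
```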
Integration (Intermediate)¶
Total time: 60 minutes | Difficulty: Intermediate
Integrate llamatelemetry with Unsloth fine-tuning and Graphistry visualization.
05: Unsloth Integration (30 min)¶
Complete workflow from fine-tuning to deployment.
- Unsloth fine-tuning setup
- GGUF export with llama.cpp
- Model quantization
- Deployment with llamatelemetry
- End-to-end pipeline
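The conversion and quantization steps of the pipeline can be sketched as two llama.cpp command-line invocations; the file and directory names here are placeholders.

```python
# Sketch of the GGUF export + quantization steps using llama.cpp's CLI tools
# (convert_hf_to_gguf.py and llama-quantize). Paths are placeholders.
convert = ["python", "convert_hf_to_gguf.py", "finetuned-model/",
           "--outfile", "model-f16.gguf"]
quantize = ["llama-quantize", "model-f16.gguf", "model-Q4_K_M.gguf", "Q4_K_M"]

for step in (convert, quantize):
    print(" ".join(step))  # in practice, run each with subprocess.run(step)
```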
06: Split-GPU Graphistry (30 min)¶
Concurrent LLM inference and RAPIDS analytics.
- Split-GPU architecture setup
- RAPIDS cuGraph on GPU 1
- Graphistry interactive visualization
- Zero-copy data transfer
- Performance optimization
Advanced Applications¶
Total time: 65 minutes | Difficulty: Advanced
Build sophisticated LLM-powered applications with graph analytics.
07: Knowledge Graph Extraction (35 min)¶
LLM-powered knowledge graph construction.
- Entity and relationship extraction
- Knowledge graph schema design
- Graph construction with RAPIDS cuGraph
- Graphistry visualization
- Real-world use cases
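The extraction step can be sketched as prompting the LLM for (subject, relation, object) triples in JSON and parsing them into graph edges. The response below is a hard-coded stand-in for an actual model call, with invented example entities.

```python
# Minimal sketch of entity/relation extraction: the LLM is asked to emit
# triples as JSON; here the response is a hard-coded illustrative stand-in.
import json

llm_response = '''[
  {"subject": "Ada Lovelace", "relation": "worked_with", "object": "Charles Babbage"},
  {"subject": "Charles Babbage", "relation": "designed", "object": "Analytical Engine"}
]'''

triples = [(t["subject"], t["relation"], t["object"])
           for t in json.loads(llm_response)]
nodes = {s for s, _, _ in triples} | {o for _, _, o in triples}
print(len(nodes), "nodes,", len(triples), "edges")
```

From here, the edge list can be loaded into cuGraph or handed to Graphistry for interactive exploration.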
08: Document Network Analysis (30 min)¶
Document similarity networks and clustering.
- Document embedding with LLM
- Similarity computation
- Network construction
- Community detection
- Interactive exploration
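The similarity-network step can be sketched with NumPy: compute pairwise cosine similarity between document embeddings, then keep pairs above a threshold as edges. The embeddings and threshold below are toy values for illustration.

```python
# Sketch of similarity computation + network construction. The embeddings
# are toy 2-D vectors; real document embeddings would come from the LLM.
import numpy as np

emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])   # 3 toy "documents"
norm = emb / np.linalg.norm(emb, axis=1, keepdims=True)
sim = norm @ norm.T                                     # pairwise cosine similarity

threshold = 0.8
edges = [(i, j) for i in range(len(emb))
         for j in range(i + 1, len(emb)) if sim[i, j] > threshold]
print(edges)  # → [(0, 1)]  — only documents 0 and 1 are similar enough
```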
Optimization & Production¶
Total time: 120 minutes | Difficulty: Advanced-Expert
Optimize for large models and build production workflows.
09: Large Models (13B-70B) (35 min)¶
Run massive models on Kaggle dual T4 with advanced techniques.
- Tensor split optimization for 13B+ models
- Layer offloading strategies
- Quantization selection for large models
- Memory management
- Performance tuning
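A back-of-the-envelope version of the layer-offloading decision: given a VRAM budget and an estimated per-layer weight size, how many layers fit on each GPU? Both the per-layer size and the reserve for KV cache below are illustrative assumptions.

```python
# Back-of-the-envelope layer offloading for large models. The per-layer size
# and KV-cache reserve are illustrative assumptions, not measured values.
def max_gpu_layers(vram_gb, layer_gb, reserve_gb=2.0):
    """Layers that fit after reserving VRAM for KV cache and activations."""
    return int((vram_gb - reserve_gb) // layer_gb)

# A hypothetical large quantized model at ~0.5 GB per layer on 15 GB T4s:
per_gpu = max_gpu_layers(15.0, 0.5)
print(per_gpu, "layers per GPU,", per_gpu * 2, "across both T4s")
```

Layers that do not fit stay on the CPU; llama.cpp streams them as needed, trading throughput for capacity.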
10: Complete Workflow (45 min)¶
End-to-end production pipeline from training to deployment.
- Data preparation
- Unsloth fine-tuning
- GGUF conversion and quantization
- llamatelemetry deployment
- Monitoring and observability
11: GGUF Neural Network Visualization (40 min)¶
Groundbreaking architecture visualization with 929 nodes and 981 edges.
- GGUF file parsing
- Neural network graph extraction
- Layer-by-layer visualization
- Interactive exploration with Graphistry
- Architecture analysis
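The graph-extraction idea rests on the fact that GGUF tensor names such as `blk.0.attn_q.weight` encode the architecture, so nodes and edges can be recovered by parsing names. The tensor names below are a small illustrative sample, not a full model dump.

```python
# Sketch of neural-network graph extraction from GGUF tensor names. The
# names below are a small illustrative sample of the real naming scheme.
import re

tensor_names = [
    "token_embd.weight",
    "blk.0.attn_q.weight", "blk.0.attn_k.weight", "blk.0.ffn_up.weight",
    "blk.1.attn_q.weight", "blk.1.ffn_up.weight",
    "output.weight",
]

layers = {}
for name in tensor_names:
    m = re.match(r"blk\.(\d+)\.(\w+)\.weight", name)
    if m:
        layers.setdefault(int(m.group(1)), []).append(m.group(2))

# Chain consecutive transformer blocks into edges: blk.0 → blk.1 → …
edges = [(i, i + 1) for i in sorted(layers) if i + 1 in layers]
print(len(layers), "layers,", edges)
```

In the notebook, the same idea is applied to every tensor in the file (the `gguf` Python package can list them), yielding the full 929-node, 981-edge graph rendered in Graphistry.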
Deep Dive¶
Total time: 80 minutes | Difficulty: Expert
Explore model internals with interactive visualizations.
12: Attention Mechanism Explorer (25 min)¶
Q-K-V decomposition and visualization of 896 attention heads.
- Attention mechanism breakdown
- Query, Key, Value tensor analysis
- Multi-head attention visualization
- Attention pattern analysis
- Layer-wise comparison
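The mechanism explored above can be reproduced in a few lines of NumPy: scaled dot-product attention for a single head. Shapes and random inputs are toy values for illustration.

```python
# Toy single-head scaled dot-product attention with NumPy. Shapes are
# illustrative; real Q, K, V come from the model's weight matrices.
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_k = 4, 8
Q, K, V = (rng.standard_normal((seq_len, d_k)) for _ in range(3))

scores = Q @ K.T / np.sqrt(d_k)                    # (seq_len, seq_len)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)     # softmax: rows sum to 1
out = weights @ V                                  # attention output
print(out.shape)
```

The `weights` matrix is exactly what the notebook visualizes per head: which positions each token attends to.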
13: Token Embedding Visualizer (30 min)¶
3D UMAP embedding space exploration.
- Token embedding extraction
- Dimensionality reduction with UMAP
- 3D visualization with Plotly
- Semantic clustering analysis
- Interactive exploration
Observability Trilogy ⭐ NEW¶
Total time: 120 minutes | Difficulty: Intermediate-Expert
Production-grade observability with OpenTelemetry, GPU monitoring, and real-time dashboards.
14: OpenTelemetry LLM Observability (45 min)¶
Full OpenTelemetry integration with semantic conventions.
- Complete OpenTelemetry setup (traces, metrics, logs)
- LLM-specific semantic attributes
- Distributed context propagation
- OTLP export to popular backends
- Graph-based trace visualization with Graphistry
What you'll build: Observable LLM service with distributed tracing
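The LLM-specific semantic attributes can be sketched as a span-attribute dictionary following the OpenTelemetry GenAI semantic conventions. The attribute names below come from those published conventions; the values are made up for illustration.

```python
# Sketch of LLM span attributes per the OpenTelemetry GenAI semantic
# conventions. Attribute names are from the conventions; values are made up.
span_attributes = {
    "gen_ai.system": "llama.cpp",
    "gen_ai.request.model": "llama-3-8b-Q4_K_M",
    "gen_ai.request.temperature": 0.7,
    "gen_ai.usage.input_tokens": 128,
    "gen_ai.usage.output_tokens": 256,
}

total_tokens = (span_attributes["gen_ai.usage.input_tokens"]
                + span_attributes["gen_ai.usage.output_tokens"])
print(total_tokens)
```

In the notebook, a dictionary like this is attached to each request span (for example via `span.set_attributes(...)` on an OpenTelemetry span) before OTLP export.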
15: Real-time Performance Monitoring (30 min)¶
Live GPU monitoring with real-time Plotly dashboards.
- llama.cpp `/metrics` endpoint integration
- PyNVML GPU monitoring (VRAM, temperature, power)
- Real-time Plotly FigureWidget dashboards
- Live metric updates (1-second intervals)
- Multi-panel visualization layout
What you'll build: Live performance dashboard with GPU metrics
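Polling the `/metrics` endpoint amounts to parsing Prometheus-format text into name/value pairs. The sample payload and metric names below are illustrative, not an exact dump of llama-server's output.

```python
# Sketch of parsing llama-server's Prometheus-format /metrics text. The
# sample payload and metric names are illustrative, not an exact dump.
sample = """\
llamacpp:prompt_tokens_total 1024
llamacpp:tokens_predicted_total 2048
llamacpp:kv_cache_usage_ratio 0.37
"""

metrics = {}
for line in sample.splitlines():
    if line and not line.startswith("#"):    # skip HELP/TYPE comment lines
        name, value = line.rsplit(" ", 1)
        metrics[name] = float(value)

print(metrics["llamacpp:kv_cache_usage_ratio"])
```

In the live dashboard, the same parse runs on `requests.get("http://127.0.0.1:8080/metrics").text` once per second, and the values feed the Plotly FigureWidget panels alongside PyNVML readings.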
16: Production Observability Stack (45 min)¶
Complete production stack with multi-layer telemetry.
- Full OpenTelemetry + GPU monitoring integration
- Advanced Graphistry trace visualization
- Comprehensive Plotly dashboards (2D + 3D)
- Multi-layer telemetry collection
- Production deployment patterns
What you'll build: Complete production observability stack
Learning Paths¶
Choose your path based on your goals:
Path 1: Quick Start (1 hour)¶
Goal: Get running fast
Perfect for beginners who want to start running inference quickly.
Path 2: Full Foundation (3 hours)¶
Goal: Master the fundamentals
Complete foundation for production LLM systems.
Path 3: Observability Focus ⭐ RECOMMENDED (2.5 hours)¶
Goal: Build observable systems
The best path for building a production observability stack.
Path 4: Graph Analytics (2.5 hours)¶
Goal: LLM-powered analytics
Build graph-based LLM applications.
Path 5: Large Model Specialist (2 hours)¶
Goal: Run 70B models
Optimize for massive models on limited hardware.
Path 6: Complete Mastery (5.5 hours)¶
Goal: Master everything
Complete llamatelemetry mastery from basics to production.
Kaggle Notebooks¶
All tutorials are available as Kaggle notebooks:
- Repository: llamatelemetry/notebooks
- Format: Jupyter notebooks (`.ipynb`)
- Execution: Pre-executed with outputs
- Requirements: Kaggle T4 x2, Internet enabled
Running on Kaggle¶
- Go to Kaggle
- Create new notebook or upload tutorial
- Set Accelerator to GPU T4 x2
- Enable Internet
- Run cells sequentially
Prerequisites¶
Knowledge¶
- Python: Intermediate level
- CUDA/GPU: Basic understanding
- LLMs: Familiarity with language models
- OpenTelemetry: None required (taught in tutorials)
Hardware¶
- Recommended: Kaggle dual Tesla T4 (30GB VRAM total)
- Minimum: Single Tesla T4 (15GB VRAM)
Software¶
- Python 3.11+
- CUDA 12.x (pre-installed on Kaggle)
- Internet connection
Getting Help¶
- Documentation: Full documentation
- GitHub Issues: Report problems
- Discussions: Ask questions
- Troubleshooting: Common issues
Ready to learn? Start with Tutorial 01: Quick Start or jump to the Observability Trilogy.