
Tutorial Notebooks

llamatelemetry includes 16 comprehensive Jupyter notebooks, covering everything from foundation topics to production-ready observability workflows. Total time: approximately 8 hours.


Foundation (Beginner)

Total time: 65 minutes | Difficulty: Beginner

Perfect starting point for llamatelemetry. Learn the basics of GGUF inference, multi-GPU setup, and quantization.

01: Quick Start (10 min)

Basic inference setup with llamatelemetry on Kaggle dual T4.

  • Install llamatelemetry v0.1.0
  • Download GGUF model from HuggingFace
  • Start llama-server with split-GPU configuration
  • Run chat completions and streaming
  • Monitor GPU memory usage
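
The first call can be sketched with the standard library alone, assuming llama-server is already running locally. llama.cpp's server exposes an OpenAI-compatible API; the port (8080) and the temperature value below are assumptions, so adjust them to your server configuration:

```python
import json
import urllib.request

SERVER_URL = "http://127.0.0.1:8080/v1/chat/completions"  # default llama-server port

def build_chat_request(prompt: str, stream: bool = False) -> dict:
    """Build an OpenAI-style chat-completions payload."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
        "stream": stream,
    }

def send(payload: dict) -> dict:
    """POST the payload to the running server (requires llama-server up)."""
    req = urllib.request.Request(
        SERVER_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

payload = build_chat_request("Hello from a dual-T4 Kaggle session!")
```

Setting `"stream": True` switches the endpoint to server-sent events, which Tutorial 01 walks through in detail.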

View Tutorial 01

02: Server Setup (15 min)

Advanced server configuration and optimization.

  • ServerManager API deep dive
  • GPU layer allocation strategies
  • Context size optimization
  • FlashAttention configuration
  • Batch processing settings
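
The layer-allocation idea can be sketched as a back-of-the-envelope helper. This is a simplification that assumes a uniform per-layer weight footprint and a flat overhead for KV cache and CUDA context; the tutorial covers the full accounting:

```python
def layers_that_fit(vram_gb: float, model_gb: float, n_layers: int,
                    overhead_gb: float = 1.5) -> int:
    """Rough estimate of how many transformer layers fit on one GPU.

    overhead_gb is an assumed flat allowance for KV cache, scratch
    buffers, and the CUDA context.
    """
    per_layer_gb = model_gb / n_layers
    usable_gb = max(0.0, vram_gb - overhead_gb)
    return min(n_layers, int(usable_gb // per_layer_gb))

# e.g. a 26 GB model with 40 layers on a single 15 GB T4
n = layers_that_fit(15, 26, 40)
```

Anything that does not fit on the GPU stays on the CPU side, at a significant throughput cost.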

View Tutorial 02

03: Multi-GPU Inference (20 min)

Dual-GPU tensor parallelism and split-GPU workflows.

  • Tensor split configuration (tensor_split)
  • Split-GPU architecture (GPU 0: LLM, GPU 1: Analytics)
  • VRAM distribution across GPUs
  • Performance comparison (single vs dual GPU)
  • Best practices for Kaggle T4×2
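
Tensor-split proportions are just normalized VRAM capacities. A minimal sketch (llama-server accepts such proportions via its `--tensor-split` flag):

```python
def tensor_split(vram_per_gpu: list[float]) -> list[float]:
    """Turn per-GPU VRAM capacities into tensor_split proportions."""
    total = sum(vram_per_gpu)
    return [round(v / total, 3) for v in vram_per_gpu]

split = tensor_split([15, 15])  # equal dual-T4 split
```

On Kaggle's matched T4s an even `0.5,0.5` split is the usual choice; uneven splits matter when one GPU also hosts analytics workloads, as in the split-GPU architecture above.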

View Tutorial 03

04: GGUF Quantization (20 min)

Comprehensive guide to GGUF quantization types and selection.

  • 29 quantization types overview
  • K-Quants vs I-Quants
  • Quality vs size tradeoffs
  • VRAM estimation formulas
  • Model selection guide
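
The VRAM-estimation idea reduces to "parameters × bits per weight / 8, plus overhead". The bits-per-weight table below uses approximate effective values (exact sizes vary slightly by architecture), and the flat KV-cache allowance is an assumption:

```python
# Approximate effective bits-per-weight for common GGUF quantization types
BPW = {"Q8_0": 8.5, "Q5_K_M": 5.5, "Q4_K_M": 4.5, "IQ2_XS": 2.3}

def estimate_vram_gb(n_params_billion: float, quant: str,
                     kv_overhead_gb: float = 1.0) -> float:
    """Weights = params x bits/8 (GB), plus a flat KV-cache allowance."""
    weights_gb = n_params_billion * BPW[quant] / 8
    return weights_gb + kv_overhead_gb

vram = estimate_vram_gb(7, "Q4_K_M")  # ~4.9 GB for a 7B model
```

The same arithmetic explains why a 70B model needs aggressive I-Quants to fit in 30 GB of combined T4 VRAM.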

View Tutorial 04


Integration (Intermediate)

Total time: 60 minutes | Difficulty: Intermediate

Integrate llamatelemetry with Unsloth fine-tuning and Graphistry visualization.

05: Unsloth Integration (30 min)

Complete workflow from fine-tuning to deployment.

  • Unsloth fine-tuning setup
  • GGUF export with llama.cpp
  • Model quantization
  • Deployment with llamatelemetry
  • End-to-end pipeline

View Tutorial 05

06: Split-GPU Graphistry (30 min)

Concurrent LLM inference and RAPIDS analytics.

  • Split-GPU architecture setup
  • RAPIDS cuGraph on GPU 1
  • Graphistry interactive visualization
  • Zero-copy data transfer
  • Performance optimization

View Tutorial 06


Advanced Applications

Total time: 65 minutes | Difficulty: Advanced

Build sophisticated LLM-powered applications with graph analytics.

07: Knowledge Graph Extraction (35 min)

LLM-powered knowledge graph construction.

  • Entity and relationship extraction
  • Knowledge graph schema design
  • Graph construction with RAPIDS cuGraph
  • Graphistry visualization
  • Real-world use cases
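
The extraction step boils down to prompting the LLM for structured output and converting it into an edge list. The JSON shape below is hypothetical (the tutorial defines the actual schema), but the parsing pattern is representative:

```python
import json

# Hypothetical LLM extraction response; the tutorial's schema may differ.
llm_output = '''
{"entities": ["llama.cpp", "GGUF", "CUDA"],
 "relations": [["llama.cpp", "reads", "GGUF"],
               ["llama.cpp", "runs_on", "CUDA"]]}
'''

def to_edge_list(raw: str) -> list[tuple[str, str, str]]:
    """Parse (subject, relation, object) triples into (src, dst, label) edges."""
    data = json.loads(raw)
    nodes = set(data["entities"])
    edges = [(s, d, r) for s, r, d in data["relations"]]
    # keep only edges whose endpoints were actually extracted as entities
    return [(s, d, r) for s, d, r in edges if s in nodes and d in nodes]

edges = to_edge_list(llm_output)
```

The resulting edge list loads directly into a cuGraph graph (or a Graphistry plot) for the analysis steps that follow.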

View Tutorial 07

08: Document Network Analysis (30 min)

Document similarity networks and clustering.

  • Document embedding with LLM
  • Similarity computation
  • Network construction
  • Community detection
  • Interactive exploration
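
The network-construction step can be sketched in pure Python: cosine similarity between document embeddings, with an edge wherever similarity clears a threshold (the 0.8 threshold and 2-D embeddings here are illustrative; real embeddings are high-dimensional):

```python
from math import sqrt

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def similarity_edges(embeddings: dict, threshold: float = 0.8):
    """Undirected edges between document pairs above the threshold."""
    edges, ids = [], list(embeddings)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            s = cosine(embeddings[a], embeddings[b])
            if s >= threshold:
                edges.append((a, b, round(s, 3)))
    return edges

docs = {"d1": [1.0, 0.0], "d2": [0.9, 0.1], "d3": [0.0, 1.0]}
edges = similarity_edges(docs)
```

Community detection then runs on exactly this weighted edge list; the tutorial swaps the pairwise loop for GPU-accelerated matrix operations.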

View Tutorial 08


Optimization & Production

Total time: 120 minutes | Difficulty: Advanced-Expert

Optimize for large models and build production workflows.

09: Large Models (13B-70B) (35 min)

Run massive models on Kaggle dual T4 with advanced techniques.

  • Tensor split optimization for 13B+ models
  • Layer offloading strategies
  • Quantization selection for large models
  • Memory management
  • Performance tuning

View Tutorial 09

10: Complete Workflow (45 min)

End-to-end production pipeline from training to deployment.

  • Data preparation
  • Unsloth fine-tuning
  • GGUF conversion and quantization
  • llamatelemetry deployment
  • Monitoring and observability

View Tutorial 10

11: GGUF Neural Network Visualization (40 min)

Whole-architecture visualization of a GGUF model as an interactive graph of 929 nodes and 981 edges.

  • GGUF file parsing
  • Neural network graph extraction
  • Layer-by-layer visualization
  • Interactive exploration with Graphistry
  • Architecture analysis
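
Parsing starts with the fixed GGUF header: the `GGUF` magic, a version, a tensor count, and a metadata key-value count, all little-endian. A minimal sketch (the demo parses synthetic bytes; a real file would supply the first 24 bytes of the model):

```python
import struct

def parse_gguf_header(buf: bytes) -> dict:
    """Parse the fixed GGUF header fields (little-endian)."""
    magic, version, n_tensors, n_kv = struct.unpack_from("<4sIQQ", buf, 0)
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    return {"version": version, "tensors": n_tensors, "metadata_kv": n_kv}

# Synthetic header for demonstration; real usage: open(path, "rb").read(24)
fake = struct.pack("<4sIQQ", b"GGUF", 3, 291, 24)
header = parse_gguf_header(fake)
```

The metadata key-value pairs that follow the header carry the architecture details (layer count, head count, embedding size) from which the notebook builds its node-and-edge graph.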

View Tutorial 11


Deep Dive

Total time: 55 minutes | Difficulty: Expert

Explore model internals with interactive visualizations.

12: Attention Mechanism Explorer (25 min)

Q-K-V decomposition and visualization of 896 attention heads.

  • Attention mechanism breakdown
  • Query, Key, Value tensor analysis
  • Multi-head attention visualization
  • Attention pattern analysis
  • Layer-wise comparison
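
The mechanism under exploration is standard scaled dot-product attention: softmax(QKᵀ/√d)·V. A single-head, pure-Python sketch with toy 2-D vectors (real heads are much wider):

```python
from math import exp, sqrt

def softmax(xs: list[float]) -> list[float]:
    m = max(xs)                       # subtract max for numerical stability
    exps = [exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for one head (rows are vectors)."""
    d = len(K[0])
    scores = [[sum(qi * ki for qi, ki in zip(q, k)) / sqrt(d) for k in K]
              for q in Q]
    weights = [softmax(row) for row in scores]
    return [[sum(w * v[j] for w, v in zip(row, V)) for j in range(len(V[0]))]
            for row in weights]

out = attention([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[1.0], [0.0]])
```

The `weights` matrix inside `attention` is exactly what the notebook visualizes per head: how strongly each query position attends to each key position.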

View Tutorial 12

13: Token Embedding Visualizer (30 min)

3D UMAP embedding space exploration.

  • Token embedding extraction
  • Dimensionality reduction with UMAP
  • 3D visualization with Plotly
  • Semantic clustering analysis
  • Interactive exploration

View Tutorial 13


Observability Trilogy ⭐ NEW

Total time: 120 minutes | Difficulty: Intermediate-Expert

Production-grade observability with OpenTelemetry, GPU monitoring, and real-time dashboards.

14: OpenTelemetry LLM Observability (45 min)

Full OpenTelemetry integration with semantic conventions.

  • Complete OpenTelemetry setup (traces, metrics, logs)
  • LLM-specific semantic attributes
  • Distributed context propagation
  • OTLP export to popular backends
  • Graph-based trace visualization with Graphistry

What you'll build: Observable LLM service with distributed tracing
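
The LLM-specific attributes follow OpenTelemetry's GenAI semantic conventions, which are still incubating, so exact attribute names may evolve. A sketch of the attributes you would attach to an inference span (the helper name and model string are illustrative):

```python
def llm_span_attributes(model: str, prompt_tokens: int,
                        completion_tokens: int, temperature: float) -> dict:
    """Span attributes per the (incubating) OTel GenAI semantic conventions."""
    return {
        "gen_ai.system": "llama.cpp",
        "gen_ai.request.model": model,
        "gen_ai.request.temperature": temperature,
        "gen_ai.usage.input_tokens": prompt_tokens,
        "gen_ai.usage.output_tokens": completion_tokens,
    }

attrs = llm_span_attributes("llama-3-8b-q4_k_m", 128, 256, 0.7)
```

In the tutorial these attributes are set on a span from the OpenTelemetry SDK and exported over OTLP, so any backend that understands the conventions can aggregate token usage per model.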

View Tutorial 14

15: Real-time Performance Monitoring (30 min)

Live GPU monitoring with real-time Plotly dashboards.

  • llama.cpp /metrics endpoint integration
  • PyNVML GPU monitoring (VRAM, temp, power)
  • Real-time Plotly FigureWidget dashboards
  • Live metric updates (1-second intervals)
  • Multi-panel visualization layout

What you'll build: Live performance dashboard with GPU metrics
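
The dashboard's core loop is a fixed-interval poller with a bounded history window. The sketch below injects the sampler as a function so it stands in for either the `/metrics` endpoint or a PyNVML query (the demo uses a fake sampler and no sleep so it runs instantly):

```python
import time
from collections import deque

def poll_metrics(sample_fn, n_samples: int, interval_s: float = 1.0,
                 history: int = 60) -> list:
    """Poll a sampler on a fixed interval, keeping a bounded history.

    sample_fn is a stand-in for a real collector, e.g. a function that
    scrapes llama-server's /metrics endpoint or queries PyNVML.
    """
    window = deque(maxlen=history)       # old samples fall off automatically
    for _ in range(n_samples):
        window.append(sample_fn())
        if interval_s:
            time.sleep(interval_s)
    return list(window)

# Demo with a fake GPU sampler
fake = iter(range(100))
samples = poll_metrics(lambda: {"vram_used_mb": next(fake)},
                       n_samples=3, interval_s=0)
```

In the notebook, each iteration also pushes the newest window into a Plotly `FigureWidget`, which is what makes the dashboard update live at the 1-second cadence.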

View Tutorial 15

16: Production Observability Stack (45 min)

Complete production stack with multi-layer telemetry.

  • Full OpenTelemetry + GPU monitoring integration
  • Advanced Graphistry trace visualization
  • Comprehensive Plotly dashboards (2D + 3D)
  • Multi-layer telemetry collection
  • Production deployment patterns

What you'll build: Complete production observability stack

View Tutorial 16


Learning Paths

Choose your path based on your goals:

Path 1: Quick Start (45 minutes)

Goal: Get running fast

01 → 02 → 03

Perfect for beginners who want to start running inference quickly.

Path 2: Full Foundation (3 hours)

Goal: Master the fundamentals

01 → 02 → 03 → 04 → 05 → 06 → 10

Complete foundation for production LLM systems.

Path 3: Observability (2.5 hours)

Goal: Build observable systems

01 → 03 → 14 → 15 → 16

Best path for production observability stack.

Path 4: Graph Analytics (2.5 hours)

Goal: LLM-powered analytics

01 → 03 → 06 → 07 → 08 → 11

Build graph-based LLM applications.

Path 5: Large Model Specialist (1.5 hours)

Goal: Run 70B models

01 → 03 → 04 → 09

Optimize for massive models on limited hardware.

Path 6: Complete Mastery (8 hours)

Goal: Master everything

01 → 02 → 03 → 04 → 05 → 06 → 07 → 08 → 09 → 10 → 11 → 12 → 13 → 14 → 15 → 16

Complete llamatelemetry mastery from basics to production.


Kaggle Notebooks

All tutorials are available as Kaggle notebooks:

  • Repository: llamatelemetry/notebooks
  • Format: Jupyter notebooks (.ipynb)
  • Execution: Pre-executed with outputs
  • Requirements: Kaggle T4 x2, Internet enabled

Running on Kaggle

  1. Go to Kaggle
  2. Create new notebook or upload tutorial
  3. Set Accelerator to GPU T4 x2
  4. Enable Internet
  5. Run cells sequentially

Prerequisites

Knowledge

  • Python: Intermediate level
  • CUDA/GPU: Basic understanding
  • LLMs: Familiarity with language models
  • OpenTelemetry: None required (taught in tutorials)

Hardware

  • Recommended: Kaggle dual Tesla T4 (30GB VRAM total)
  • Minimum: Single Tesla T4 (15GB VRAM)

Software

  • Python 3.11+
  • CUDA 12.x (pre-installed on Kaggle)
  • Internet connection


Ready to learn? Start with Tutorial 01: Quick Start or jump to the Observability Trilogy.