# Architecture Overview

llamatelemetry uses a split-GPU architecture to maximize hardware utilization on Kaggle's dual Tesla T4 GPUs.

## Core Architecture

### Split-GPU Design

llamatelemetry dedicates each GPU to a specific role:
- GPU 0 (15GB): LLM inference with llama-server
- GPU 1 (15GB): RAPIDS cuGraph + Graphistry visualization
This separation prevents memory contention and enables concurrent workloads.
```
┌─────────────────────────────────────────────────────────────┐
│                        Kaggle Dual T4                       │
├─────────────────────────┬───────────────────────────────────┤
│  GPU 0 (15GB)           │  GPU 1 (15GB)                     │
│                         │                                   │
│  ┌──────────────────┐   │  ┌───────────────────────────┐    │
│  │ llama-server     │   │  │ RAPIDS cuGraph            │    │
│  │ + FlashAttn      │   │  │ + cuDF                    │    │
│  │ + GGUF Model     │   │  │ + Graphistry              │    │
│  └──────────────────┘   │  └───────────────────────────┘    │
│           │             │              │                    │
│           ▼             │              ▼                    │
│  ┌──────────────────┐   │  ┌───────────────────────────┐    │
│  │ OpenTelemetry    │───┼─▶│ Graph Visualization       │    │
│  │ Traces + Metrics │   │  │ + Interactive Dashboards  │    │
│  └──────────────────┘   │  └───────────────────────────┘    │
└─────────────────────────┴───────────────────────────────────┘
```
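In practice this separation can be enforced at the process level with `CUDA_VISIBLE_DEVICES`, so neither workload can allocate on the other's GPU. A minimal sketch of the pattern (the model path, port, and launch details are illustrative placeholders, not llamatelemetry's actual startup code):

```python
import os
import subprocess

# Launch llama-server in a child process that can only see GPU 0.
# "model.gguf" and the port are placeholders for this sketch.
server_env = {**os.environ, "CUDA_VISIBLE_DEVICES": "0"}
subprocess.Popen(
    ["llama-server", "-m", "model.gguf", "--port", "8080"],
    env=server_env,
)

# In the analytics process, expose only GPU 1 *before* any CUDA
# library creates a context; cuDF and cuGraph then see it as device 0.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"
import cudf  # imported after the env var is set, so it binds to GPU 1
```

Because `CUDA_VISIBLE_DEVICES` must be set before a CUDA context is created, the two sides typically run as separate processes, e.g. a server process and a notebook kernel.
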
## Components

### GPU 0: LLM Inference

- llama-server: C++ inference engine
- GGUF Model: Quantized weights
- FlashAttention v2: Optimized attention
- OpenTelemetry SDK: Telemetry collection
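
On the telemetry side, each inference call can be wrapped in a span. Below is a minimal sketch using the stock OpenTelemetry Python SDK; the span and attribute names are illustrative, not llamatelemetry's actual schema:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Console exporter keeps the sketch self-contained; a real setup
# would export via OTLP to a collector instead.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("llamatelemetry.inference")

with tracer.start_as_current_span("llm.completion") as span:
    span.set_attribute("llm.model", "model.gguf")  # placeholder attributes
    # ... call llama-server here, then record token counts and latency ...
    span.set_attribute("llm.completion_tokens", 128)
```
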
### GPU 1: Analytics & Visualization

- RAPIDS cuGraph: GPU-accelerated graph processing
- cuDF: GPU dataframes
- Graphistry: Interactive visualization
- Plotly: Real-time dashboards
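
The cuDF + cuGraph pattern on GPU 1 looks roughly like this; the edge list is made-up data standing in for a real trace-dependency graph:

```python
import cudf
import cugraph

# Toy edge list standing in for a call/trace graph (illustrative data).
edges = cudf.DataFrame({"src": [0, 0, 1, 2], "dst": [1, 2, 3, 3]})

G = cugraph.Graph(directed=True)
G.from_cudf_edgelist(edges, source="src", destination="dst")

# PageRank runs entirely on the GPU and returns a cuDF DataFrame
# with "vertex" and "pagerank" columns.
ranks = cugraph.pagerank(G)
print(ranks.sort_values("pagerank", ascending=False).head())
```
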
## Data Flow

1. Client Request → llama-server (GPU 0)
2. Inference → GGUF model processing
3. Telemetry → OpenTelemetry traces/metrics
4. Analysis → RAPIDS cuGraph (GPU 1)
5. Visualization → Graphistry/Plotly
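
Steps 1–3 of this flow can be sketched end to end as below, assuming llama-server's stock OpenAI-compatible `/v1/chat/completions` endpoint on a placeholder port:

```python
import requests
import cudf

# Steps 1-2: client request hits llama-server on GPU 0.
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={"messages": [{"role": "user", "content": "Hello"}]},
).json()

# Step 3: token usage becomes a row in a GPU dataframe on GPU 1,
# ready for the cuGraph and Graphistry/Plotly stages.
usage = cudf.DataFrame({
    "prompt_tokens": [resp["usage"]["prompt_tokens"]],
    "completion_tokens": [resp["usage"]["completion_tokens"]],
})
```
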
## Tensor Split

llamatelemetry exposes a tensor_split parameter that controls how model weights are distributed across GPUs:

```python
# GPU 0 only (recommended)
tensor_split="1.0,0.0"

# Equal split across both GPUs
tensor_split="0.5,0.5"

# Custom split (70% GPU 0, 30% GPU 1)
tensor_split="0.7,0.3"
```
## Performance Benefits

| Configuration | VRAM (GPU 0) | VRAM (GPU 1) | Trade-off |
|---|---|---|---|
| Split-GPU | Model only | Free for analytics | Maximum utilization |
| Dual-GPU Inference | 50% of model | 50% of model | Higher throughput |
| Single GPU | Full model | Unused | Simple but wasteful |
## Next Steps

- Split-GPU Pattern - Detailed guide
- GPU 0 - LLM Inference - Inference setup
- GPU 1 - Analytics - Visualization setup
- Tensor Split vs NCCL - Configuration guide