
Architecture Overview

llamatelemetry uses a split-GPU architecture to maximize hardware utilization on the dual Tesla T4 GPUs that Kaggle provides.


Core Architecture

Split-GPU Design

llamatelemetry dedicates each GPU to a specific role:

  • GPU 0 (15GB): LLM inference with llama-server
  • GPU 1 (15GB): RAPIDS cuGraph + Graphistry visualization

This separation prevents memory contention and enables concurrent workloads.
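In practice, this role assignment is enforced by restricting each process to a single device. A minimal sketch (the `CUDA_VISIBLE_DEVICES` mechanism is standard CUDA behavior; the variable names here are illustrative, not llamatelemetry internals):

```python
import os

def gpu_env(gpu_index: int) -> dict:
    """Build an environment that restricts a child process to one GPU.

    CUDA_VISIBLE_DEVICES remaps the listed physical device to logical
    device 0 inside the process, so each workload sees exactly one GPU.
    """
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env

# GPU 0 for the inference server, GPU 1 for the analytics stack
inference_env = gpu_env(0)   # llama-server
analytics_env = gpu_env(1)   # RAPIDS cuGraph + Graphistry
```

Each environment would then be passed to the corresponding subprocess, keeping the two workloads memory-isolated.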

┌──────────────────────────────────────────────────────────┐
│                      Kaggle Dual T4                      │
├─────────────────────────┬────────────────────────────────┤
│       GPU 0 (15GB)      │          GPU 1 (15GB)          │
│                         │                                │
│  ┌──────────────────┐   │  ┌──────────────────────────┐  │
│  │  llama-server    │   │  │  RAPIDS cuGraph          │  │
│  │  + FlashAttn     │   │  │  + cuDF                  │  │
│  │  + GGUF Model    │   │  │  + Graphistry            │  │
│  └──────────────────┘   │  └──────────────────────────┘  │
│          │              │               │                │
│          ▼              │               ▼                │
│  ┌──────────────────┐   │  ┌──────────────────────────┐  │
│  │ OpenTelemetry    │───┼─▶│  Graph Visualization     │  │
│  │ Traces + Metrics │   │  │  + Interactive Dashboards│  │
│  └──────────────────┘   │  └──────────────────────────┘  │
└─────────────────────────┴────────────────────────────────┘

Components

GPU 0: LLM Inference

  • llama-server: C++ inference engine
  • GGUF Model: Quantized weights
  • FlashAttention v2: Optimized attention
  • OpenTelemetry SDK: Telemetry collection
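As a sketch, the components above come together in a single server invocation. The flags below are standard llama.cpp server options, but the model path and port are placeholders, not llamatelemetry defaults:

```python
# Sketch: assemble a llama-server command line for GPU 0.
# Flags are standard llama.cpp options; the path and port are examples.
def build_server_cmd(model_path: str, port: int = 8080) -> list[str]:
    return [
        "llama-server",
        "-m", model_path,             # quantized GGUF weights
        "--port", str(port),
        "-ngl", "99",                 # offload all layers to the GPU
        "--flash-attn",               # FlashAttention kernels
        "--tensor-split", "1.0,0.0",  # keep the model on GPU 0 only
    ]

cmd = build_server_cmd("/kaggle/models/model.gguf")
```

The resulting argument list would be launched under an environment pinned to GPU 0, leaving GPU 1 free for analytics.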


GPU 1: Analytics & Visualization

  • RAPIDS cuGraph: GPU-accelerated graph processing
  • cuDF: GPU dataframes
  • Graphistry: Interactive visualization
  • Plotly: Real-time dashboards
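The analytics side consumes telemetry as a graph. A CPU-side sketch of the shape of that data (the span fields are illustrative, not the llamatelemetry schema; in the real pipeline cuDF/cuGraph would hold these columns on GPU 1):

```python
# Sketch: turn parent/child span relationships into an edge list,
# the src/dst format that graph libraries such as cuGraph consume.
# Field names here are illustrative placeholders.
spans = [
    {"span_id": "a1", "parent_id": None, "name": "request"},
    {"span_id": "b2", "parent_id": "a1", "name": "inference"},
    {"span_id": "c3", "parent_id": "a1", "name": "telemetry_export"},
]

edges = [
    (s["parent_id"], s["span_id"])
    for s in spans
    if s["parent_id"] is not None
]
# edges == [("a1", "b2"), ("a1", "c3")]
```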



Data Flow

  1. Client Request → llama-server (GPU 0)
  2. Inference → GGUF model processing
  3. Telemetry → OpenTelemetry traces/metrics
  4. Analysis → RAPIDS cuGraph (GPU 1)
  5. Visualization → Graphistry/Plotly
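The five stages above can be sketched as a pipeline of pure-Python stand-ins. Only the stage names mirror the real flow; every value below is illustrative:

```python
# Sketch of the five-stage flow. Each function mimics one step; the
# real work happens in llama-server, the OpenTelemetry SDK, cuGraph,
# and Graphistry/Plotly respectively.
def serve(prompt: str) -> dict:            # 1-2: llama-server + GGUF model
    return {"prompt": prompt, "completion": "..."}

def emit_telemetry(result: dict) -> dict:  # 3: OpenTelemetry traces/metrics
    return {"span": "inference", "attributes": result}

def analyze(trace: dict) -> dict:          # 4: RAPIDS cuGraph (GPU 1)
    return {"nodes": 1, "edges": 0, "source": trace["span"]}

def visualize(stats: dict) -> str:         # 5: Graphistry/Plotly
    return f"graph({stats['nodes']} nodes, {stats['edges']} edges)"

view = visualize(analyze(emit_telemetry(serve("hello"))))
```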



Tensor Split

llamatelemetry uses the tensor_split parameter to control how model weights are distributed across the GPUs:

# GPU 0 only (recommended)
tensor_split="1.0,0.0"

# Equal split across both GPUs
tensor_split="0.5,0.5"

# Custom split (70% GPU 0, 30% GPU 1)
tensor_split="0.7,0.3"
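To see what a given split means in memory terms, a small helper can convert the ratio string into per-GPU figures (the function and the 8 GB model size are illustrative, not part of llamatelemetry):

```python
# Sketch: estimate per-GPU model VRAM for a tensor_split string.
# Splits are proportional weights over the model's layer memory;
# the 8 GB model size below is an example figure.
def split_vram(tensor_split: str, model_gb: float) -> list[float]:
    weights = [float(w) for w in tensor_split.split(",")]
    total = sum(weights)
    return [model_gb * w / total for w in weights]

print(split_vram("1.0,0.0", 8.0))  # [8.0, 0.0] -> model entirely on GPU 0
print(split_vram("0.7,0.3", 8.0))  # [5.6, 2.4]
```

With the recommended "1.0,0.0" split, GPU 1's full 15GB stays available for the analytics stack.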



Performance Benefits

Configuration        VRAM (GPU 0)    VRAM (GPU 1)         Benefits
Split-GPU            Model only      Free for analytics   Maximum utilization
Dual-GPU Inference   50% of model    50% of model         Higher throughput
Single GPU           Full model      Unused               Simple but wasteful

Next Steps