# Architecture Overview

llamatelemetry uses a split-GPU architecture to maximize hardware utilization on Kaggle's dual Tesla T4 GPUs.

## Core Architecture

### Split-GPU Design

llamatelemetry dedicates each GPU to a specific role:
- GPU 0 (15GB): LLM inference with llama-server
- GPU 1 (15GB): RAPIDS cuGraph + Graphistry visualization
This separation prevents memory contention and enables concurrent workloads.
```
┌─────────────────────────────────────────────────────────────┐
│                        Kaggle Dual T4                       │
├─────────────────────────┬───────────────────────────────────┤
│  GPU 0 (15GB)           │  GPU 1 (15GB)                     │
│                         │                                   │
│  ┌──────────────────┐   │  ┌───────────────────────────┐    │
│  │ llama-server     │   │  │ RAPIDS cuGraph            │    │
│  │ + FlashAttn      │   │  │ + cuDF                    │    │
│  │ + GGUF Model     │   │  │ + Graphistry              │    │
│  └──────────────────┘   │  └───────────────────────────┘    │
│           │             │              │                    │
│           ▼             │              ▼                    │
│  ┌──────────────────┐   │  ┌───────────────────────────┐    │
│  │ OpenTelemetry    │───┼─▶│ Graph Visualization       │    │
│  │ Traces + Metrics │   │  │ + Interactive Dashboards  │    │
│  └──────────────────┘   │  └───────────────────────────┘    │
└─────────────────────────┴───────────────────────────────────┘
```
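In practice this separation can be enforced at the process level with `CUDA_VISIBLE_DEVICES`, so neither workload can allocate on the other's GPU. A minimal sketch of the pattern (the model path, port, and launch details are illustrative placeholders, not llamatelemetry's actual startup code):

```python
import os
import subprocess

# Launch llama-server in a child process that can only see GPU 0.
# "model.gguf" and the port are placeholders for this sketch.
server_env = {**os.environ, "CUDA_VISIBLE_DEVICES": "0"}
subprocess.Popen(
    ["llama-server", "-m", "model.gguf", "--port", "8080"],
    env=server_env,
)

# In the analytics process, expose only GPU 1 *before* any CUDA
# library creates a context; cuDF and cuGraph then see it as device 0.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"
import cudf  # imported after the env var is set, so it binds to GPU 1
```

Because `CUDA_VISIBLE_DEVICES` must be set before a CUDA context is created, the two sides typically run as separate processes, e.g. a server process and a notebook kernel.
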
## Components

### GPU 0: LLM Inference

- llama-server: C++ inference engine
- GGUF Model: Quantized weights
- FlashAttention v2: Optimized attention
- OpenTelemetry SDK: Telemetry collection
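
On the telemetry side, each inference call can be wrapped in a span. Below is a minimal sketch using the stock OpenTelemetry Python SDK; the span and attribute names are illustrative, not llamatelemetry's actual schema:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Console exporter keeps the sketch self-contained; a real setup
# would export via OTLP to a collector instead.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("llamatelemetry.inference")

with tracer.start_as_current_span("llm.completion") as span:
    span.set_attribute("llm.model", "model.gguf")  # placeholder attributes
    # ... call llama-server here, then record token counts and latency ...
    span.set_attribute("llm.completion_tokens", 128)
```
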
### GPU 1: Analytics & Visualization

- RAPIDS cuGraph: GPU-accelerated graph processing
- cuDF: GPU dataframes
- Graphistry: Interactive visualization
- Plotly: Real-time dashboards
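
The cuDF + cuGraph pattern on GPU 1 looks roughly like this; the edge list is made-up data standing in for a real trace-dependency graph:

```python
import cudf
import cugraph

# Toy edge list standing in for a call/trace graph (illustrative data).
edges = cudf.DataFrame({"src": [0, 0, 1, 2], "dst": [1, 2, 3, 3]})

G = cugraph.Graph(directed=True)
G.from_cudf_edgelist(edges, source="src", destination="dst")

# PageRank runs entirely on the GPU and returns a cuDF DataFrame
# with "vertex" and "pagerank" columns.
ranks = cugraph.pagerank(G)
print(ranks.sort_values("pagerank", ascending=False).head())
```
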
## Data Flow

1. Client Request → llama-server (GPU 0)
2. Inference → GGUF model processing
3. Telemetry → OpenTelemetry traces/metrics
4. Analysis → RAPIDS cuGraph (GPU 1)
5. Visualization → Graphistry/Plotly
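
Steps 1–3 of this flow can be sketched end to end as below, assuming llama-server's stock OpenAI-compatible `/v1/chat/completions` endpoint on a placeholder port:

```python
import requests
import cudf

# Steps 1-2: client request hits llama-server on GPU 0.
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={"messages": [{"role": "user", "content": "Hello"}]},
).json()

# Step 3: token usage becomes a row in a GPU dataframe on GPU 1,
# ready for the cuGraph and Graphistry/Plotly stages.
usage = cudf.DataFrame({
    "prompt_tokens": [resp["usage"]["prompt_tokens"]],
    "completion_tokens": [resp["usage"]["completion_tokens"]],
})
```
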
## Tensor Split

llamatelemetry exposes a tensor_split parameter that controls how model weights are distributed across GPUs:

```python
# GPU 0 only (recommended)
tensor_split="1.0,0.0"

# Equal split across both GPUs
tensor_split="0.5,0.5"

# Custom split (70% GPU 0, 30% GPU 1)
tensor_split="0.7,0.3"
```
## Performance Benefits

| Configuration | VRAM (GPU 0) | VRAM (GPU 1) | Trade-off |
|---|---|---|---|
| Split-GPU | Model only | Free for analytics | Maximum utilization |
| Dual-GPU Inference | 50% of model | 50% of model | Higher throughput |
| Single GPU | Full model | Unused | Simple but wasteful |
## Next Steps

- Split-GPU Pattern - Detailed guide
- GPU 0 - LLM Inference - Inference setup
- GPU 1 - Analytics - Visualization setup
- Tensor Split vs NCCL - Configuration guide