
03: Multi-GPU Inference


Overview

Phase: Foundation | Difficulty: Beginner | Duration: 20 min

Learn multi-GPU inference for LLM serving on Kaggle's dual Tesla T4 GPUs.

Learning Objectives

  • ✅ Configure tensor-split across 2 GPUs
  • ✅ Test different split ratios (50/50, 70/30, 100/0)
  • ✅ Set up split-GPU mode (LLM + RAPIDS)
  • ✅ Benchmark multi-GPU performance
  • ✅ Verify GPU memory distribution

Topics Covered

  • 📚 Tensor parallelism
  • 📚 Dual T4
  • 📚 GPU memory distribution

Prerequisites

  • llamatelemetry v0.1.0 installed
  • Kaggle dual Tesla T4 environment (~30 GB total VRAM)
  • Basic Python knowledge
  • Completed Tutorial 01

Quick Start

# Install llamatelemetry
!pip install -q --no-cache-dir git+https://github.com/llamatelemetry/llamatelemetry.git@v0.1.0

# Verify GPU environment
!nvidia-smi --query-gpu=index,name,memory.total --format=csv
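On Kaggle's dual T4 runtime the query should list two GPUs. Exact totals depend on the driver, so treat the values below as approximate:

index, name, memory.total [MiB]
0, Tesla T4, 15360 MiB
1, Tesla T4, 15360 MiB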

Key Concepts

Server Configuration

llamatelemetry uses llama.cpp's llama-server for high-performance GGUF model inference.

from llamatelemetry.server import ServerManager

model_path = "/kaggle/working/model.gguf"  # example path: point this at your downloaded GGUF file

server = ServerManager()
server.start_server(
    model_path=model_path,
    gpu_layers=99,            # offload all layers to the GPUs
    ctx_size=4096,
    tensor_split="0.5,0.5"    # dual-GPU split: half the layers on each T4
)
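Before sending requests, it helps to wait until the server has finished loading the model onto both GPUs. A minimal sketch, assuming llama-server's default HTTP endpoint on localhost:8080 and its /health route; adjust the URL if ServerManager exposes a different port:

import time
import requests

BASE_URL = "http://localhost:8080"  # assumption: llama-server's default port

def wait_until_ready(timeout_s: float = 120.0) -> None:
    """Poll llama-server's /health endpoint until it answers OK."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            if requests.get(f"{BASE_URL}/health", timeout=2).status_code == 200:
                print("Server is ready")
                return
        except requests.RequestException:
            pass  # server still loading the model onto the GPUs
        time.sleep(2)
    raise TimeoutError("llama-server did not become ready in time")

wait_until_ready()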

Tensor Split Configuration

Equal Split (50/50):

tensor_split="0.5,0.5"  # Balanced across both GPUs

Asymmetric Split (70/30):

tensor_split="0.7,0.3"  # More on GPU 0

Split-GPU Mode (100/0):

tensor_split="1.0,0.0"  # GPU 0: LLM, GPU 1: RAPIDS
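To confirm where the weights actually landed, compare per-GPU memory usage after the model loads. A quick check using only nvidia-smi:

import subprocess

# Query used memory per GPU. With tensor_split="0.5,0.5" both GPUs should
# show a similar share of the model; with "1.0,0.0" GPU 1 stays nearly idle
# and is free for RAPIDS.
out = subprocess.run(
    ["nvidia-smi", "--query-gpu=index,memory.used", "--format=csv,noheader,nounits"],
    capture_output=True, text=True, check=True,
).stdout
for line in out.strip().splitlines():
    idx, used_mib = [field.strip() for field in line.split(",")]
    print(f"GPU {idx}: {used_mib} MiB in use")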

Step-by-Step Guide

Step 1: Environment Setup

Verify your Kaggle environment has dual Tesla T4 GPUs.
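A quick programmatic check (sketch; uses PyTorch, which Kaggle images ship with):

import torch

assert torch.cuda.is_available(), "No CUDA devices visible"
assert torch.cuda.device_count() == 2, "Expected two GPUs on the Kaggle T4 x2 runtime"
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))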

Step 2: Installation

Install llamatelemetry v0.1.0 with all dependencies.

Step 3: Configuration

Configure llama-server with a tensor split and context size suited to your workload (see Key Concepts above).

Step 4: Implementation

Start the server with each split ratio (50/50, 70/30, 100/0). For split-GPU mode, keep the model on GPU 0 and reserve GPU 1 for RAPIDS, as in the sketch below.
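A minimal split-GPU sketch, assuming the server was started with tensor_split="1.0,0.0" and that cuDF is available in the Kaggle RAPIDS image. Restricting this notebook process to GPU 1 before importing cuDF is one way to keep RAPIDS allocations off the inference GPU:

import os

# Assumption: llama-server runs in its own process with tensor_split="1.0,0.0",
# so GPU 0 holds the model and GPU 1 is free. Set this before any CUDA library
# initializes in this process.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import cudf  # assumption: available in the Kaggle RAPIDS image

df = cudf.DataFrame({"tokens_per_s": [85.0, 42.0, 25.0]})
print(df.describe())  # runs on GPU 1 while the LLM serves from GPU 0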

Step 5: Verification

Benchmark generation throughput and confirm that VRAM usage matches the configured split; a throughput sketch follows.
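A rough throughput check, assuming llama-server's native /completion endpoint on localhost:8080 (adjust the URL and prompt to your setup):

import time
import requests

BASE_URL = "http://localhost:8080"  # assumption: llama-server's default port

payload = {
    "prompt": "Explain tensor parallelism in one paragraph.",
    "n_predict": 256,  # number of tokens to generate
}
start = time.time()
resp = requests.post(f"{BASE_URL}/completion", json=payload, timeout=300)
resp.raise_for_status()
data = resp.json()
elapsed = time.time() - start

# tokens_predicted is reported in llama-server's /completion response;
# fall back to a rough word count if the field is absent.
n_tokens = data.get("tokens_predicted", len(data.get("content", "").split()))
print(f"{n_tokens} tokens in {elapsed:.1f}s -> {n_tokens / elapsed:.1f} tok/s")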

Expected Output

After completing this tutorial, you should be able to:

  • ✅ Configure tensor-split across 2 GPUs
  • ✅ Test different split ratios (50/50, 70/30, 100/0)
  • ✅ Set up split-GPU mode (LLM + RAPIDS)
  • ✅ Benchmark multi-GPU performance
  • ✅ Verify GPU memory distribution

Common Issues

Issue: Server Fails to Start

Solution: Check GPU memory and ensure no other processes are using the GPUs.

nvidia-smi

Issue: Out of Memory

Solution: Reduce context size or use lower quantization.

ctx_size=2048  # Instead of 4096

Performance Benchmarks

Expected performance on Kaggle dual Tesla T4:

Model          Quantization   Speed        VRAM
Gemma-3 1B     Q4_K_M         ~85 tok/s    ~1 GB
Gemma-3 4B     Q4_K_M         ~42 tok/s    ~2.5 GB
Llama-3.1 8B   Q4_K_M         ~25 tok/s    ~5 GB

Next Steps

Resources

Full Notebook

View and run the complete notebook on Kaggle.

