
Tutorial 01: Quickstart Guide


Overview

Get started with llamatelemetry v0.1.0 in 10-15 minutes. This tutorial walks you through installing llamatelemetry, launching llama-server with multi-GPU support, and running your first inference on Kaggle's dual Tesla T4 GPUs.

Learning Objectives

  • Install llamatelemetry v0.1.0 with pre-built CUDA binaries
  • Verify dual GPU environment and CUDA compatibility
  • Download and configure GGUF models from HuggingFace
  • Start llama-server with multi-GPU tensor-split
  • Perform chat completions using OpenAI-compatible API
  • Monitor GPU memory usage across multiple devices

Prerequisites

  • Kaggle notebook with 2× Tesla T4 GPUs enabled
  • Basic Python knowledge
  • HuggingFace account (optional, for downloading models)

Time Estimate

10-15 minutes

Key Concepts

  • llamatelemetry: CUDA-accelerated inference backend for GGUF models
  • Tensor Split: Distributing model layers across multiple GPUs (see the sketch after this list)
  • GGUF: Quantized model format for efficient inference
  • llama-server: OpenAI-compatible API server for local LLM inference
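
For intuition about tensor splits, here is a tiny sketch of how the split fractions translate into per-GPU shares of the model weights (the 2.32 GB figure comes from Step 3; the real split is decided layer by layer, so treat this as an approximation):

# Rough illustration: tensor-split fractions vs. per-GPU share of model weights.
# The actual split happens layer by layer, so these numbers are approximate.
model_size_gb = 2.32        # Gemma 3-4B Q4_K_M (see Step 3)
tensor_split = [0.5, 0.5]   # equal split across the two T4s (see Step 4)

for gpu_index, fraction in enumerate(tensor_split):
    print(f"GPU {gpu_index}: ~{model_size_gb * fraction:.2f} GB of weights")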

Step 1: Verify Kaggle GPU Environment

First, validate that you have dual Tesla T4 GPUs with the required CUDA version and VRAM capacity.

import subprocess
import os

print("="*70)
print("🔍 KAGGLE GPU ENVIRONMENT CHECK")
print("="*70)

# Check nvidia-smi
result = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
gpu_lines = [l for l in result.stdout.strip().split("\n") if l.startswith("GPU")]
print(f"\n📊 Detected GPUs: {len(gpu_lines)}")
for line in gpu_lines:
    print(f"   {line}")

# Check CUDA version
print("\n📊 CUDA Version:")
!nvcc --version | grep release

# Check total VRAM
print("\n📊 VRAM Summary:")
!nvidia-smi --query-gpu=index,name,memory.total --format=csv

# Verify we have 2 GPUs
if len(gpu_lines) >= 2:
    print("\n✅ Multi-GPU environment confirmed! Ready for llamatelemetry v0.1.0.")
else:
    print("\n⚠️ WARNING: Less than 2 GPUs detected!")
    print("   Enable 'GPU T4 x2' in Kaggle notebook settings.")

Expected Output:

======================================================================
🔍 KAGGLE GPU ENVIRONMENT CHECK
======================================================================

📊 Detected GPUs: 2
   GPU 0: Tesla T4 (UUID: GPU-7ba2c248-3b76-b125-7f27-7ac05c7faf42)
   GPU 1: Tesla T4 (UUID: GPU-27208f88-2843-f816-0b43-cf8b64926aac)

📊 CUDA Version:
Cuda compilation tools, release 12.5, V12.5.82

📊 VRAM Summary:
index, name, memory.total [MiB]
0, Tesla T4, 15360 MiB
1, Tesla T4, 15360 MiB

✅ Multi-GPU environment confirmed! Ready for llamatelemetry v0.1.0.


Step 2: Install llamatelemetry v0.1.0

Install llamatelemetry from GitHub with pre-built CUDA binaries optimized for Kaggle's dual T4 GPUs.

%%time
# Install llamatelemetry v0.1.0 from GitHub
print("📦 Installing llamatelemetry v0.1.0...")
!pip install -q --no-cache-dir --force-reinstall git+https://github.com/llamatelemetry/llamatelemetry.git@v0.1.0

# Verify installation
import llamatelemetry
print(f"\n✅ llamatelemetry {llamatelemetry.__version__} installed!")

# Check llamatelemetry status
from llamatelemetry import check_cuda_available, get_cuda_device_info
from llamatelemetry.api.multigpu import gpu_count

cuda_info = get_cuda_device_info()
print(f"\n📊 llamatelemetry Status:")
print(f"   CUDA Available: {check_cuda_available()}")
print(f"   GPUs: {gpu_count()}")
if cuda_info:
    print(f"   CUDA Version: {cuda_info.get('cuda_version', 'N/A')}")

What happens during installation:

  • llamatelemetry automatically detects your GPU (Tesla T4)
  • Downloads ~961 MB of pre-compiled CUDA binaries from HuggingFace
  • Installs a FlashAttention-enabled llama-server for faster inference
  • Verifies binary integrity with SHA256 checksums
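
If you want to confirm where the pre-built binary landed, here is a minimal sketch; it assumes the binaries ship inside the installed package under a binaries/ directory, which matches the executable path shown in Step 4:

from pathlib import Path
import llamatelemetry

# Look for the bundled llama-server binary inside the installed package.
pkg_dir = Path(llamatelemetry.__file__).parent
binaries = list(pkg_dir.glob("binaries/**/llama-server"))

if binaries:
    for binary in binaries:
        size_mb = binary.stat().st_size / (1024**2)
        print(f"Found {binary} ({size_mb:.0f} MB)")
else:
    print(f"No llama-server binary found under {pkg_dir}")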


Step 3: Download GGUF Model

Download a quantized model from HuggingFace Hub. We'll use Gemma 3-4B Instruct in Q4_K_M quantization.

%%time
from huggingface_hub import hf_hub_download
import os

# Model selection - optimized for 15GB VRAM
MODEL_REPO = "unsloth/gemma-3-4b-it-GGUF"
MODEL_FILE = "gemma-3-4b-it-Q4_K_M.gguf"

print(f"📥 Downloading {MODEL_FILE}...")
print(f"   Repository: {MODEL_REPO}")
print(f"   Expected size: ~2.5GB")

# Download to Kaggle working directory
model_path = hf_hub_download(
    repo_id=MODEL_REPO,
    filename=MODEL_FILE,
    local_dir="/kaggle/working/models"
)

print(f"\n✅ Model downloaded: {model_path}")

# Show model size
size_gb = os.path.getsize(model_path) / (1024**3)
print(f"   Size: {size_gb:.2f} GB")

Expected Output:

📥 Downloading gemma-3-4b-it-Q4_K_M.gguf...
   Repository: unsloth/gemma-3-4b-it-GGUF
   Expected size: ~2.5GB

✅ Model downloaded: /kaggle/working/models/gemma-3-4b-it-Q4_K_M.gguf
   Size: 2.32 GB
CPU times: user 5.63 s, sys: 8.75 s, total: 14.4 s
Wall time: 5.44 s
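
If you want a different size/quality trade-off, you can list the other GGUF quantizations published in the same repository; this sketch reuses MODEL_REPO from the download cell:

from huggingface_hub import list_repo_files

# List every GGUF file in the repository to compare the available quantizations.
gguf_files = [f for f in list_repo_files(MODEL_REPO) if f.endswith(".gguf")]
for filename in sorted(gguf_files):
    print(filename)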


Step 4: Start llama-server with Multi-GPU Configuration

Launch llama-server with a dual-GPU tensor-split configuration so the model weights are distributed across both T4s.

from llamatelemetry.server import ServerManager
from llamatelemetry.api.multigpu import kaggle_t4_dual_config

# Get optimized configuration for Kaggle T4×2
config = kaggle_t4_dual_config()

print("🚀 Starting llama-server with Multi-GPU configuration...")
print(f"   Model: {model_path}")
print(f"   GPU Layers: {config.n_gpu_layers} (all layers)")
print(f"   Context Size: {config.ctx_size}")
print(f"   Tensor Split: {config.tensor_split} (equal across 2 GPUs)")
print(f"   Flash Attention: {config.flash_attention}")

# Create server manager
server = ServerManager(server_url="http://127.0.0.1:8080")

# Start server with multi-GPU configuration
tensor_split_str = ",".join(str(x) for x in config.tensor_split)

server.start_server(
    model_path=model_path,
    host="127.0.0.1",
    port=8080,
    gpu_layers=config.n_gpu_layers,
    ctx_size=config.ctx_size,
    timeout=120,
    verbose=True,
    flash_attn=1 if config.flash_attention else 0,
    split_mode="layer",
    tensor_split=tensor_split_str,
)

print("\n✅ llama-server is ready with dual T4 GPUs!")
print(f"   API endpoint: http://127.0.0.1:8080")

Expected Output:

🚀 Starting llama-server with Multi-GPU configuration...
   Model: /kaggle/working/models/gemma-3-4b-it-Q4_K_M.gguf
   GPU Layers: -1 (all layers)
   Context Size: 8192
   Tensor Split: [0.5, 0.5] (equal across 2 GPUs)
   Flash Attention: True

Starting llama-server...
  Executable: /usr/local/lib/python3.12/dist-packages/llamatelemetry/binaries/cuda12/llama-server
  Model: gemma-3-4b-it-Q4_K_M.gguf
  GPU Layers: -1
  Context Size: 8192
  Server URL: http://127.0.0.1:8080
Waiting for server to be ready........ ✓ Ready in 5.1s

✅ llama-server is ready with dual T4 GPUs!
   API endpoint: http://127.0.0.1:8080
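
Before sending any requests, you can optionally poke the server directly. This sketch assumes the standard llama-server /health endpoint is exposed at the same address:

import requests

# Optional sanity check against the running server.
resp = requests.get("http://127.0.0.1:8080/health", timeout=5)
print("Health check:", resp.status_code, resp.text.strip())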


Step 5: Test Chat Completion

Run your first inference using the OpenAI-compatible chat completion API.

from llamatelemetry.api.client import LlamaCppClient

# Create client
client = LlamaCppClient(base_url="http://127.0.0.1:8080")

# Test simple completion
print("💬 Testing inference...\n")

response = client.chat.create(
    messages=[
        {"role": "user", "content": "What is CUDA? Explain in 2 sentences."}
    ],
    max_tokens=100,
    temperature=0.7
)

print("📝 Response:")
print(response.choices[0].message.content)

print(f"\n📊 Stats:")
print(f"   Tokens generated: {response.usage.completion_tokens}")
print(f"   Total tokens: {response.usage.total_tokens}")

Expected Output:

💬 Testing inference...

📝 Response:
CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming
model developed by NVIDIA that allows developers to utilize the massive processing power of
their GPUs for general-purpose computations – essentially turning graphics cards into powerful
accelerators. It enables you to write code that runs simultaneously on many GPU cores,
dramatically speeding up tasks like machine learning, scientific simulations, and image/video
processing.

📊 Stats:
   Tokens generated: 76
   Total tokens: 95
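
Because the server speaks the OpenAI chat-completions protocol, the same request also works without the llamatelemetry client. This sketch sends it with plain requests, assuming the default /v1/chat/completions path:

import requests

# Same chat completion, sent straight to the OpenAI-compatible HTTP endpoint.
payload = {
    "messages": [
        {"role": "user", "content": "What is CUDA? Explain in 2 sentences."}
    ],
    "max_tokens": 100,
    "temperature": 0.7,
}
resp = requests.post("http://127.0.0.1:8080/v1/chat/completions", json=payload, timeout=120)
data = resp.json()
print(data["choices"][0]["message"]["content"])
print("Completion tokens:", data["usage"]["completion_tokens"])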


Step 6: Test Streaming Responses

Demonstrate streaming completion for real-time token generation.

# Streaming example
print("💬 Streaming response...\n")

for chunk in client.chat.create(
    messages=[
        {"role": "user", "content": "Write a Python function to calculate factorial."}
    ],
    max_tokens=200,
    temperature=0.3,
    stream=True  # Enable streaming
):
    if hasattr(chunk, 'choices') and chunk.choices:
        delta = chunk.choices[0].delta
        if hasattr(delta, 'content') and delta.content:
            print(delta.content, end="", flush=True)

print("\n\n✅ Streaming complete!")

Step 7: Monitor GPU Memory Usage

Check VRAM consumption across both GPUs to verify multi-GPU distribution.

# Check GPU memory usage
print("📊 GPU Memory Usage:")
print("="*60)
!nvidia-smi --query-gpu=index,name,memory.used,memory.total,utilization.gpu --format=csv

print("\n💡 Note:")
print("   GPU 0: llama-server (LLM inference)")
print("   GPU 1: Available for RAPIDS/Graphistry")

Expected Output:

📊 GPU Memory Usage:
============================================================
index, name, memory.used [MiB], memory.total [MiB], utilization.gpu [%]
0, Tesla T4, 1307 MiB, 15360 MiB, 0 %
1, Tesla T4, 1797 MiB, 15360 MiB, 0 %

💡 Note:
   Model weights are split across both GPUs (tensor_split=[0.5, 0.5]),
   leaving spare VRAM on each T4 for other workloads such as RAPIDS/Graphistry.
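
The same query can also be done programmatically, which is handy if you want to assert on memory usage from inside the notebook rather than read the printout:

import subprocess

# Parse nvidia-smi output into a {gpu_index: MiB used} dictionary.
result = subprocess.run(
    ["nvidia-smi", "--query-gpu=index,memory.used", "--format=csv,noheader,nounits"],
    capture_output=True, text=True,
)
memory_used = {}
for line in result.stdout.strip().splitlines():
    index, used_mib = (field.strip() for field in line.split(","))
    memory_used[int(index)] = int(used_mib)

print("Per-GPU memory used (MiB):", memory_used)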


Step 8: Cleanup and Release Resources

Stop llama-server and free GPU memory.

# Stop the server
print("🛑 Stopping llama-server...")
server.stop_server()
print("\n✅ Server stopped. Resources freed.")

# Verify GPU memory is released
print("\n📊 GPU Memory After Cleanup:")
!nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv

Expected Output:

🛑 Stopping llama-server...

✅ Server stopped. Resources freed.

📊 GPU Memory After Cleanup:
index, memory.used [MiB], memory.total [MiB]
0, 0 MiB, 15360 MiB
1, 0 MiB, 15360 MiB


Summary

You've successfully:

  1. ✅ Verified Kaggle GPU environment
  2. ✅ Installed llamatelemetry v0.1.0
  3. ✅ Downloaded a GGUF model
  4. ✅ Started llama-server with multi-GPU support
  5. ✅ Ran inference with chat completion
  6. ✅ Used streaming responses
  7. ✅ Monitored GPU memory across both GPUs
  8. ✅ Stopped the server and released GPU memory

Next Steps

Troubleshooting

Server fails to start

  • Ensure you have 2 GPUs enabled in Kaggle settings
  • Check that model path exists: os.path.exists(model_path)
  • Verify CUDA is available: torch.cuda.is_available() (both checks are combined in the sketch below)
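
The two checks above can be combined into a quick pre-flight cell; torch is assumed to be available, as it is on the standard Kaggle image:

import os
import torch

# Pre-flight checks before restarting llama-server.
print("Model file exists:", os.path.exists(model_path))
print("CUDA available:   ", torch.cuda.is_available())
print("Visible GPUs:     ", torch.cuda.device_count())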

Out of memory errors

  • Try a smaller model (e.g., Gemma 3-1B)
  • Reduce context size: ctx_size=4096
  • Use single GPU: tensor_split="1.0,0.0" (see the restart sketch below)
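
A reduced-memory restart might look like the sketch below; it mirrors the Step 4 call, reusing the server and model_path objects, with a smaller context and the whole model kept on GPU 0:

# Restart with a smaller footprint: shorter context, single-GPU placement.
server.stop_server()
server.start_server(
    model_path=model_path,
    host="127.0.0.1",
    port=8080,
    gpu_layers=-1,           # still offload all layers
    ctx_size=4096,           # reduced from 8192
    timeout=120,
    verbose=True,
    flash_attn=1,
    split_mode="layer",
    tensor_split="1.0,0.0",  # keep the whole model on GPU 0
)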

Slow inference

  • Enable FlashAttention: flash_attn=1
  • Increase batch size: batch_size=1024
  • Use Q4_K_M quantization for speed

llamatelemetry v0.1.0 | CUDA 12 Inference Backend for Unsloth