Quick Start¶
Get up and running with llamatelemetry on Kaggle dual T4 GPUs in under 5 minutes.
Prerequisites¶
- Kaggle notebook with GPU T4 x2 enabled
- Internet enabled
- Python 3.11+
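To confirm these prerequisites programmatically before installing, a minimal pre-flight check using only the standard library (no llamatelemetry imports) could look like:
import shutil
import sys
# Python 3.11+ is required
assert sys.version_info >= (3, 11), f"Python 3.11+ required, found {sys.version.split()[0]}"
# nvidia-smi is on PATH once the GPU T4 x2 accelerator is enabled
assert shutil.which("nvidia-smi") is not None, "nvidia-smi not found: enable the GPU T4 x2 accelerator"
print("Prerequisites look good.")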
Step 1: Install llamatelemetry¶
!pip install -q --no-cache-dir --force-reinstall \
git+https://github.com/llamatelemetry/llamatelemetry.git@v0.1.0
This will:
- Install the Python SDK
- Download pre-built CUDA binaries (~961 MB) on first import
- Configure FlashAttention support
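If you want to confirm the install without triggering the binary download, you can query the package metadata instead of importing it; this sketch assumes the distribution is published under the name llamatelemetry:
from importlib.metadata import PackageNotFoundError, version
try:
    # Distribution name assumed to match the package name
    print("llamatelemetry installed, version:", version("llamatelemetry"))
except PackageNotFoundError:
    print("llamatelemetry is not installed")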
Step 2: Verify GPU Environment¶
import subprocess
# Check GPUs
result = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
print(result.stdout)
# Verify llamatelemetry
import llamatelemetry
from llamatelemetry.api.multigpu import gpu_count
print(f"llamatelemetry version: {llamatelemetry.__version__}")
print(f"GPUs detected: {gpu_count()}")
Expected output:
GPU 0: Tesla T4 (UUID: GPU-...)
GPU 1: Tesla T4 (UUID: GPU-...)
llamatelemetry version: 0.1.0
GPUs detected: 2
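If your workflow depends on both GPUs, it helps to fail fast when the notebook was started without the dual-T4 accelerator. A small guard built on the gpu_count() call above:
n = gpu_count()
if n < 2:
    raise RuntimeError(
        f"Expected 2 GPUs for the split-GPU workflow, found {n}. "
        "Enable the 'GPU T4 x2' accelerator in the notebook settings."
    )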
Step 3: Download a GGUF Model¶
Download a small model optimized for T4 GPUs:
from huggingface_hub import hf_hub_download
model_path = hf_hub_download(
    repo_id="unsloth/gemma-3-1b-it-GGUF",
    filename="gemma-3-1b-it-Q4_K_M.gguf",
    local_dir="/kaggle/working/models",
)
print(f"Model downloaded: {model_path}")
Model details:
- Model: Gemma 3-1B Instruct
- Quantization: Q4_K_M (4-bit, balanced quality)
- Size: ~1.2GB
- VRAM: ~1.2GB (GPU 0)
- Performance: ~85 tokens/sec on T4
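As a quick sanity check, you can compare the downloaded file's size against the ~1.2 GB figure above using only the standard library:
import os
size_gb = os.path.getsize(model_path) / 1024**3
print(f"GGUF file size: {size_gb:.2f} GB")  # roughly 1.2 GB for this Q4_K_M model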
Step 4: Start llama-server¶
Launch llama-server with a split-GPU configuration:
from llamatelemetry.server import ServerManager
server = ServerManager()
server.start_server(
    model_path=model_path,
    gpu_layers=99,           # Load all layers to GPU
    tensor_split="1.0,0.0",  # GPU 0 only (GPU 1 free for analytics)
    flash_attn=1,            # Enable FlashAttention
)
The server starts on http://127.0.0.1:8080 with:
- GPU 0: LLM inference with FlashAttention
- GPU 1: Available for RAPIDS/Graphistry
- Context: 8192 tokens
- API: OpenAI-compatible
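Depending on how quickly the model loads, the server may not accept requests immediately. One way to wait for readiness is to poll it over HTTP; the sketch below assumes the bundled llama-server exposes llama.cpp's standard /health endpoint on the address above:
import time
import urllib.request
def wait_for_server(url="http://127.0.0.1:8080/health", timeout=120):
    """Poll the health endpoint until the server responds or the timeout expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            time.sleep(1)  # model may still be loading
    return False
print("Server ready:", wait_for_server())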
Step 5: Run Inference¶
Test the server with a simple chat completion:
from llamatelemetry.api.client import LlamaCppClient
client = LlamaCppClient(base_url="http://127.0.0.1:8080")
response = client.chat.create(
    messages=[
        {"role": "user", "content": "What is CUDA? Explain in 2 sentences."}
    ],
    max_tokens=100,
    temperature=0.7
)
print("Response:", response.choices[0].message.content)
print(f"Tokens: {response.usage.completion_tokens}")
Expected output:
Response: CUDA (Compute Unified Device Architecture) is a parallel
computing platform and programming model developed by NVIDIA that
allows developers to utilize the massive processing power of their
GPUs for general-purpose computations...
Tokens: 76
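Because the API is OpenAI-compatible (see Step 4), you are not tied to the bundled client. As an illustration, the same request can be sent over raw HTTP with the standard library, assuming llama.cpp's usual /v1/chat/completions route:
import json
import urllib.request
payload = {
    "messages": [{"role": "user", "content": "What is CUDA? One sentence."}],
    "max_tokens": 60,
    "temperature": 0.7,
}
req = urllib.request.Request(
    "http://127.0.0.1:8080/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.load(resp)
print(body["choices"][0]["message"]["content"])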
Step 6: Test Streaming¶
Try streaming responses for real-time generation:
print("Streaming response:\n")
for chunk in client.chat.create(
    messages=[
        {"role": "user", "content": "Write a Python function to calculate factorial."}
    ],
    max_tokens=200,
    temperature=0.3,
    stream=True
):
    if hasattr(chunk, 'choices') and chunk.choices:
        delta = chunk.choices[0].delta
        if hasattr(delta, 'content') and delta.content:
            print(delta.content, end="", flush=True)
print("\n\nDone!")
Step 7: Monitor GPU Usage¶
Check VRAM usage across both GPUs:
!nvidia-smi --query-gpu=index,name,memory.used,memory.total,utilization.gpu --format=csv
Expected output:
index, name, memory.used [MiB], memory.total [MiB], utilization.gpu [%]
0, Tesla T4, 1307 MiB, 15360 MiB, 0 %
1, Tesla T4, 0 MiB, 15360 MiB, 0 %
Note:
- GPU 0: Running llama-server (~1.3GB VRAM)
- GPU 1: Free for RAPIDS/Graphistry visualization
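To log VRAM alongside your own metrics from Python, you can parse the same nvidia-smi query programmatically; a small sketch using subprocess only:
import subprocess
def gpu_memory_used_mib():
    """Return {gpu_index: memory_used_MiB} parsed from nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,memory.used", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    usage = {}
    for line in out.strip().splitlines():
        idx, used = line.split(",")
        usage[int(idx)] = int(used)
    return usage
print(gpu_memory_used_mib())  # e.g. {0: 1307, 1: 0}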
Step 8: Cleanup¶
Stop the server and release resources:
server.stop_server()
print("Server stopped. Resources freed.")
# Verify GPU memory released
!nvidia-smi --query-gpu=index,memory.used --format=csv
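In longer notebooks it is easy to leave the server running after an exception. Wrapping the workflow in try/finally keeps GPU 0 from staying occupied; a sketch using the same ServerManager API as above:
server = ServerManager()
try:
    server.start_server(
        model_path=model_path,
        gpu_layers=99,
        tensor_split="1.0,0.0",
        flash_attn=1,
    )
    # ... run inference here ...
finally:
    server.stop_server()  # always release GPU 0, even if inference raised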
Complete Example¶
Here's the complete workflow in one cell:
# 1. Install
!pip install -q --no-cache-dir --force-reinstall \
git+https://github.com/llamatelemetry/llamatelemetry.git@v0.1.0
# 2. Download model
from huggingface_hub import hf_hub_download
model_path = hf_hub_download(
    repo_id="unsloth/gemma-3-1b-it-GGUF",
    filename="gemma-3-1b-it-Q4_K_M.gguf",
    local_dir="/kaggle/working/models",
)
# 3. Start server
from llamatelemetry.server import ServerManager
server = ServerManager()
server.start_server(
    model_path=model_path,
    gpu_layers=99,
    tensor_split="1.0,0.0",
    flash_attn=1,
)
# 4. Run inference
from llamatelemetry.api.client import LlamaCppClient
client = LlamaCppClient(base_url="http://127.0.0.1:8080")
response = client.chat.create(
    messages=[{"role": "user", "content": "What is CUDA?"}],
    max_tokens=80,
)
print(response.choices[0].message.content)
# 5. Cleanup
server.stop_server()
What's Next?¶
Recommended Learning Path¶
- First Steps - Understand core concepts
- Tutorial 01: Quick Start - Detailed walkthrough
- Tutorial 02: Server Setup - Advanced configuration
- Tutorial 03: Multi-GPU Inference - Dual GPU setup
Observability Features ⭐ NEW¶
Explore the Observability Trilogy (notebooks 14-16):
- Tutorial 14: OpenTelemetry - Full OTel integration
- Tutorial 15: Real-time Monitoring - Live dashboards
- Tutorial 16: Production Stack - Complete observability
Advanced Topics¶
- Architecture Overview - Understand split-GPU design
- API Reference - Complete API documentation
- Performance Guide - Optimization tips
Common Issues¶
Server won't start¶
Check if port 8080 is already in use:
import socket
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
result = sock.connect_ex(('127.0.0.1', 8080))
if result == 0:
    print("Port 8080 is in use")
else:
    print("Port 8080 is free")
sock.close()
Out of memory¶
Use a smaller model or reduce context size:
server.start_server(
    model_path=model_path,
    gpu_layers=99,
    tensor_split="1.0,0.0",
    flash_attn=1,
    ctx_size=4096,  # Reduce from the default 8192
)
Slow inference¶
Enable FlashAttention and ensure GPU 0 is dedicated to inference:
server.start_server(
    model_path=model_path,
    gpu_layers=99,
    tensor_split="1.0,0.0",  # GPU 0 only
    flash_attn=1,            # Enable FlashAttention
)
Need Help?¶
- Troubleshooting Guide - Common issues and solutions
- GitHub Issues - Report bugs
- Discussions - Ask questions