Kaggle Setup Guide¶
Complete guide to setting up llamatelemetry on Kaggle notebooks with dual Tesla T4 GPUs.
Enable Dual T4 GPUs¶
Step 1: Create or Open Notebook¶
- Go to Kaggle Notebooks
- Click New Notebook
- Or open an existing notebook
Step 2: Enable GPU T4 x2¶
- Click Settings (gear icon in right sidebar)
- Under Accelerator, select GPU T4 x2
- Ensure Internet is ON
- Click Save
- Restart the notebook session if prompted
Step 3: Verify GPU Setup¶
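Run nvidia-smi to confirm that both GPUs are visible:

!nvidia-smi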
Expected output: two Tesla T4 devices listed, each with roughly 15 GB of VRAM.
Recommended Workflow¶
Step 1: Install llamatelemetry¶
!pip install -q --no-cache-dir --force-reinstall \
git+https://github.com/llamatelemetry/llamatelemetry.git@v0.1.0
Step 2: Download Small GGUF Model (1B-5B)¶
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="unsloth/gemma-3-1b-it-GGUF",
    filename="gemma-3-1b-it-Q4_K_M.gguf",
    local_dir="/kaggle/working/models",
)
Step 3: Start Server on GPU 0¶
from llamatelemetry.server import ServerManager

server = ServerManager()
server.start_server(
    model_path=model_path,
    gpu_layers=99,
    tensor_split="1.0,0.0",  # GPU 0 only
    flash_attn=1,
)
Step 4: Run Inference¶
from llamatelemetry.api.client import LlamaCppClient

client = LlamaCppClient(base_url="http://127.0.0.1:8080")
response = client.chat.create(
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=50,
)
print(response.choices[0].message.content)
Step 5: Capture Telemetry and Visualize (GPU 1)¶
# Set up OpenTelemetry on GPU 1
from llamatelemetry.telemetry import setup_telemetry

tracer, meter = setup_telemetry(
    service_name="llm-service",
    gpu_id=1,  # Use GPU 1 for analytics
)

# Visualize with Graphistry on GPU 1
# (collected_spans holds the spans captured during inference; see Best
# Practice 2 below for gathering them with an in-memory exporter)
from llamatelemetry.graphistry import TracesGraphistry

g = TracesGraphistry(spans=collected_spans)
g.plot(render=True)
Split-GPU Architecture¶
llamatelemetry is optimized for a split-GPU workflow:
- GPU 0: LLM inference with FlashAttention
- GPU 1: RAPIDS cuGraph + Graphistry visualization
Why Split-GPU?¶
- Maximized utilization: Both GPUs working simultaneously
- No memory contention: Inference and analytics separated
- Better performance: Dedicated GPU resources for each task
- Scalability: Easy to add more visualization workloads
Configuration¶
Always use tensor_split="1.0,0.0" to keep the model entirely on GPU 0:
server.start_server(
    model_path=model_path,
    gpu_layers=99,
    tensor_split="1.0,0.0",  # GPU 0 = 100%, GPU 1 = 0%
    flash_attn=1,
)
Storage Locations¶
Model Storage¶
Store models in /kaggle/working/models/:
model_path = hf_hub_download(
    repo_id="unsloth/gemma-3-1b-it-GGUF",
    filename="gemma-3-1b-it-Q4_K_M.gguf",
    local_dir="/kaggle/working/models",  # Persistent across runs
)
Binary Cache¶
llamatelemetry binaries are cached in /root/.cache/llamatelemetry/.
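To check what has been downloaded (or to free up space), list the cache directory:

!ls -lh /root/.cache/llamatelemetry/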
Output Files¶
Save outputs to /kaggle/working/:
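For example, a run summary written under /kaggle/working/ survives in the notebook's saved output (the filename here is illustrative):

with open("/kaggle/working/run_summary.txt", "w") as f:
    f.write("server: gemma-3-1b-it, ctx_size=8192\n")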
Resource Management¶
VRAM Monitoring¶
Monitor GPU memory usage:
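nvidia-smi reports per-GPU memory directly, which makes it easy to confirm that inference stays on GPU 0:

!nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv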
Recommended Model Sizes¶
For Kaggle's dual T4 setup (about 15 GB of usable VRAM per GPU):
| Model Size | Quantization | VRAM (GPU 0) | Recommended |
|---|---|---|---|
| 1B-2B | Q4_K_M | 1-2GB | ✅ Excellent |
| 3B-5B | Q4_K_M | 2-3.5GB | ✅ Good |
| 7B-8B | Q4_K_M | 4-5GB | ⚠️ OK (limited context) |
| 13B+ | Q4_K_M | 7-9GB | ❌ Not recommended |
Context Size Management¶
Adjust the context size based on the model size:
# Small models (1B-5B): use the full 8192-token context
server.start_server(
    model_path=model_path,
    gpu_layers=99,
    ctx_size=8192,
    flash_attn=1,
)

# Larger models (7B-13B): reduce the context
server.start_server(
    model_path=model_path,
    gpu_layers=99,
    ctx_size=4096,  # Reduced
    flash_attn=1,
)
Best Practices¶
1. Use Explicit OTLP Endpoints¶
When exporting over OTLP/HTTP, always include the full signal path in the endpoint; the gRPC exporter, by contrast, expects only host and port (typically localhost:4317, no path):

from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

exporter = OTLPSpanExporter(
    endpoint="http://localhost:4318/v1/traces",  # Explicit path, OTLP/HTTP default port
)
2. Prefer In-Memory Exporters for Development¶
For testing and visualization, use in-memory exporters:
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter
memory_exporter = InMemorySpanExporter()
processor = SimpleSpanProcessor(memory_exporter)
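The processor still has to be attached to a tracer provider before any spans are recorded, and the spans retrieved afterwards. A minimal sketch using the standard OpenTelemetry SDK (this is one way to produce the collected_spans used elsewhere in this guide):

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider

provider = TracerProvider()
provider.add_span_processor(processor)  # Wire in the in-memory exporter
trace.set_tracer_provider(provider)

# After running some requests, pull everything that was recorded
collected_spans = memory_exporter.get_finished_spans()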
3. Clean Up Resources¶
Always stop the server when done:
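A sketch, assuming ServerManager exposes a stop method (check the llamatelemetry API reference for the exact name):

# Hypothetical method name; consult the API reference
server.stop_server()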
4. Save Important Outputs¶
Save traces and visualizations before ending session:
import json

# Save traces (assumes collected_spans holds JSON-serializable dicts)
with open("/kaggle/working/traces.json", "w") as f:
    json.dump(collected_spans, f)

# Save the Graphistry URL
print(f"Graphistry URL: {g.url}")
Common Issues¶
Issue: Binary Download Fails¶
Solution: Clear the cache and reinstall:
import shutil
shutil.rmtree("/root/.cache/llamatelemetry/", ignore_errors=True)
!pip install -q --no-cache-dir --force-reinstall \
git+https://github.com/llamatelemetry/llamatelemetry.git@v0.1.0
Issue: Out of Memory¶
Solution: Use a smaller model or reduce the context size:
# Use a smaller model
model_path = hf_hub_download(
    repo_id="unsloth/gemma-3-1b-it-GGUF",  # 1B instead of 4B
    filename="gemma-3-1b-it-Q4_K_M.gguf",
    local_dir="/kaggle/working/models",
)

# Or reduce the context size
server.start_server(
    model_path=model_path,
    ctx_size=4096,  # Reduced from 8192
    flash_attn=1,
)
Issue: Graphistry Connection Fails¶
Solution: Ensure GPU 1 is available and internet is enabled:
import graphistry

graphistry.register(
    api=3,
    protocol="https",
    server="hub.graphistry.com",
    token="YOUR_TOKEN",  # Get a free token from graphistry.com
)
Issue: No Spans Captured¶
Solution: Attach the exporter before making requests:
# Set up telemetry FIRST
from llamatelemetry.telemetry import setup_telemetry

tracer, meter = setup_telemetry(service_name="llm-service")

# THEN make requests
client = LlamaCppClient(base_url="http://127.0.0.1:8080")
with tracer.start_as_current_span("request"):
    response = client.chat.create(...)
Performance Tips¶
1. Enable FlashAttention¶
Always enable FlashAttention for a 2-3x speedup:
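Pass flash_attn=1 when starting the server, as in the earlier examples:

server.start_server(
    model_path=model_path,
    gpu_layers=99,
    flash_attn=1,  # Enable FlashAttention
)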
2. Use Q4_K_M Quantization¶
Q4_K_M provides the best balance of quality and speed:
# Good balance
filename="model-Q4_K_M.gguf"
# Faster but lower quality
filename="model-Q4_0.gguf"
# Slower but higher quality
filename="model-Q5_K_M.gguf"
3. Optimize Batch Size¶
For streaming, use smaller batch sizes:
server.start_server(
    model_path=model_path,
    batch_size=512,   # Default
    ubatch_size=128,  # Smaller for streaming
)
Next Steps¶
- First Steps - Understand core concepts
- Tutorial 01: Quick Start - Detailed walkthrough
- Tutorial 03: Multi-GPU Inference - Split-GPU setup
- Troubleshooting - Common issues and solutions