Kaggle Setup Guide

Complete guide to setting up llamatelemetry on Kaggle dual Tesla T4 notebooks.


Enable Dual T4 GPUs

Step 1: Create or Open Notebook

  1. Go to Kaggle Notebooks
  2. Click New Notebook
  3. Or open an existing notebook

Step 2: Enable GPU T4 x2

  1. Click Settings (gear icon in right sidebar)
  2. Under Accelerator, select GPU T4 x2
  3. Ensure Internet is ON
  4. Click Save
  5. Restart the notebook session if prompted

Step 3: Verify GPU Setup

!nvidia-smi -L

Expected output:

GPU 0: Tesla T4 (UUID: GPU-...)
GPU 1: Tesla T4 (UUID: GPU-...)
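
Optionally, run the same check from Python; PyTorch is preinstalled in Kaggle GPU notebooks:

import torch

print(torch.cuda.device_count())                 # expect 2
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))      # expect "Tesla T4" for both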

Quick Start

Step 1: Install llamatelemetry

!pip install -q --no-cache-dir --force-reinstall \
    git+https://github.com/llamatelemetry/llamatelemetry.git@v0.1.0
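
To confirm the install, import the package. The __version__ attribute is an assumption here, so the check falls back gracefully if it is not exposed:

import llamatelemetry

# __version__ may not exist on every build; getattr keeps this check safe
print(getattr(llamatelemetry, "__version__", "installed (no __version__ attribute)"))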

Step 2: Download Small GGUF Model (1B-5B)

from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="unsloth/gemma-3-1b-it-GGUF",
    filename="gemma-3-1b-it-Q4_K_M.gguf",
    local_dir="/kaggle/working/models",
)
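
A quick sanity check that the file landed on disk (a Q4_K_M quant of a 1B model is well under 1 GB):

import os

# Print the resolved path and file size of the downloaded GGUF
print(model_path, f"{os.path.getsize(model_path) / 1e9:.2f} GB")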

Step 3: Start Server on GPU 0

from llamatelemetry.server import ServerManager

server = ServerManager()
server.start_server(
    model_path=model_path,
    gpu_layers=99,
    tensor_split="1.0,0.0",  # GPU 0 only
    flash_attn=1,
)
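
Assuming the managed process is a standard llama.cpp llama-server, it exposes a /health endpoint you can poll before sending requests:

import requests

# Returns 200 with a small JSON body once the model has finished loading
r = requests.get("http://127.0.0.1:8080/health", timeout=5)
print(r.status_code, r.text)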

Step 4: Run Inference

from llamatelemetry.api.client import LlamaCppClient

client = LlamaCppClient(base_url="http://127.0.0.1:8080")
response = client.chat.create(
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=50
)
print(response.choices[0].message.content)
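
The same call handles multi-turn chat by resending the accumulated history; this sketch reuses only the client API shown above and assumes the server keeps no conversation state between calls:

# Build up the message history manually and pass it on every request
history = [{"role": "user", "content": "Name one GPU in this notebook."}]
reply = client.chat.create(messages=history, max_tokens=50)
history.append({"role": "assistant", "content": reply.choices[0].message.content})
history.append({"role": "user", "content": "And the second one?"})
reply = client.chat.create(messages=history, max_tokens=50)
print(reply.choices[0].message.content)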

Step 5: Capture Telemetry and Visualize (GPU 1)

# Set up OpenTelemetry on GPU 1
from llamatelemetry.telemetry import setup_telemetry

tracer, meter = setup_telemetry(
    service_name="llm-service",
    gpu_id=1  # Use GPU 1 for analytics
)

# Visualize with Graphistry on GPU 1
# collected_spans: finished spans gathered from your exporter,
# e.g. InMemorySpanExporter.get_finished_spans() (see Best Practices below)
from llamatelemetry.graphistry import TracesGraphistry
g = TracesGraphistry(spans=collected_spans)
g.plot(render=True)

Split-GPU Architecture

llamatelemetry is optimized for a split-GPU workflow:

  • GPU 0: LLM inference with FlashAttention
  • GPU 1: RAPIDS cuGraph + Graphistry visualization

Why Split-GPU?

  1. Maximized utilization: Both GPUs working simultaneously
  2. No memory contention: Inference and analytics separated
  3. Better performance: Dedicated GPU resources for each task
  4. Scalability: Easy to add more visualization workloads
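
A minimal sketch of the split in practice, assuming CuPy is available (it ships with Kaggle's GPU images): the server process owns GPU 0, while analytics work in the notebook is pinned to GPU 1.

import cupy as cp

with cp.cuda.Device(1):                              # pin this work to GPU 1
    x = cp.ones((1000, 1000)) @ cp.ones((1000, 1000))
    print("analytics result lives on:", x.device)    # -> <CUDA Device 1>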

Configuration

Always use tensor_split="1.0,0.0" to keep the model entirely on GPU 0:

server.start_server(
    model_path=model_path,
    gpu_layers=99,
    tensor_split="1.0,0.0",  # GPU 0 = 100%, GPU 1 = 0%
    flash_attn=1,
)

Storage Locations

Model Storage

Store models in /kaggle/working/models/:

model_path = hf_hub_download(
    repo_id="unsloth/gemma-3-1b-it-GGUF",
    filename="gemma-3-1b-it-Q4_K_M.gguf",
    local_dir="/kaggle/working/models",  # Persistent across runs
)

Binary Cache

llamatelemetry binaries are cached in /root/.cache/llamatelemetry/.
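
To inspect what has been cached (the directory is empty until the binaries have been fetched):

import os

cache_dir = "/root/.cache/llamatelemetry"
for root, _, files in os.walk(cache_dir):
    for name in files:
        path = os.path.join(root, name)
        print(path, f"{os.path.getsize(path) / 1e6:.1f} MB")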

Output Files

Save outputs to /kaggle/working/:

output_path = "/kaggle/working/traces.json"

Resource Management

VRAM Monitoring

Monitor GPU memory usage:

!nvidia-smi --query-gpu=index,memory.used,memory.total,utilization.gpu --format=csv

For Kaggle's dual T4s (about 15 GB of usable VRAM each):

Model Size   Quantization   VRAM (GPU 0)   Recommended
1B-2B        Q4_K_M         1-2 GB         ✅ Excellent
3B-5B        Q4_K_M         2-3.5 GB       ✅ Good
7B-8B        Q4_K_M         4-5 GB         ⚠️ OK (limited context)
13B+         Q4_K_M         7-9 GB         ❌ Not recommended
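
To check headroom from Python instead of the shell, a short NVML sketch (assumes the nvidia-ml-py / pynvml package is available; pip install it if not):

import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU {i}: {mem.used / 1e9:.1f} / {mem.total / 1e9:.1f} GB used")
pynvml.nvmlShutdown()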

Context Size Management

Adjust context size based on model:

# Small models (1B-5B): Use full 8192 context
server.start_server(
    model_path=model_path,
    gpu_layers=99,
    ctx_size=8192,
    flash_attn=1,
)

# Larger models (7B-13B): Reduce context
server.start_server(
    model_path=model_path,
    gpu_layers=99,
    ctx_size=4096,  # Reduced
    flash_attn=1,
)

Best Practices

1. Use Explicit OTLP Endpoints

Specify OTLP endpoints explicitly instead of relying on environment defaults. Note that the gRPC exporter takes host and port only, while the HTTP exporter expects the full /v1/traces path:

from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

exporter = OTLPSpanExporter(
    endpoint="http://localhost:4317",  # gRPC: host and port, no URL path
    insecure=True,
)

2. Prefer In-Memory Exporters for Development

For testing and visualization, use in-memory exporters:

from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter

# Collect finished spans in memory so they can be inspected or visualized
# without running a collector
memory_exporter = InMemorySpanExporter()
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(memory_exporter))
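
Once the processor is attached, register the provider and pull spans straight from the exporter (standard OpenTelemetry SDK calls; provider and memory_exporter come from the snippet above):

from opentelemetry import trace

trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("demo-span"):
    pass  # any instrumented work goes here

collected_spans = memory_exporter.get_finished_spans()
print(f"{len(collected_spans)} span(s) captured")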

3. Clean Up Resources

Always stop the server when done:

server.stop_server()
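
A simple pattern that guarantees cleanup even if a cell raises, reusing only the calls shown earlier:

# Ensure the server is torn down even when inference fails
try:
    response = client.chat.create(
        messages=[{"role": "user", "content": "Ping"}],
        max_tokens=10,
    )
    print(response.choices[0].message.content)
finally:
    server.stop_server()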

4. Save Important Outputs

Save traces and visualizations before ending session:

import json

# Save traces (OpenTelemetry ReadableSpan objects are not JSON-serializable
# directly; convert each one with its to_json() method first)
with open("/kaggle/working/traces.json", "w") as f:
    json.dump([json.loads(span.to_json()) for span in collected_spans], f)

# Save Graphistry URL
print(f"Graphistry URL: {g.url}")

Common Issues

Issue: Binary Download Fails

Solution: Clear cache and retry:

import shutil
shutil.rmtree("/root/.cache/llamatelemetry/", ignore_errors=True)

!pip install -q --no-cache-dir --force-reinstall \
    git+https://github.com/llamatelemetry/llamatelemetry.git@v0.1.0

Issue: Out of Memory

Solution: Use smaller model or reduce context:

# Use smaller model
model_path = hf_hub_download(
    repo_id="unsloth/gemma-3-1b-it-GGUF",  # 1B instead of 4B
    filename="gemma-3-1b-it-Q4_K_M.gguf",
    local_dir="/kaggle/working/models",
)

# Or reduce context size
server.start_server(
    model_path=model_path,
    ctx_size=4096,  # Reduce from 8192
    flash_attn=1,
)

Issue: Graphistry Connection Fails

Solution: Ensure GPU 1 is available and internet is enabled:

import graphistry
graphistry.register(
    api=3,
    protocol="https",
    server="hub.graphistry.com",
    token="YOUR_TOKEN"  # Get free token from graphistry.com
)
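
To confirm registration before plotting real traces, render a tiny placeholder graph (standard PyGraphistry API; the two-edge DataFrame is just dummy data):

import pandas as pd
import graphistry

# plot(render=False) returns a shareable URL when authentication succeeded
edges = pd.DataFrame({"src": ["a", "b"], "dst": ["b", "c"]})
url = graphistry.bind(source="src", destination="dst").edges(edges).plot(render=False)
print(url)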

Issue: No Spans Captured

Solution: Attach exporter before making requests:

# Set up telemetry FIRST
from llamatelemetry.telemetry import setup_telemetry
tracer, meter = setup_telemetry(service_name="llm-service")

# THEN make requests
client = LlamaCppClient(base_url="http://127.0.0.1:8080")
with tracer.start_as_current_span("request"):
    response = client.chat.create(...)

Performance Tips

1. Enable FlashAttention

Always enable FlashAttention for 2-3x speedup:

server.start_server(
    model_path=model_path,
    flash_attn=1,  # Enable
)

2. Use Q4_K_M Quantization

Q4_K_M provides best balance of quality and speed:

# Good balance
filename="model-Q4_K_M.gguf"

# Faster but lower quality
filename="model-Q4_0.gguf"

# Slower but higher quality
filename="model-Q5_K_M.gguf"

3. Optimize Batch Size

For streaming, use smaller batch sizes:

server.start_server(
    model_path=model_path,
    batch_size=512,  # Default
    ubatch_size=128,  # Smaller for streaming
)

Next Steps