Kaggle Setup Guide¶
Complete guide to setting up llamatelemetry on Kaggle notebooks with dual Tesla T4 GPUs.
Enable Dual T4 GPUs¶
Step 1: Create or Open Notebook¶
- Go to Kaggle Notebooks
- Click New Notebook
- Or open an existing notebook
Step 2: Enable GPU T4 x2¶
- Click Settings (gear icon in right sidebar)
- Under Accelerator, select GPU T4 x2
- Ensure Internet is ON
- Click Save
- Restart the notebook session if prompted
Step 3: Verify GPU Setup¶
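Run nvidia-smi to confirm that both GPUs are visible:

!nvidia-smi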
Expected output: two Tesla T4 devices listed, each with roughly 15 GB of VRAM.
Recommended Workflow¶
Step 1: Install llamatelemetry¶
!pip install -q --no-cache-dir --force-reinstall \
git+https://github.com/llamatelemetry/llamatelemetry.git@v0.1.0
Step 2: Download Small GGUF Model (1B-5B)¶
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="unsloth/gemma-3-1b-it-GGUF",
    filename="gemma-3-1b-it-Q4_K_M.gguf",
    local_dir="/kaggle/working/models",
)
Step 3: Start Server on GPU 0¶
from llamatelemetry.server import ServerManager

server = ServerManager()
server.start_server(
    model_path=model_path,
    gpu_layers=99,
    tensor_split="1.0,0.0",  # GPU 0 only
    flash_attn=1,
)
Step 4: Run Inference¶
from llamatelemetry.api.client import LlamaCppClient

client = LlamaCppClient(base_url="http://127.0.0.1:8080")
response = client.chat.create(
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=50,
)
print(response.choices[0].message.content)
Step 5: Capture Telemetry and Visualize (GPU 1)¶
# Set up OpenTelemetry on GPU 1
from llamatelemetry.telemetry import setup_telemetry

tracer, meter = setup_telemetry(
    service_name="llm-service",
    gpu_id=1,  # Use GPU 1 for analytics
)

# Visualize with Graphistry on GPU 1
# (collected_spans holds the spans captured during inference; see Best
# Practice 2 below for gathering them with an in-memory exporter)
from llamatelemetry.graphistry import TracesGraphistry

g = TracesGraphistry(spans=collected_spans)
g.plot(render=True)
Split-GPU Architecture¶
llamatelemetry is optimized for a split-GPU workflow:
- GPU 0: LLM inference with FlashAttention
- GPU 1: RAPIDS cuGraph + Graphistry visualization
Why Split-GPU?¶
- Maximized utilization: Both GPUs working simultaneously
- No memory contention: Inference and analytics separated
- Better performance: Dedicated GPU resources for each task
- Scalability: Easy to add more visualization workloads
Configuration¶
Always use tensor_split="1.0,0.0" to keep the model entirely on GPU 0:
server.start_server(
    model_path=model_path,
    gpu_layers=99,
    tensor_split="1.0,0.0",  # GPU 0 = 100%, GPU 1 = 0%
    flash_attn=1,
)
Storage Locations¶
Model Storage¶
Store models in /kaggle/working/models/:
model_path = hf_hub_download(
    repo_id="unsloth/gemma-3-1b-it-GGUF",
    filename="gemma-3-1b-it-Q4_K_M.gguf",
    local_dir="/kaggle/working/models",  # Persistent across runs
)
Binary Cache¶
llamatelemetry binaries are cached in /root/.cache/llamatelemetry/.
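To check what has been downloaded (or to free up space), list the cache directory:

!ls -lh /root/.cache/llamatelemetry/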
Output Files¶
Save outputs to /kaggle/working/:
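For example, a run summary written under /kaggle/working/ survives in the notebook's saved output (the filename here is illustrative):

with open("/kaggle/working/run_summary.txt", "w") as f:
    f.write("server: gemma-3-1b-it, ctx_size=8192\n")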
Resource Management¶
VRAM Monitoring¶
Monitor GPU memory usage:
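nvidia-smi reports per-GPU memory directly, which makes it easy to confirm that inference stays on GPU 0:

!nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv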
Recommended Model Sizes¶
For Kaggle's dual T4 setup (about 15 GB of usable VRAM per GPU):
| Model Size | Quantization | VRAM (GPU 0) | Recommended |
|---|---|---|---|
| 1B-2B | Q4_K_M | 1-2GB | ✅ Excellent |
| 3B-5B | Q4_K_M | 2-3.5GB | ✅ Good |
| 7B-8B | Q4_K_M | 4-5GB | ⚠️ OK (limited context) |
| 13B+ | Q4_K_M | 7-9GB | ❌ Not recommended |
Context Size Management¶
Adjust the context size based on the model size:
# Small models (1B-5B): use the full 8192-token context
server.start_server(
    model_path=model_path,
    gpu_layers=99,
    ctx_size=8192,
    flash_attn=1,
)

# Larger models (7B-13B): reduce the context
server.start_server(
    model_path=model_path,
    gpu_layers=99,
    ctx_size=4096,  # Reduced
    flash_attn=1,
)
Best Practices¶
1. Use Explicit OTLP Endpoints¶
When exporting over OTLP/HTTP, always include the full signal path in the endpoint; the gRPC exporter, by contrast, expects only host and port (typically localhost:4317, no path):

from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

exporter = OTLPSpanExporter(
    endpoint="http://localhost:4318/v1/traces",  # Explicit path, OTLP/HTTP default port
)
2. Prefer In-Memory Exporters for Development¶
For testing and visualization, use in-memory exporters:
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter
memory_exporter = InMemorySpanExporter()
processor = SimpleSpanProcessor(memory_exporter)
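The processor still has to be attached to a tracer provider before any spans are recorded, and the spans retrieved afterwards. A minimal sketch using the standard OpenTelemetry SDK (this is one way to produce the collected_spans used elsewhere in this guide):

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider

provider = TracerProvider()
provider.add_span_processor(processor)  # Wire in the in-memory exporter
trace.set_tracer_provider(provider)

# After running some requests, pull everything that was recorded
collected_spans = memory_exporter.get_finished_spans()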
3. Clean Up Resources¶
Always stop the server when done:
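A sketch, assuming ServerManager exposes a stop method (check the llamatelemetry API reference for the exact name):

# Hypothetical method name; consult the API reference
server.stop_server()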
4. Save Important Outputs¶
Save traces and visualizations before ending session:
import json

# Save traces (assumes collected_spans holds JSON-serializable dicts)
with open("/kaggle/working/traces.json", "w") as f:
    json.dump(collected_spans, f)

# Save the Graphistry URL
print(f"Graphistry URL: {g.url}")
Common Issues¶
Issue: Binary Download Fails¶
Solution: Clear the cache and reinstall:
import shutil
shutil.rmtree("/root/.cache/llamatelemetry/", ignore_errors=True)
!pip install -q --no-cache-dir --force-reinstall \
git+https://github.com/llamatelemetry/llamatelemetry.git@v0.1.0
Issue: Out of Memory¶
Solution: Use a smaller model or reduce the context size:
# Use a smaller model
model_path = hf_hub_download(
    repo_id="unsloth/gemma-3-1b-it-GGUF",  # 1B instead of 4B
    filename="gemma-3-1b-it-Q4_K_M.gguf",
    local_dir="/kaggle/working/models",
)

# Or reduce the context size
server.start_server(
    model_path=model_path,
    ctx_size=4096,  # Reduced from 8192
    flash_attn=1,
)
Issue: Graphistry Connection Fails¶
Solution: Ensure GPU 1 is available and internet is enabled:
import graphistry

graphistry.register(
    api=3,
    protocol="https",
    server="hub.graphistry.com",
    token="YOUR_TOKEN",  # Get a free token from graphistry.com
)
Issue: No Spans Captured¶
Solution: Attach the exporter before making requests:
# Set up telemetry FIRST
from llamatelemetry.telemetry import setup_telemetry

tracer, meter = setup_telemetry(service_name="llm-service")

# THEN make requests
client = LlamaCppClient(base_url="http://127.0.0.1:8080")
with tracer.start_as_current_span("request"):
    response = client.chat.create(...)
Performance Tips¶
1. Enable FlashAttention¶
Always enable FlashAttention for a 2-3x speedup:
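Pass flash_attn=1 when starting the server, as in the earlier examples:

server.start_server(
    model_path=model_path,
    gpu_layers=99,
    flash_attn=1,  # Enable FlashAttention
)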
2. Use Q4_K_M Quantization¶
Q4_K_M provides the best balance of quality and speed:
# Good balance
filename="model-Q4_K_M.gguf"
# Faster but lower quality
filename="model-Q4_0.gguf"
# Slower but higher quality
filename="model-Q5_K_M.gguf"
3. Optimize Batch Size¶
For streaming, use smaller batch sizes:
server.start_server(
    model_path=model_path,
    batch_size=512,   # Default
    ubatch_size=128,  # Smaller for streaming
)
Next Steps¶
- First Steps - Understand core concepts
- Tutorial 01: Quick Start - Detailed walkthrough
- Tutorial 03: Multi-GPU Inference - Split-GPU setup
- Troubleshooting - Common issues and solutions