Quick Start¶
Get up and running with llamatelemetry on Kaggle dual T4 GPUs in under 5 minutes.
Prerequisites¶
- Kaggle notebook with GPU T4 x2 enabled
- Internet enabled
- Python 3.11+
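To confirm these prerequisites programmatically before installing, a minimal pre-flight check using only the standard library (no llamatelemetry imports) could look like:
import shutil
import sys
# Python 3.11+ is required
assert sys.version_info >= (3, 11), f"Python 3.11+ required, found {sys.version.split()[0]}"
# nvidia-smi is on PATH once the GPU T4 x2 accelerator is enabled
assert shutil.which("nvidia-smi") is not None, "nvidia-smi not found: enable the GPU T4 x2 accelerator"
print("Prerequisites look good.")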
Step 1: Install llamatelemetry¶
!pip install -q --no-cache-dir --force-reinstall \
git+https://github.com/llamatelemetry/llamatelemetry.git@v0.1.0
This will:
- Install the Python SDK
- Download pre-built CUDA binaries (~961 MB) on first import
- Configure FlashAttention support
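If you want to confirm the install without triggering the binary download, you can query the package metadata instead of importing it; this sketch assumes the distribution is published under the name llamatelemetry:
from importlib.metadata import PackageNotFoundError, version
try:
    # Distribution name assumed to match the package name
    print("llamatelemetry installed, version:", version("llamatelemetry"))
except PackageNotFoundError:
    print("llamatelemetry is not installed")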
Step 2: Verify GPU Environment¶
import subprocess
# Check GPUs
result = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
print(result.stdout)
# Verify llamatelemetry
import llamatelemetry
from llamatelemetry.api.multigpu import gpu_count
print(f"llamatelemetry version: {llamatelemetry.__version__}")
print(f"GPUs detected: {gpu_count()}")
Expected output:
GPU 0: Tesla T4 (UUID: GPU-...)
GPU 1: Tesla T4 (UUID: GPU-...)
llamatelemetry version: 0.1.0
GPUs detected: 2
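If your workflow depends on both GPUs, it helps to fail fast when the notebook was started without the dual-T4 accelerator. A small guard built on the gpu_count() call above:
n = gpu_count()
if n < 2:
    raise RuntimeError(
        f"Expected 2 GPUs for the split-GPU workflow, found {n}. "
        "Enable the 'GPU T4 x2' accelerator in the notebook settings."
    )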
Step 3: Download a GGUF Model¶
Download a small model optimized for T4 GPUs:
from huggingface_hub import hf_hub_download
model_path = hf_hub_download(
    repo_id="unsloth/gemma-3-1b-it-GGUF",
    filename="gemma-3-1b-it-Q4_K_M.gguf",
    local_dir="/kaggle/working/models",
)
print(f"Model downloaded: {model_path}")
Model details:
- Model: Gemma 3-1B Instruct
- Quantization: Q4_K_M (4-bit, balanced quality)
- Size: ~1.2GB
- VRAM: ~1.2GB (GPU 0)
- Performance: ~85 tokens/sec on T4
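As a quick sanity check, you can compare the downloaded file's size against the ~1.2 GB figure above using only the standard library:
import os
size_gb = os.path.getsize(model_path) / 1024**3
print(f"GGUF file size: {size_gb:.2f} GB")  # roughly 1.2 GB for this Q4_K_M model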
Step 4: Start llama-server¶
Launch llama-server with a split-GPU configuration:
from llamatelemetry.server import ServerManager
server = ServerManager()
server.start_server(
    model_path=model_path,
    gpu_layers=99,           # Load all layers to GPU
    tensor_split="1.0,0.0",  # GPU 0 only (GPU 1 free for analytics)
    flash_attn=1,            # Enable FlashAttention
)
The server starts on http://127.0.0.1:8080 with:
- GPU 0: LLM inference with FlashAttention
- GPU 1: Available for RAPIDS/Graphistry
- Context: 8192 tokens
- API: OpenAI-compatible
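Depending on how quickly the model loads, the server may not accept requests immediately. One way to wait for readiness is to poll it over HTTP; the sketch below assumes the bundled llama-server exposes llama.cpp's standard /health endpoint on the address above:
import time
import urllib.request
def wait_for_server(url="http://127.0.0.1:8080/health", timeout=120):
    """Poll the health endpoint until the server responds or the timeout expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            time.sleep(1)  # model may still be loading
    return False
print("Server ready:", wait_for_server())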
Step 5: Run Inference¶
Test the server with a simple chat completion:
from llamatelemetry.api.client import LlamaCppClient
client = LlamaCppClient(base_url="http://127.0.0.1:8080")
response = client.chat.create(
    messages=[
        {"role": "user", "content": "What is CUDA? Explain in 2 sentences."}
    ],
    max_tokens=100,
    temperature=0.7
)
print("Response:", response.choices[0].message.content)
print(f"Tokens: {response.usage.completion_tokens}")
Expected output:
Response: CUDA (Compute Unified Device Architecture) is a parallel
computing platform and programming model developed by NVIDIA that
allows developers to utilize the massive processing power of their
GPUs for general-purpose computations...
Tokens: 76
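Because the API is OpenAI-compatible (see Step 4), you are not tied to the bundled client. As an illustration, the same request can be sent over raw HTTP with the standard library, assuming llama.cpp's usual /v1/chat/completions route:
import json
import urllib.request
payload = {
    "messages": [{"role": "user", "content": "What is CUDA? One sentence."}],
    "max_tokens": 60,
    "temperature": 0.7,
}
req = urllib.request.Request(
    "http://127.0.0.1:8080/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.load(resp)
print(body["choices"][0]["message"]["content"])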
Step 6: Test Streaming¶
Try streaming responses for real-time generation:
print("Streaming response:\n")
for chunk in client.chat.create(
    messages=[
        {"role": "user", "content": "Write a Python function to calculate factorial."}
    ],
    max_tokens=200,
    temperature=0.3,
    stream=True
):
    if hasattr(chunk, 'choices') and chunk.choices:
        delta = chunk.choices[0].delta
        if hasattr(delta, 'content') and delta.content:
            print(delta.content, end="", flush=True)
print("\n\nDone!")
Step 7: Monitor GPU Usage¶
Check VRAM usage across both GPUs:
!nvidia-smi --query-gpu=index,name,memory.used,memory.total,utilization.gpu --format=csv
Expected output:
index, name, memory.used [MiB], memory.total [MiB], utilization.gpu [%]
0, Tesla T4, 1307 MiB, 15360 MiB, 0 %
1, Tesla T4, 0 MiB, 15360 MiB, 0 %
Note:
- GPU 0: Running llama-server (~1.3GB VRAM)
- GPU 1: Free for RAPIDS/Graphistry visualization
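To log VRAM alongside your own metrics from Python, you can parse the same nvidia-smi query programmatically; a small sketch using subprocess only:
import subprocess
def gpu_memory_used_mib():
    """Return {gpu_index: memory_used_MiB} parsed from nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,memory.used", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    usage = {}
    for line in out.strip().splitlines():
        idx, used = line.split(",")
        usage[int(idx)] = int(used)
    return usage
print(gpu_memory_used_mib())  # e.g. {0: 1307, 1: 0}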
Step 8: Cleanup¶
Stop the server and release resources:
server.stop_server()
print("Server stopped. Resources freed.")
# Verify GPU memory released
!nvidia-smi --query-gpu=index,memory.used --format=csv
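In longer notebooks it is easy to leave the server running after an exception. Wrapping the workflow in try/finally keeps GPU 0 from staying occupied; a sketch using the same ServerManager API as above:
server = ServerManager()
try:
    server.start_server(
        model_path=model_path,
        gpu_layers=99,
        tensor_split="1.0,0.0",
        flash_attn=1,
    )
    # ... run inference here ...
finally:
    server.stop_server()  # always release GPU 0, even if inference raised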
Complete Example¶
Here's the complete workflow in one cell:
# 1. Install
!pip install -q --no-cache-dir --force-reinstall \
git+https://github.com/llamatelemetry/llamatelemetry.git@v0.1.0
# 2. Download model
from huggingface_hub import hf_hub_download
model_path = hf_hub_download(
    repo_id="unsloth/gemma-3-1b-it-GGUF",
    filename="gemma-3-1b-it-Q4_K_M.gguf",
    local_dir="/kaggle/working/models",
)
# 3. Start server
from llamatelemetry.server import ServerManager
server = ServerManager()
server.start_server(
    model_path=model_path,
    gpu_layers=99,
    tensor_split="1.0,0.0",
    flash_attn=1,
)
# 4. Run inference
from llamatelemetry.api.client import LlamaCppClient
client = LlamaCppClient(base_url="http://127.0.0.1:8080")
response = client.chat.create(
    messages=[{"role": "user", "content": "What is CUDA?"}],
    max_tokens=80,
)
print(response.choices[0].message.content)
# 5. Cleanup
server.stop_server()
What's Next?¶
Recommended Learning Path¶
- First Steps - Understand core concepts
- Tutorial 01: Quick Start - Detailed walkthrough
- Tutorial 02: Server Setup - Advanced configuration
- Tutorial 03: Multi-GPU Inference - Dual GPU setup
Observability Features ⭐ NEW¶
Explore the Observability Trilogy (notebooks 14-16):
- Tutorial 14: OpenTelemetry - Full OTel integration
- Tutorial 15: Real-time Monitoring - Live dashboards
- Tutorial 16: Production Stack - Complete observability
Advanced Topics¶
- Architecture Overview - Understand split-GPU design
- API Reference - Complete API documentation
- Performance Guide - Optimization tips
Common Issues¶
Server won't start¶
Check if port 8080 is already in use:
import socket
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
result = sock.connect_ex(('127.0.0.1', 8080))
if result == 0:
    print("Port 8080 is in use")
else:
    print("Port 8080 is free")
sock.close()
Out of memory¶
Use a smaller model or reduce context size:
server.start_server(
    model_path=model_path,
    gpu_layers=99,
    tensor_split="1.0,0.0",
    flash_attn=1,
    ctx_size=4096,  # Reduce from the default 8192
)
Slow inference¶
Enable FlashAttention and ensure GPU 0 is dedicated to inference:
server.start_server(
    model_path=model_path,
    gpu_layers=99,
    tensor_split="1.0,0.0",  # GPU 0 only
    flash_attn=1,            # Enable FlashAttention
)
Need Help?¶
- Troubleshooting Guide - Common issues and solutions
- GitHub Issues - Report bugs
- Discussions - Ask questions