First Steps¶
Now that llamatelemetry is installed, this guide walks you through the core concepts and the basic workflow.
Core Concepts¶
1. Split-GPU Architecture¶
llamatelemetry uses a split-GPU design to maximize hardware utilization on Kaggle's dual NVIDIA T4 GPUs (a quick device check is sketched after this list):
- GPU 0 (15GB): Runs llama-server for LLM inference
- GPU 1 (15GB): Runs RAPIDS cuGraph + Graphistry for analytics
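A quick way to confirm that the notebook actually sees both T4s (and to watch which GPU each workload lands on) is to query NVML directly. A minimal sketch using the pynvml package, independent of llamatelemetry:
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    name = pynvml.nvmlDeviceGetName(handle)
    if isinstance(name, bytes):  # older pynvml versions return bytes
        name = name.decode()
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    # On Kaggle you should see two Tesla T4s, each with roughly 15 GB usable
    print(f"GPU {i}: {name}, {mem.free / 1024**2:.0f} MB free of {mem.total / 1024**2:.0f} MB")
pynvml.nvmlShutdown()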
2. GGUF Models¶
llamatelemetry uses GGUF (GPT-Generated Unified Format) models:
- Quantized weights (1-8 bits)
- Optimized for CPU/GPU inference
- 29 quantization types available
- Recommended: Q4_K_M for balance
3. OpenTelemetry Integration¶
Full observability with OpenTelemetry (a span-attribute sketch follows this list):
- Traces: Request flow and latency
- Metrics: Tokens/sec, VRAM, GPU temperature
- Semantic conventions: LLM-specific attributes
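As a rough illustration of what LLM-specific span attributes look like, the sketch below sets a few attributes by hand using only the OpenTelemetry API. The attribute names follow the incubating OpenTelemetry GenAI semantic conventions and are an assumption here; the attributes llamatelemetry actually emits may differ.
from opentelemetry import trace

tracer = trace.get_tracer("example")

with tracer.start_as_current_span("chat") as span:
    # Names below follow the incubating GenAI semantic conventions;
    # adjust to whatever your pipeline actually records.
    span.set_attribute("gen_ai.system", "llama.cpp")
    span.set_attribute("gen_ai.request.model", "gemma-3-1b-it-Q4_K_M")
    span.set_attribute("gen_ai.request.max_tokens", 50)
    # Token counts (e.g. gen_ai.usage.output_tokens) would come from the response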
4. llama-server¶
Built-in llama.cpp server with:
- OpenAI-compatible API (see the raw-HTTP sketch after this list)
- FlashAttention v2 support
- Multi-GPU tensor parallelism
- Streaming responses
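Because the server speaks the OpenAI chat-completions protocol, you can also talk to it with plain HTTP, which is handy for debugging. A minimal sketch with requests, assuming the default address 127.0.0.1:8080 used throughout this guide:
import requests

resp = requests.post(
    "http://127.0.0.1:8080/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 50,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])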
Basic Workflow¶
1. Install and Verify¶
# Install
!pip install -q --no-cache-dir --force-reinstall \
git+https://github.com/llamatelemetry/llamatelemetry.git@v0.1.0
# Verify
import llamatelemetry
print(llamatelemetry.__version__)
2. Download Model¶
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="unsloth/gemma-3-1b-it-GGUF",
    filename="gemma-3-1b-it-Q4_K_M.gguf",
    local_dir="/kaggle/working/models",
)
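Optionally, confirm the file landed where you expect before starting the server; a small sanity check using only the standard library:
import os

# hf_hub_download returns the absolute path of the downloaded file
print(model_path)
print(f"Size: {os.path.getsize(model_path) / 1024**2:.0f} MB")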
3. Start Server¶
from llamatelemetry.server import ServerManager

server = ServerManager()
server.start_server(
    model_path=model_path,
    gpu_layers=99,           # All layers on GPU
    tensor_split="1.0,0.0",  # GPU 0 only
    flash_attn=1,            # Enable FlashAttention
)
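ServerManager may already block until the model is loaded, but if you want an explicit readiness check, llama-server exposes a /health endpoint you can poll. A minimal sketch, assuming the default 127.0.0.1:8080:
import time
import requests

for _ in range(60):
    try:
        if requests.get("http://127.0.0.1:8080/health", timeout=2).status_code == 200:
            print("llama-server is ready")
            break
    except requests.ConnectionError:
        pass  # server process still starting up
    time.sleep(1)
else:
    raise RuntimeError("llama-server did not become healthy in time")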
4. Run Inference¶
from llamatelemetry.api.client import LlamaCppClient

client = LlamaCppClient(base_url="http://127.0.0.1:8080")
response = client.chat.create(
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=50,
)
print(response.choices[0].message.content)
5. Add Observability¶
from llamatelemetry.telemetry import setup_telemetry

tracer, meter = setup_telemetry(service_name="my-llm-service")

with tracer.start_as_current_span("chat") as span:
    response = client.chat.create(
        messages=[{"role": "user", "content": "Hello!"}],
        max_tokens=50,
    )
    span.set_attribute("llm.tokens", response.usage.completion_tokens)
6. Visualize with Graphistry¶
from llamatelemetry.graphistry import TracesGraphistry

# Collect finished spans (memory_exporter is an in-memory span exporter
# set up alongside setup_telemetry; see the wiring sketch below)
spans = memory_exporter.get_finished_spans()

# Visualize on GPU 1
g = TracesGraphistry(spans=spans)
g.plot(render=True)
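The snippet above assumes a memory_exporter object that keeps finished spans around. If setup_telemetry does not already provide one, you can wire an in-memory exporter yourself with the OpenTelemetry SDK; a sketch of that wiring (how llamatelemetry configures its exporters internally may differ):
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter

# Keep finished spans in memory so they can be handed to Graphistry later
memory_exporter = InMemorySpanExporter()
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(memory_exporter))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("example")
with tracer.start_as_current_span("demo"):
    pass

print(len(memory_exporter.get_finished_spans()))  # -> 1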
Key Classes and Functions¶
ServerManager¶
Manages llama-server lifecycle:
from llamatelemetry.server import ServerManager
server = ServerManager()
server.start_server(model_path=path, gpu_layers=99)
server.stop_server()
LlamaCppClient¶
OpenAI-compatible client:
from llamatelemetry.api.client import LlamaCppClient

client = LlamaCppClient(base_url="http://127.0.0.1:8080")

# Chat completion
response = client.chat.create(messages=[...])

# Streaming
for chunk in client.chat.create(messages=[...], stream=True):
    print(chunk.choices[0].delta.content, end="")
Telemetry Setup¶
Initialize OpenTelemetry:
from llamatelemetry.telemetry import setup_telemetry

tracer, meter = setup_telemetry(
    service_name="my-service",
    otlp_endpoint="http://localhost:4317",
)
GPU Monitoring¶
Monitor GPU metrics:
from llamatelemetry.telemetry.gpu_monitor import GPUMonitor
monitor = GPUMonitor(gpu_id=0, interval=1.0)
monitor.start()
# Get metrics
metrics = monitor.get_metrics()
print(f"VRAM: {metrics['memory_used_mb']} MB")
print(f"Temp: {metrics['temperature_c']}°C")
Understanding Quantization¶
What is Quantization?¶
Quantization reduces model size by storing each weight in fewer bits (a size estimate is sketched after this list):
- FP16: 16-bit floats (baseline)
- Q8: 8-bit quantization (~50% size)
- Q4: 4-bit quantization (~25% size)
- Q2: 2-bit quantization (~12.5% size)
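A back-of-the-envelope estimate makes the trade-off concrete: file size is roughly parameter count × bits per weight / 8, plus a small overhead for metadata and tensors kept at higher precision. A sketch using approximate bits-per-weight figures, which vary by quantization scheme and model:
PARAMS = 1e9  # a 1B-parameter model

# Approximate bits per weight; real GGUF files also carry metadata and
# some higher-precision tensors, so treat these numbers as rough.
for name, bpw in [("FP16", 16), ("Q8_0", 8.5), ("Q4_K_M", 4.8), ("Q2_K", 2.6)]:
    size_gb = PARAMS * bpw / 8 / 1024**3
    print(f"{name:>7}: ~{size_gb:.2f} GB")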
K-Quants vs I-Quants¶
K-Quants (recommended):
- Better quality
- Layer-specific quantization
- Examples: Q4_K_M, Q5_K_S

I-Quants (newer):
- Even smaller size
- Importance-weighted quantization
- Examples: IQ4_XS, IQ3_M
Recommended Quantizations¶
| Use Case | Quantization | Quality | Speed |
|---|---|---|---|
| Production | Q4_K_M | Good | Fast |
| High quality | Q5_K_M | Better | Medium |
| Maximum speed | Q4_0 | OK | Fastest |
| Smallest size | IQ3_M | Lower | Fast |
Next Steps¶
Recommended Learning Path¶
- Quick Start - Get running in 5 minutes
- Tutorial 01: Quick Start - Detailed walkthrough
- Tutorial 02: Server Setup - Advanced configuration
- Tutorial 03: Multi-GPU Inference - Dual GPU setup
Explore Observability ⭐ NEW¶
- Tutorial 14: OpenTelemetry - Full OTel integration
- Tutorial 15: Real-time Monitoring - Live dashboards
- Tutorial 16: Production Stack - Complete stack
Deep Dive¶
- Architecture - System design
- API Reference - Complete API docs
- Performance - Optimization guide
- GGUF Guide - Quantization details
Common Patterns¶
Pattern 1: Basic Inference¶
server = ServerManager()
server.start_server(model_path=path)
client = LlamaCppClient()
response = client.chat.create(messages=[...])
server.stop_server()
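A slightly more defensive variant wraps the request in try/finally so the server is shut down even if the call raises; same classes as above, just reordered:
server = ServerManager()
server.start_server(model_path=path)
try:
    client = LlamaCppClient()
    response = client.chat.create(messages=[{"role": "user", "content": "Hello!"}])
    print(response.choices[0].message.content)
finally:
    server.stop_server()  # always release GPU 0, even on error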
Pattern 2: Inference + Telemetry¶
tracer, _ = setup_telemetry(service_name="llm")

server = ServerManager()
server.start_server(model_path=path)
client = LlamaCppClient()

with tracer.start_as_current_span("request"):
    response = client.chat.create(messages=[...])

server.stop_server()
Pattern 3: Split-GPU with Visualization¶
# GPU 0: Inference
server = ServerManager()
server.start_server(model_path=path, tensor_split="1.0,0.0")
# GPU 1: Analytics
import cudf
import graphistry
graphistry.register(api=3, server="hub.graphistry.com")
# Visualize
g = TracesGraphistry(spans=spans)
g.plot(render=True)
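If you want to sanity-check the GPU 1 analytics stack independently of TracesGraphistry, a minimal cuDF + PyGraphistry round trip looks roughly like this (the column names and the tiny edge list are made up for illustration, and graphistry.register needs valid Hub credentials):
import cudf
import graphistry

graphistry.register(api=3, server="hub.graphistry.com")  # plus your credentials

# Tiny hypothetical edge list: spans linked by parent/child relationships
edges = cudf.DataFrame({
    "src": ["request", "request", "chat"],
    "dst": ["chat", "tokenize", "generate"],
})

g = graphistry.edges(edges, "src", "dst")
g.plot(render=True)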
Tips and Best Practices¶
- Always use tensor_split="1.0,0.0" for the split-GPU workflow
- Enable FlashAttention with flash_attn=1 for a 2-3x speedup
- Use Q4_K_M quantization for balanced quality and performance
- Monitor VRAM with nvidia-smi to avoid OOM
- Start with small models (1B-5B) before trying larger ones
- Save outputs to /kaggle/working/ for persistence
- Clean up with server.stop_server() when done
Need Help?¶
- Troubleshooting - Common issues
- Kaggle Setup - Kaggle-specific guide
- GitHub Issues - Report bugs
- Discussions - Ask questions