
First Steps

Now that you have installed llamatelemetry, this guide walks you through the core concepts and the basic workflow.


Core Concepts

1. Split-GPU Architecture

llamatelemetry uses a split-GPU design to make full use of Kaggle's dual NVIDIA T4 GPUs (a quick check of both GPUs follows the list):

  • GPU 0 (15GB): Runs llama-server for LLM inference
  • GPU 1 (15GB): Runs RAPIDS cuGraph + Graphistry for analytics
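
To confirm that both GPUs are visible before assigning workloads, you can list them from a notebook cell (a minimal check; nvidia-smi is available on Kaggle GPU instances):

# List the GPUs visible to this session (expect two Tesla T4 entries on a dual-T4 instance)
!nvidia-smi -L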

2. GGUF Models

llamatelemetry uses GGUF models, llama.cpp's single-file quantized model format (a snippet for listing a repo's GGUF files follows this list):

  • Quantized weights (1-8 bits)
  • Optimized for CPU/GPU inference
  • 29 quantization types available
  • Recommended: Q4_K_M for balance
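
To see which quantizations a repository actually ships, you can list its GGUF files first; a minimal sketch using huggingface_hub (the repo id matches the one used later in this guide):

from huggingface_hub import list_repo_files

# Show only the .gguf files so the available quantizations are easy to compare
gguf_files = [f for f in list_repo_files("unsloth/gemma-3-1b-it-GGUF") if f.endswith(".gguf")]
print("\n".join(gguf_files))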

3. OpenTelemetry Integration

Full observability with OpenTelemetry (a small metrics sketch follows the list):

  • Traces: Request flow and latency
  • Metrics: Tokens/sec, VRAM, GPU temperature
  • Semantic conventions: LLM-specific attributes
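
As a small illustration of the metrics side, here is a self-contained sketch that records a token count with the standard OpenTelemetry API (the instrument name llm.tokens and the attribute are illustrative, not fixed conventions):

from opentelemetry import metrics

# Obtain a meter from the globally configured MeterProvider (a no-op meter if none is set)
meter = metrics.get_meter("first-steps-example")

# Create a counter and record the tokens generated by one request
token_counter = meter.create_counter("llm.tokens", unit="1", description="Tokens generated")
token_counter.add(42, attributes={"llm.model": "gemma-3-1b-it"})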

4. llama-server

A built-in llama.cpp server (llama-server) provides the following (a raw HTTP example follows the list):

  • OpenAI-compatible API
  • FlashAttention v2 support
  • Multi-GPU tensor parallelism
  • Streaming responses
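
Because the API is OpenAI-compatible, you can also call it with plain HTTP; a minimal sketch using requests against the default endpoint used later in this guide (assumes the server is already running):

import requests

# Call the OpenAI-compatible chat endpoint directly
resp = requests.post(
    "http://127.0.0.1:8080/v1/chat/completions",
    json={"messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 50},
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])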

Basic Workflow

1. Install and Verify

# Install
!pip install -q --no-cache-dir --force-reinstall \
    git+https://github.com/llamatelemetry/llamatelemetry.git@v0.1.0

# Verify
import llamatelemetry
print(llamatelemetry.__version__)

2. Download Model

from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="unsloth/gemma-3-1b-it-GGUF",
    filename="gemma-3-1b-it-Q4_K_M.gguf",
    local_dir="/kaggle/working/models",
)
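
A quick sanity check that the download completed is to look at the file size on disk (hf_hub_download returns the local path; a Q4_K_M quantization of a 1B model is typically well under 1 GB):

import os

# Print the downloaded model size in GB as a rough sanity check
print(f"Model size: {os.path.getsize(model_path) / 1e9:.2f} GB")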

3. Start Server

from llamatelemetry.server import ServerManager

server = ServerManager()
server.start_server(
    model_path=model_path,
    gpu_layers=99,              # All layers on GPU
    tensor_split="1.0,0.0",     # GPU 0 only
    flash_attn=1,               # Enable FlashAttention
)
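
If start_server returns before the model has finished loading, you can poll the health endpoint that llama-server exposes before sending requests (a minimal sketch, giving up after about a minute):

import time
import requests

# Wait until the llama-server health endpoint reports ready
for _ in range(60):
    try:
        if requests.get("http://127.0.0.1:8080/health", timeout=2).status_code == 200:
            print("Server is ready")
            break
    except requests.exceptions.ConnectionError:
        pass
    time.sleep(1)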

4. Run Inference

from llamatelemetry.api.client import LlamaCppClient

client = LlamaCppClient(base_url="http://127.0.0.1:8080")

response = client.chat.create(
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=50
)

print(response.choices[0].message.content)

5. Add Observability

from llamatelemetry.telemetry import setup_telemetry

tracer, meter = setup_telemetry(service_name="my-llm-service")

with tracer.start_as_current_span("chat") as span:
    response = client.chat.create(
        messages=[{"role": "user", "content": "Hello!"}],
        max_tokens=50
    )
    span.set_attribute("llm.tokens", response.usage.completion_tokens)

6. Visualize with Graphistry

from llamatelemetry.graphistry import TracesGraphistry

# Collect finished spans from the in-memory exporter (memory_exporter; see the sketch below)
spans = memory_exporter.get_finished_spans()

# Visualize on GPU 1
g = TracesGraphistry(spans=spans)
g.plot(render=True)
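
The memory_exporter above is assumed to be an OpenTelemetry InMemorySpanExporter attached to the tracer provider. If setup_telemetry does not create one for you, a minimal sketch of the wiring with the standard OpenTelemetry SDK looks like this:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter

# Keep finished spans in memory so they can be handed to Graphistry later
memory_exporter = InMemorySpanExporter()
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(memory_exporter))
trace.set_tracer_provider(provider)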

Key Classes and Functions

ServerManager

Manages llama-server lifecycle:

from llamatelemetry.server import ServerManager

server = ServerManager()
server.start_server(model_path=path, gpu_layers=99)
server.stop_server()

LlamaCppClient

OpenAI-compatible client:

from llamatelemetry.api.client import LlamaCppClient

client = LlamaCppClient(base_url="http://127.0.0.1:8080")

# Chat completion
response = client.chat.create(messages=[...])

# Streaming
for chunk in client.chat.create(messages=[...], stream=True):
    print(chunk.choices[0].delta.content, end="")

Telemetry Setup

Initialize OpenTelemetry:

from llamatelemetry.telemetry import setup_telemetry

tracer, meter = setup_telemetry(
    service_name="my-service",
    otlp_endpoint="http://localhost:4317"
)

GPU Monitoring

Monitor GPU metrics:

from llamatelemetry.telemetry.gpu_monitor import GPUMonitor

monitor = GPUMonitor(gpu_id=0, interval=1.0)
monitor.start()

# Get metrics
metrics = monitor.get_metrics()
print(f"VRAM: {metrics['memory_used_mb']} MB")
print(f"Temp: {metrics['temperature_c']}°C")

Understanding Quantization

What is Quantization?

Quantization reduces model size by storing each weight in fewer bits (rough size arithmetic is sketched after this list):

  • FP16: 16-bit floats (baseline)
  • Q8: 8-bit quantization (~50% size)
  • Q4: 4-bit quantization (~25% size)
  • Q2: 2-bit quantization (~12.5% size)
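
As a rough worked example of what those ratios mean in practice, here is a back-of-the-envelope size estimate for a 1B-parameter model (weights only; real GGUF files are slightly larger because of per-block metadata and non-quantized tensors):

# Approximate weight storage for a 1-billion-parameter model at different bit widths
params = 1_000_000_000
for name, bits in [("FP16", 16), ("Q8", 8), ("Q4", 4), ("Q2", 2)]:
    print(f"{name}: ~{params * bits / 8 / 1e9:.2f} GB")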

K-Quants vs I-Quants

K-Quants (recommended):

  • Better quality
  • Layer-specific quantization
  • Examples: Q4_K_M, Q5_K_S

I-Quants (newer):

  • Even smaller size
  • Importance-weighted quantization
  • Examples: IQ4_XS, IQ3_M

Use Case        Quantization   Quality   Speed
Production      Q4_K_M         Good      Fast
High quality    Q5_K_M         Better    Medium
Maximum speed   Q4_0           OK        Fastest
Smallest size   IQ3_M          Lower     Fast

Next Steps

  1. Quick Start - Get running in 5 minutes
  2. Tutorial 01: Quick Start - Detailed walkthrough
  3. Tutorial 02: Server Setup - Advanced configuration
  4. Tutorial 03: Multi-GPU Inference - Dual GPU setup

Explore Observability ⭐ NEW

  1. Tutorial 14: OpenTelemetry - Full OTel integration
  2. Tutorial 15: Real-time Monitoring - Live dashboards
  3. Tutorial 16: Production Stack - Complete stack

Deep Dive


Common Patterns

Pattern 1: Basic Inference

server = ServerManager()
server.start_server(model_path=path)

client = LlamaCppClient()
response = client.chat.create(messages=[...])

server.stop_server()

Pattern 2: Inference + Telemetry

tracer, _ = setup_telemetry(service_name="llm")

server = ServerManager()
server.start_server(model_path=path)

client = LlamaCppClient()
with tracer.start_as_current_span("request"):
    response = client.chat.create(messages=[...])

server.stop_server()

Pattern 3: Split-GPU with Visualization

# GPU 0: Inference
server = ServerManager()
server.start_server(model_path=path, tensor_split="1.0,0.0")

# GPU 1: Analytics
import cudf        # RAPIDS GPU dataframes, available for further analysis on GPU 1
import graphistry
from llamatelemetry.graphistry import TracesGraphistry

# Register with your Graphistry Hub account (credentials or an API key are also needed)
graphistry.register(api=3, server="hub.graphistry.com")

# Visualize spans collected earlier (see step 6 above) on GPU 1
g = TracesGraphistry(spans=spans)
g.plot(render=True)

Tips and Best Practices

  1. Always use tensor_split="1.0,0.0" for the split-GPU workflow
  2. Enable FlashAttention with flash_attn=1 for a 2-3x speedup
  3. Use Q4_K_M quantization for a good balance of quality and performance
  4. Monitor VRAM with nvidia-smi to avoid OOM (see the snippet after this list)
  5. Start with small models (1B-5B) before trying larger ones
  6. Save outputs to /kaggle/working/ for persistence
  7. Clean up with server.stop_server() when done
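
For tip 4, a quick per-GPU VRAM and temperature snapshot from a notebook cell (standard nvidia-smi query flags):

# Per-GPU memory usage and temperature
!nvidia-smi --query-gpu=index,name,memory.used,memory.total,temperature.gpu --format=csv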

Need Help?