
First Steps

Now that you have installed llamatelemetry, this guide walks you through the core concepts and the basic workflow.


Core Concepts

1. Split-GPU Architecture

llamatelemetry uses a split-GPU design to make full use of Kaggle's dual NVIDIA T4 GPUs (a quick check of both GPUs follows the list):

  • GPU 0 (15GB): Runs llama-server for LLM inference
  • GPU 1 (15GB): Runs RAPIDS cuGraph + Graphistry for analytics
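
To confirm that both GPUs are visible before assigning workloads, you can list them from a notebook cell (a minimal check; nvidia-smi is available on Kaggle GPU instances):

# List the GPUs visible to this session (expect two Tesla T4 entries on a dual-T4 instance)
!nvidia-smi -L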

2. GGUF Models

llamatelemetry uses GGUF models, llama.cpp's single-file quantized model format (a snippet for listing a repo's GGUF files follows this list):

  • Quantized weights (1-8 bits)
  • Optimized for CPU/GPU inference
  • 29 quantization types available
  • Recommended: Q4_K_M for balance
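
To see which quantizations a repository actually ships, you can list its GGUF files first; a minimal sketch using huggingface_hub (the repo id matches the one used later in this guide):

from huggingface_hub import list_repo_files

# Show only the .gguf files so the available quantizations are easy to compare
gguf_files = [f for f in list_repo_files("unsloth/gemma-3-1b-it-GGUF") if f.endswith(".gguf")]
print("\n".join(gguf_files))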

3. OpenTelemetry Integration

Full observability with OpenTelemetry (a small metrics sketch follows the list):

  • Traces: Request flow and latency
  • Metrics: Tokens/sec, VRAM, GPU temperature
  • Semantic conventions: LLM-specific attributes
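
As a small illustration of the metrics side, here is a self-contained sketch that records a token count with the standard OpenTelemetry API (the instrument name llm.tokens and the attribute are illustrative, not fixed conventions):

from opentelemetry import metrics

# Obtain a meter from the globally configured MeterProvider (a no-op meter if none is set)
meter = metrics.get_meter("first-steps-example")

# Create a counter and record the tokens generated by one request
token_counter = meter.create_counter("llm.tokens", unit="1", description="Tokens generated")
token_counter.add(42, attributes={"llm.model": "gemma-3-1b-it"})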

4. llama-server

A built-in llama.cpp server (llama-server) provides the following (a raw HTTP example follows the list):

  • OpenAI-compatible API
  • FlashAttention v2 support
  • Multi-GPU tensor parallelism
  • Streaming responses
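
Because the API is OpenAI-compatible, you can also call it with plain HTTP; a minimal sketch using requests against the default endpoint used later in this guide (assumes the server is already running):

import requests

# Call the OpenAI-compatible chat endpoint directly
resp = requests.post(
    "http://127.0.0.1:8080/v1/chat/completions",
    json={"messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 50},
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])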

Basic Workflow

1. Install and Verify

# Install
!pip install -q --no-cache-dir --force-reinstall \
    git+https://github.com/llamatelemetry/llamatelemetry.git@v0.1.0

# Verify
import llamatelemetry
print(llamatelemetry.__version__)

2. Download Model

from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="unsloth/gemma-3-1b-it-GGUF",
    filename="gemma-3-1b-it-Q4_K_M.gguf",
    local_dir="/kaggle/working/models",
)
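
A quick sanity check that the download completed is to look at the file size on disk (hf_hub_download returns the local path; a Q4_K_M quantization of a 1B model is typically well under 1 GB):

import os

# Print the downloaded model size in GB as a rough sanity check
print(f"Model size: {os.path.getsize(model_path) / 1e9:.2f} GB")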

3. Start Server

from llamatelemetry.server import ServerManager

server = ServerManager()
server.start_server(
    model_path=model_path,
    gpu_layers=99,              # All layers on GPU
    tensor_split="1.0,0.0",     # GPU 0 only
    flash_attn=1,               # Enable FlashAttention
)
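
If start_server returns before the model has finished loading, you can poll the health endpoint that llama-server exposes before sending requests (a minimal sketch, giving up after about a minute):

import time
import requests

# Wait until the llama-server health endpoint reports ready
for _ in range(60):
    try:
        if requests.get("http://127.0.0.1:8080/health", timeout=2).status_code == 200:
            print("Server is ready")
            break
    except requests.exceptions.ConnectionError:
        pass
    time.sleep(1)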

4. Run Inference

from llamatelemetry.api.client import LlamaCppClient

client = LlamaCppClient(base_url="http://127.0.0.1:8080")

response = client.chat.create(
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=50
)

print(response.choices[0].message.content)

5. Add Observability

from llamatelemetry.telemetry import setup_telemetry

tracer, meter = setup_telemetry(service_name="my-llm-service")

with tracer.start_as_current_span("chat") as span:
    response = client.chat.create(
        messages=[{"role": "user", "content": "Hello!"}],
        max_tokens=50
    )
    span.set_attribute("llm.tokens", response.usage.completion_tokens)

6. Visualize with Graphistry

from llamatelemetry.graphistry import TracesGraphistry

# Collect finished spans from the in-memory exporter (memory_exporter; see the sketch below)
spans = memory_exporter.get_finished_spans()

# Visualize on GPU 1
g = TracesGraphistry(spans=spans)
g.plot(render=True)
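
The memory_exporter above is assumed to be an OpenTelemetry InMemorySpanExporter attached to the tracer provider. If setup_telemetry does not create one for you, a minimal sketch of the wiring with the standard OpenTelemetry SDK looks like this:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter

# Keep finished spans in memory so they can be handed to Graphistry later
memory_exporter = InMemorySpanExporter()
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(memory_exporter))
trace.set_tracer_provider(provider)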

Key Classes and Functions

ServerManager

Manages llama-server lifecycle:

from llamatelemetry.server import ServerManager

server = ServerManager()
server.start_server(model_path=path, gpu_layers=99)
server.stop_server()

LlamaCppClient

OpenAI-compatible client:

from llamatelemetry.api.client import LlamaCppClient

client = LlamaCppClient(base_url="http://127.0.0.1:8080")

# Chat completion
response = client.chat.create(messages=[...])

# Streaming
for chunk in client.chat.create(messages=[...], stream=True):
    print(chunk.choices[0].delta.content, end="")

Telemetry Setup

Initialize OpenTelemetry:

from llamatelemetry.telemetry import setup_telemetry

tracer, meter = setup_telemetry(
    service_name="my-service",
    otlp_endpoint="http://localhost:4317"
)

GPU Monitoring

Monitor GPU metrics:

from llamatelemetry.telemetry.gpu_monitor import GPUMonitor

monitor = GPUMonitor(gpu_id=0, interval=1.0)
monitor.start()

# Get metrics
metrics = monitor.get_metrics()
print(f"VRAM: {metrics['memory_used_mb']} MB")
print(f"Temp: {metrics['temperature_c']}°C")

Understanding Quantization

What is Quantization?

Quantization reduces model size by storing each weight in fewer bits (rough size arithmetic is sketched after this list):

  • FP16: 16-bit floats (baseline)
  • Q8: 8-bit quantization (~50% size)
  • Q4: 4-bit quantization (~25% size)
  • Q2: 2-bit quantization (~12.5% size)
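
As a rough worked example of what those ratios mean in practice, here is a back-of-the-envelope size estimate for a 1B-parameter model (weights only; real GGUF files are slightly larger because of per-block metadata and non-quantized tensors):

# Approximate weight storage for a 1-billion-parameter model at different bit widths
params = 1_000_000_000
for name, bits in [("FP16", 16), ("Q8", 8), ("Q4", 4), ("Q2", 2)]:
    print(f"{name}: ~{params * bits / 8 / 1e9:.2f} GB")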

K-Quants vs I-Quants

K-Quants (recommended):

  • Better quality
  • Layer-specific quantization
  • Examples: Q4_K_M, Q5_K_S

I-Quants (newer):

  • Even smaller size
  • Importance-weighted quantization
  • Examples: IQ4_XS, IQ3_M

Use Case        Quantization   Quality   Speed
Production      Q4_K_M         Good      Fast
High quality    Q5_K_M         Better    Medium
Maximum speed   Q4_0           OK        Fastest
Smallest size   IQ3_M          Lower     Fast

Next Steps

  1. Quick Start - Get running in 5 minutes
  2. Tutorial 01: Quick Start - Detailed walkthrough
  3. Tutorial 02: Server Setup - Advanced configuration
  4. Tutorial 03: Multi-GPU Inference - Dual GPU setup

Explore Observability ⭐ NEW

  1. Tutorial 14: OpenTelemetry - Full OTel integration
  2. Tutorial 15: Real-time Monitoring - Live dashboards
  3. Tutorial 16: Production Stack - Complete stack

Deep Dive


Common Patterns

Pattern 1: Basic Inference

server = ServerManager()
server.start_server(model_path=path)

client = LlamaCppClient()
response = client.chat.create(messages=[...])

server.stop_server()

Pattern 2: Inference + Telemetry

tracer, _ = setup_telemetry(service_name="llm")

server = ServerManager()
server.start_server(model_path=path)

client = LlamaCppClient()
with tracer.start_as_current_span("request"):
    response = client.chat.create(messages=[...])

server.stop_server()

Pattern 3: Split-GPU with Visualization

# GPU 0: Inference
server = ServerManager()
server.start_server(model_path=path, tensor_split="1.0,0.0")

# GPU 1: Analytics
import cudf        # RAPIDS GPU dataframes, available for further analysis on GPU 1
import graphistry
from llamatelemetry.graphistry import TracesGraphistry

# Register with your Graphistry Hub account (credentials or an API key are also needed)
graphistry.register(api=3, server="hub.graphistry.com")

# Visualize spans collected earlier (see step 6 above) on GPU 1
g = TracesGraphistry(spans=spans)
g.plot(render=True)

Tips and Best Practices

  1. Always use tensor_split="1.0,0.0" for the split-GPU workflow
  2. Enable FlashAttention with flash_attn=1 for a 2-3x speedup
  3. Use Q4_K_M quantization for a good balance of quality and performance
  4. Monitor VRAM with nvidia-smi to avoid OOM (see the snippet after this list)
  5. Start with small models (1B-5B) before trying larger ones
  6. Save outputs to /kaggle/working/ for persistence
  7. Clean up with server.stop_server() when done
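
For tip 4, a quick per-GPU VRAM and temperature snapshot from a notebook cell (standard nvidia-smi query flags):

# Per-GPU memory usage and temperature
!nvidia-smi --query-gpu=index,name,memory.used,memory.total,temperature.gpu --format=csv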

Need Help?