Telemetry and Observability

llamatelemetry ships an actual llamatelemetry.telemetry package in the SDK snapshot documented here, so this guide describes what is clearly present in the code and avoids overclaiming what has been broadly validated.

The current package gives you four practical observability layers:

  • OpenTelemetry tracer and meter setup
  • GPU-aware metrics collection
  • Optional llama.cpp /metrics polling
  • An instrumented client for llama-server-style APIs

What is available in the current package

The telemetry package exposes these notable entry points:

  • setup_telemetry()
  • setup_grafana_otlp()
  • get_metrics_collector()
  • PerformanceMonitor
  • InstrumentedLlamaCppClient
  • helper utilities for span annotation and auto-instrumentation

1. Minimal setup with setup_telemetry()

from llamatelemetry.telemetry import setup_telemetry

tracer, meter = setup_telemetry(
    service_name="llamatelemetry-demo",
    service_version="0.1.1",
    otlp_endpoint="http://localhost:4317",
    llama_server_url="http://127.0.0.1:8080",
    enable_llama_metrics=True,
    llama_metrics_interval=5.0,
)

print(tracer)
print(meter)

If the OpenTelemetry SDK packages are not installed, this function returns (None, None) and emits a warning rather than pretending setup succeeded.
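
A minimal guard for that fallback might look like the sketch below; it only assumes the (None, None) return described above.

from llamatelemetry.telemetry import setup_telemetry

tracer, meter = setup_telemetry(service_name="llamatelemetry-demo")

if tracer is None or meter is None:
    # The OpenTelemetry SDK packages are missing, so no spans or metrics
    # will be exported; fall back to local-only monitoring instead.
    print("Telemetry disabled: install the OpenTelemetry SDK packages to enable export.")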

2. Kaggle- or secret-driven OTLP setup

The package also exposes setup_grafana_otlp() and Kaggle secret helpers.

from llamatelemetry.telemetry import setup_grafana_otlp

tracer, meter = setup_grafana_otlp(
    service_name="llamatelemetry-kaggle",
    service_version="0.1.1",
    enable_llama_metrics=True,
)

This is useful when you want to rely on OTLP-related environment variables or Kaggle secrets rather than hard-coding the endpoint in notebook cells.
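
As an illustration, you could export the standard OpenTelemetry variables before calling the helper; whether setup_grafana_otlp() reads these exact names (versus Kaggle secrets) is an assumption here, not confirmed API.

import os

from llamatelemetry.telemetry import setup_grafana_otlp

# Standard OpenTelemetry exporter variables; whether the helper reads these
# exact names (rather than Kaggle secrets) depends on your installed version.
os.environ.setdefault("OTEL_EXPORTER_OTLP_ENDPOINT", "https://otlp.example.com:4317")
os.environ.setdefault("OTEL_EXPORTER_OTLP_HEADERS", "authorization=Bearer YOUR_TOKEN")

tracer, meter = setup_grafana_otlp(service_name="llamatelemetry-kaggle")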

3. Enabling telemetry in InferenceEngine

At a higher level, telemetry is wired directly into InferenceEngine itself.

import llamatelemetry as lt

engine = lt.InferenceEngine(
    server_url="http://127.0.0.1:8080",
    enable_telemetry=True,
    telemetry_config={
        "service_name": "my-inference-service",
        "service_version": "0.1.1",
        "otlp_endpoint": "http://localhost:4317",
        "enable_llama_metrics": True,
        "llama_metrics_interval": 5.0,
    },
)

When you run engine.infer(...), the engine can create spans and forward timing and token information into the telemetry layer, provided the telemetry dependencies are available.
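
A single instrumented call might then look like the sketch below; the exact infer() signature (a positional prompt, available keyword arguments) is an assumption and may differ in your installed version.

# Sketch only: infer() is assumed to accept a prompt string here.
result = engine.infer("Summarize GPU observability in one sentence.")
print(result)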

4. Recording data with the instrumented client

The telemetry package includes InstrumentedLlamaCppClient.

from llamatelemetry.telemetry import InstrumentedLlamaCppClient, setup_telemetry

tracer, meter = setup_telemetry(
    service_name="client-demo",
    otlp_endpoint="http://localhost:4317",
)

client = InstrumentedLlamaCppClient(base_url="http://127.0.0.1:8080")

This client is the better choice when you want a lower-level, telemetry-aware HTTP client rather than the full InferenceEngine abstraction.

5. Lightweight local monitoring with PerformanceMonitor

Not every workflow needs OTLP export. The package also includes an in-process monitor.

from llamatelemetry.telemetry import PerformanceMonitor

monitor = PerformanceMonitor()
monitor.start()

monitor.record(latency_ms=120.0, tokens=64, success=True)
monitor.record(latency_ms=180.0, tokens=96, success=True)
monitor.record(latency_ms=0.0, tokens=0, success=False)

print(monitor.get_summary())
monitor.stop()

Use this for notebook experiments where you want fast local feedback without a full observability stack behind it.
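
One notebook-friendly sketch is to time your own calls and feed the results into record(); the wrapper below is illustrative, and the tokens=0 placeholder assumes you substitute a real token count when your result exposes one.

import time

def timed_record(monitor, fn, *args, **kwargs):
    # Time any callable and report the outcome to the monitor.
    start = time.perf_counter()
    try:
        result = fn(*args, **kwargs)
        success = True
    except Exception:
        result, success = None, False
    latency_ms = (time.perf_counter() - start) * 1000.0
    # tokens=0 is a placeholder; pass the real count if your result includes it.
    monitor.record(latency_ms=latency_ms, tokens=0, success=success)
    return result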

6. Accessing the active GPU metrics collector

from llamatelemetry.telemetry import get_metrics_collector, setup_telemetry

setup_telemetry(service_name="metrics-demo")
collector = get_metrics_collector()
print(collector)

The collector is created during telemetry setup and is also used by other parts of the package, such as the NCCL-oriented code paths.

7. Kaggle secret loading

There are two relevant paths in the current codebase.

Telemetry-level helper

from llamatelemetry.telemetry import setup_otlp_env_from_kaggle_secrets

print(setup_otlp_env_from_kaggle_secrets())

Kaggle pipeline helper

from llamatelemetry.kaggle import load_grafana_otlp_env_from_kaggle

print(load_grafana_otlp_env_from_kaggle())

The Kaggle helper supports the more Grafana-specific naming convention, while the telemetry helper supports a more generic OTLP secret layout.
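
One illustrative way to combine them is to try the Grafana-specific names first and fall back to the generic layout; treating a falsy return value as "no secrets found" is an assumption about these helpers, not confirmed behavior.

from llamatelemetry.kaggle import load_grafana_otlp_env_from_kaggle
from llamatelemetry.telemetry import setup_otlp_env_from_kaggle_secrets

# Assumption: each helper reports whether it found usable secrets.
if not load_grafana_otlp_env_from_kaggle():
    setup_otlp_env_from_kaggle_secrets()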

8. A practical end-to-end pattern

import llamatelemetry as lt
from llamatelemetry.telemetry import setup_telemetry

tracer, meter = setup_telemetry(
    service_name="llamatelemetry-demo",
    otlp_endpoint="http://localhost:4317",
    llama_server_url="http://127.0.0.1:8080",
    enable_llama_metrics=True,
)

engine = lt.InferenceEngine(
    server_url="http://127.0.0.1:8080",
    enable_telemetry=True,
    telemetry_config={
        "service_name": "llamatelemetry-demo",
        "otlp_endpoint": "http://localhost:4317",
        "enable_llama_metrics": True,
    },
)

That pattern is intentionally simple: initialize telemetry once, then let the engine reuse the same observability configuration.

9. Documentation guardrails for this section

This guide deliberately avoids a few claims that would be too aggressive:

  • it does not claim a specific count such as “45 semantic attributes” unless that count is frozen by tests and release validation
  • it does not imply every backend is equally verified just because OTLP export exists in code
  • it distinguishes between code present in the package and field-tested observability workflows

Suggested secret names

For generic OTLP use:

  • OTLP_ENDPOINT
  • OTLP_TOKEN

For Grafana-style Kaggle setups:

  • GRAFANA_OTLP_ENDPOINT
  • GRAFANA_OTLP_HEADERS
  • GRAFANA_OTLP_TOKEN
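
If you prefer to wire the secrets up yourself rather than use the helpers above, Kaggle's UserSecretsClient can read them directly; the mapping onto OpenTelemetry environment variables below is illustrative, not the package's own behavior.

import os

from kaggle_secrets import UserSecretsClient

secrets = UserSecretsClient()

# Illustrative mapping: expose Grafana-style Kaggle secrets through the
# standard OpenTelemetry exporter variables before initializing telemetry.
os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"] = secrets.get_secret("GRAFANA_OTLP_ENDPOINT")
os.environ["OTEL_EXPORTER_OTLP_HEADERS"] = secrets.get_secret("GRAFANA_OTLP_HEADERS")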