# Telemetry and Observability
llamatelemetry ships a real llamatelemetry.telemetry package in the current SDK
snapshot, so this guide sticks to what is clearly present in that code and
avoids overclaiming how broadly it has been validated.
The current package gives you four practical observability layers:
- OpenTelemetry tracer and meter setup
- GPU-aware metrics collection
- optional polling of the llama.cpp /metrics endpoint
- an instrumented client for llama-server-style APIs
## What is available in the current package
The telemetry package exposes these notable entry points:
- setup_telemetry()
- setup_grafana_otlp()
- get_metrics_collector()
- PerformanceMonitor
- InstrumentedLlamaCppClient
- helper utilities for span annotation and auto-instrumentation
## 1. Minimal setup with setup_telemetry()
```python
from llamatelemetry.telemetry import setup_telemetry

tracer, meter = setup_telemetry(
    service_name="llamatelemetry-demo",
    service_version="0.1.1",
    otlp_endpoint="http://localhost:4317",
    llama_server_url="http://127.0.0.1:8080",
    enable_llama_metrics=True,
    llama_metrics_interval=5.0,
)

print(tracer)
print(meter)
```
If OpenTelemetry SDK packages are not installed, this function returns (None,
None) and warns instead of pretending setup succeeded.
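If you want notebook code that degrades gracefully, you can branch on that return value. A minimal sketch that relies only on the documented (None, None) fallback:

```python
from llamatelemetry.telemetry import setup_telemetry

tracer, meter = setup_telemetry(service_name="llamatelemetry-demo")

if tracer is None or meter is None:
    # OpenTelemetry SDK packages are missing; continue without export
    # (or fall back to the in-process PerformanceMonitor shown later in this guide).
    print("Telemetry disabled: OpenTelemetry SDK not installed")
else:
    # Safe to create spans with the returned tracer.
    with tracer.start_as_current_span("smoke-test"):
        pass
```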
## 2. Kaggle- or secret-driven OTLP setup
The package also exposes setup_grafana_otlp() and Kaggle secret helpers.
```python
from llamatelemetry.telemetry import setup_grafana_otlp

tracer, meter = setup_grafana_otlp(
    service_name="llamatelemetry-kaggle",
    service_version="0.1.1",
    enable_llama_metrics=True,
)
```
This is useful when you want to rely on OTLP-related environment variables or Kaggle secrets rather than hard-coding the endpoint in notebook cells.
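One common pattern is to export the collector endpoint as an environment variable before calling the helper. A minimal sketch, assuming setup_grafana_otlp() falls back to the standard OTEL_EXPORTER_OTLP_* variables when no explicit endpoint is passed (that fallback is an assumption, not a documented guarantee):

```python
import os

from llamatelemetry.telemetry import setup_grafana_otlp

# Placeholder: point the standard OpenTelemetry exporter variable at your collector.
# Auth headers, if needed, can go in OTEL_EXPORTER_OTLP_HEADERS as "key=value" pairs.
os.environ.setdefault("OTEL_EXPORTER_OTLP_ENDPOINT", "https://otlp-gateway.example.com:4317")

tracer, meter = setup_grafana_otlp(
    service_name="llamatelemetry-kaggle",
    enable_llama_metrics=True,
)
```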
## 3. Enabling telemetry in InferenceEngine
At a higher level, telemetry is wired into InferenceEngine itself.
```python
import llamatelemetry as lt

engine = lt.InferenceEngine(
    server_url="http://127.0.0.1:8080",
    enable_telemetry=True,
    telemetry_config={
        "service_name": "my-inference-service",
        "service_version": "0.1.1",
        "otlp_endpoint": "http://localhost:4317",
        "enable_llama_metrics": True,
        "llama_metrics_interval": 5.0,
    },
)
```
When you call engine.infer(...), the engine can create spans and forward timing
and token information to the telemetry layer, provided the telemetry
dependencies are available.
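If you also keep the tracer returned by setup_telemetry(), you can wrap calls in your own parent span so the engine's spans nest under it (assuming the engine uses the active OpenTelemetry context). A minimal sketch; the exact infer(...) signature shown here, a single prompt string, is an assumption:

```python
import llamatelemetry as lt
from llamatelemetry.telemetry import setup_telemetry

tracer, meter = setup_telemetry(service_name="my-inference-service")
engine = lt.InferenceEngine(server_url="http://127.0.0.1:8080", enable_telemetry=True)

prompt = "Summarize OpenTelemetry in one sentence."

if tracer is not None:
    # Parent span: the engine's own spans should nest under it in the trace.
    with tracer.start_as_current_span("batch-of-prompts"):
        result = engine.infer(prompt)  # single-prompt signature assumed
else:
    result = engine.infer(prompt)  # single-prompt signature assumed

print(result)
```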
## 4. Recording data with the instrumented client
The telemetry package includes InstrumentedLlamaCppClient.
```python
from llamatelemetry.telemetry import InstrumentedLlamaCppClient, setup_telemetry

tracer, meter = setup_telemetry(
    service_name="client-demo",
    otlp_endpoint="http://localhost:4317",
)

client = InstrumentedLlamaCppClient(base_url="http://127.0.0.1:8080")
```
This client is the better choice when you want a lower-level, telemetry-aware
HTTP client rather than the full InferenceEngine abstraction.
## 5. Lightweight local monitoring with PerformanceMonitor
Not every workflow needs OTLP export. The package also includes an in-process monitor.
```python
from llamatelemetry.telemetry import PerformanceMonitor

monitor = PerformanceMonitor()
monitor.start()

monitor.record(latency_ms=120.0, tokens=64, success=True)
monitor.record(latency_ms=180.0, tokens=96, success=True)
monitor.record(latency_ms=0.0, tokens=0, success=False)

print(monitor.get_summary())
monitor.stop()
```
Use this for notebook experiments where you want fast local feedback without a full observability stack behind it.
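A common pattern is to wrap each request in a small timing helper and feed the result into the monitor. A minimal sketch using only the record() signature shown above; fake_request is a stand-in for whatever call you are actually measuring:

```python
import time

from llamatelemetry.telemetry import PerformanceMonitor

monitor = PerformanceMonitor()
monitor.start()

def fake_request(prompt: str) -> dict:
    # Stand-in for a real llama-server call; returns a token count like a client would.
    time.sleep(0.05)
    return {"tokens": len(prompt.split()) * 4}

for prompt in ["hello world", "explain observability", "write a haiku about GPUs"]:
    t0 = time.perf_counter()
    try:
        result = fake_request(prompt)
        monitor.record(
            latency_ms=(time.perf_counter() - t0) * 1000.0,
            tokens=result["tokens"],
            success=True,
        )
    except Exception:
        monitor.record(latency_ms=(time.perf_counter() - t0) * 1000.0, tokens=0, success=False)

print(monitor.get_summary())
monitor.stop()
```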
## 6. Access the active GPU metrics collector
```python
from llamatelemetry.telemetry import get_metrics_collector, setup_telemetry

setup_telemetry(service_name="metrics-demo")

collector = get_metrics_collector()
print(collector)
```
That collector is created during telemetry setup and is also used by some other parts of the package such as the NCCL-oriented code paths.
## 7. Kaggle secret loading
There are two relevant paths in the current codebase.
### Telemetry-level helper
```python
from llamatelemetry.telemetry import setup_otlp_env_from_kaggle_secrets

print(setup_otlp_env_from_kaggle_secrets())
```
### Kaggle pipeline helper
```python
from llamatelemetry.kaggle import load_grafana_otlp_env_from_kaggle

print(load_grafana_otlp_env_from_kaggle())
```
The Kaggle helper supports the more Grafana-specific naming convention, while the telemetry helper supports a more generic OTLP secret layout.
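In a notebook you can try the Grafana-specific helper first and fall back to the generic one before initializing telemetry. A minimal sketch, assuming both helpers return a truthy value when they find and apply secrets (their exact return types are not shown in this guide):

```python
from llamatelemetry.kaggle import load_grafana_otlp_env_from_kaggle
from llamatelemetry.telemetry import setup_grafana_otlp, setup_otlp_env_from_kaggle_secrets

# Assumption: both helpers return something truthy when secrets were found and applied.
loaded = load_grafana_otlp_env_from_kaggle() or setup_otlp_env_from_kaggle_secrets()

if loaded:
    tracer, meter = setup_grafana_otlp(service_name="llamatelemetry-kaggle")
else:
    print("No OTLP secrets found; running without export")
    tracer, meter = None, None
```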
## 8. A practical end-to-end pattern
```python
import llamatelemetry as lt
from llamatelemetry.telemetry import setup_telemetry

tracer, meter = setup_telemetry(
    service_name="llamatelemetry-demo",
    otlp_endpoint="http://localhost:4317",
    llama_server_url="http://127.0.0.1:8080",
    enable_llama_metrics=True,
)

engine = lt.InferenceEngine(
    server_url="http://127.0.0.1:8080",
    enable_telemetry=True,
    telemetry_config={
        "service_name": "llamatelemetry-demo",
        "otlp_endpoint": "http://localhost:4317",
        "enable_llama_metrics": True,
    },
)
```
That pattern is intentionally simple: initialize telemetry once, then let the engine reuse the same observability configuration.
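Because enable_llama_metrics depends on the server exposing /metrics, it is worth verifying that endpoint before relying on the polling. A quick check with requests; note that llama-server typically needs to be started with its --metrics option for this endpoint to exist:

```python
import requests

resp = requests.get("http://127.0.0.1:8080/metrics", timeout=5)
print(resp.status_code)
print(resp.text[:400])  # first few Prometheus-style lines, if the endpoint is enabled
```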
## 9. Documentation guardrails for this section
This guide now avoids a few claims that were too aggressive:
- it does not hard-claim a specific count like “45 semantic attributes” unless you freeze that count with tests and release validation
- it does not imply every backend is equally verified just because OTLP export exists in code
- it distinguishes between code present in the package and field-tested observability workflows
## Suggested secret names
For generic OTLP use:
- OTLP_ENDPOINT
- OTLP_TOKEN
For Grafana-style Kaggle setups:
- GRAFANA_OTLP_ENDPOINT
- GRAFANA_OTLP_HEADERS
- GRAFANA_OTLP_TOKEN
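On Kaggle these names correspond to entries in the notebook's Secrets add-on. The package helpers read them for you, but you can also inspect them directly with Kaggle's own client. A minimal sketch using kaggle_secrets, which is only available inside Kaggle notebooks:

```python
from kaggle_secrets import UserSecretsClient

secrets = UserSecretsClient()

# get_secret raises if the secret is not attached to the notebook.
endpoint = secrets.get_secret("GRAFANA_OTLP_ENDPOINT")
token = secrets.get_secret("GRAFANA_OTLP_TOKEN")

print(endpoint)
print("token length:", len(token))  # avoid printing the token itself
```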