Architecture Overview¶
llamatelemetry is a CUDA-first Python SDK that orchestrates llama-server for GGUF
model inference with optional OpenTelemetry observability, Kaggle environment management,
and GPU-accelerated graph analytics. The SDK is built on approximately 40 Python files
(13,000+ lines) across 10 modules, backed by 7 C++/CUDA source files (~650 lines).
High-Level Architecture¶
+-----------------------------------------------------------------------+
| User Application |
| engine = InferenceEngine() |
| engine.load_model("gemma-3-1b-Q4_K_M") |
| result = engine.infer("What is AI?", max_tokens=100) |
+-----------------------------------------------------------------------+
| | |
v v v
+----------------+ +------------------+ +------------------------+
| InferenceEngine| | Telemetry Layer | | Kaggle Environment |
| (__init__.py) | | (telemetry/) | | (kaggle/) |
| - load_model() | | - setup_telemetry| | - KaggleEnvironment |
| - infer() | | - OTel spans | | - split_gpu_session |
| - InferResult | | - GPU metrics | | - ServerPreset |
+-------+--------+ +--------+---------+ +-----------+------------+
| | |
v v v
+----------------+ +------------------+ +------------------------+
| ServerManager | | LlamaCppClient | | Graphistry / RAPIDS |
| (server.py) | | (api/client.py) | | (graphistry/) |
| - start/stop | | - chat API | | - GraphWorkload |
| - health check | | - completions | | - RAPIDSBackend |
| - auto-download| | - embeddings | | - knowledge graphs |
+-------+--------+ +--------+---------+ +-----------+------------+
| | |
v v v
+-----------------------------------------------------------------------+
| llama-server (binary) |
| - OpenAI-compatible REST API |
| - Multi-slot continuous batching |
| - CUDA 12 GPU acceleration (SM 7.5 / Tesla T4) |
+-----------------------------------------------------------------------+
|
v
+-----------------------------------------------------------------------+
| C++/CUDA Extension (csrc/) |
| - llamatelemetry_cpp pybind11 module |
| - Tensor RAII (6 DTypes), Device ops |
| - cuBLAS matmul (SGEMM / HGEMM) |
+-----------------------------------------------------------------------+
10-Module Structure¶
The Python package is organized into 10 self-contained modules plus several top-level utility files.
Core Modules¶
| Module | Files | Purpose |
|---|---|---|
__init__.py + server.py |
2 | InferenceEngine, InferResult, ServerManager -- high-level API and server lifecycle |
api/ |
4 | LlamaCppClient (OpenAI-compatible + native), MultiGPUConfig, GGUF utilities, NCCL bindings |
telemetry/ |
10 | OpenTelemetry integration: 45 gen_ai.* attributes, 5 metrics, auto-instrumentation, Graphistry export |
kaggle/ |
5 | KaggleEnvironment, split_gpu_session, server presets, secrets management, GPU context isolation |
Inference and GPU Modules¶
| Module | Files | Purpose |
|---|---|---|
inference/ |
4 | FlashAttention helpers, KV cache management, continuous batching |
cuda/ |
4 | CUDA graph capture, Triton kernel wrappers, TensorCore utilities |
quantization/ |
4 | NF4 quantization, GGUF conversion, dynamic quantization |
Analytics and Integration Modules¶
| Module | Files | Purpose |
|---|---|---|
graphistry/ |
5 | GPU-accelerated graph visualization, RAPIDS cuGraph, split-GPU workload management |
louie/ |
3 | AI-powered graph analysis, knowledge extraction via Louie.AI |
unsloth/ |
4 | Fine-tuning integration, LoRA adapter management, GGUF export |
Internal Module¶
| Module | Files | Purpose |
|---|---|---|
_internal/ |
3 | Bootstrap (auto-download ~961 MB of CUDA binaries), MODEL_REGISTRY with 30+ curated GGUF models |
C++/CUDA Layer (csrc/)¶
The native extension is built as llamatelemetry_cpp via pybind11 and CMake.
csrc/
bindings.cpp pybind11 module definition
core/
device.h / device.cu CUDA device discovery and management
tensor.h / tensor.cu Tensor RAII class (6 DTypes: float32, float16, bfloat16, int32, int8, uint8)
ops/
matmul.h / matmul.cu cuBLAS matrix multiplication (SGEMM for float32, HGEMM for float16)
Link dependencies: cudart_static, cublas_static, cublasLt_static
The C++/CUDA layer is compiled during the Kaggle build pipeline and distributed as part of the CUDA binary tar.gz artifact. It is not required for basic Python-only usage of the SDK.
Data Flow: Inference Request¶
The following sequence shows what happens when engine.infer() is called.
User Code
|
| engine.infer("What is AI?", max_tokens=100)
v
InferenceEngine.__init__.py
|
| 1. Resolve model (registry lookup or local path)
| 2. Ensure ServerManager is running
| 3. Build request payload
v
LlamaCppClient (api/client.py)
|
| 4. POST /completion or /v1/chat/completions
| 5. Parse JSON response into dataclasses
v
ServerManager (server.py)
|
| 6. Forward to llama-server process
| 7. llama-server runs CUDA inference
v
InferResult
|
| 8. Wrap response: success, text, tokens_generated,
| latency_ms, tokens_per_sec
v
(Optional) Telemetry
|
| 9. Emit gen_ai.* span attributes
| 10. Record metrics (latency, throughput, VRAM)
v
Return to User
Data Flow: Bootstrap Sequence¶
On first import, the SDK auto-configures itself.
import llamatelemetry
|
| 1. Set LD_LIBRARY_PATH to include bundled libs
| 2. Set LLAMA_SERVER_PATH to bundled binary
v
Binary exists?
|
+-- YES --> Ready
|
+-- NO --> _internal.bootstrap.bootstrap()
|
| 3. Download CUDA bundle from GitHub Releases
| (~961 MB: llama-server, llama-cli, libnccl.so, etc.)
| 4. Extract to llamatelemetry/binaries/cuda12/
| 5. Set environment variables
v
Ready
Design Patterns¶
Context Managers¶
The SDK uses Python context managers throughout for resource lifecycle management.
# InferenceEngine
with InferenceEngine() as engine:
engine.load_model("gemma-3-1b-Q4_K_M")
result = engine.infer("Hello")
# ServerManager
with ServerManager() as server:
server.start_server(model_path="model.gguf")
# NCCLCommunicator
with NCCLCommunicator(config) as comm:
comm.all_reduce(tensor)
# GPU context isolation
with split_gpu_session(llm_gpu=0, rapids_gpu=1):
# GPU 0 for inference, GPU 1 for analytics
pass
Lazy Loading and Optional Dependencies¶
Modules that depend on optional packages use try/except imports with graceful fallbacks. This keeps the core SDK lightweight.
# telemetry/__init__.py
_OTEL_AVAILABLE = False
try:
from opentelemetry import trace, metrics
_OTEL_AVAILABLE = True
except ImportError:
pass
# graphistry/__init__.py -- similar pattern for pygraphistry
# api/nccl.py -- similar pattern for NCCL shared library
Graceful Degradation¶
When an optional dependency is missing, the SDK warns the user but continues
to function. For example, setup_telemetry() returns (None, None) if
OpenTelemetry is not installed, and NCCL functions fall back to stubs when
libnccl.so is unavailable.
Dataclass-Driven API¶
The SDK uses 15+ dataclasses and enums for structured data:
InferResult-- inference response wrapperMessage,Choice,Usage,Timings-- chat completion typesCompletionResponse,EmbeddingsResponse-- API response typesMultiGPUConfig,GPUInfo,SplitMode-- GPU configurationNCCLConfig,NCCLInfo-- NCCL configurationGGUFMetadata,GGUFTensorInfo,GGMLType-- model metadataServerPreset,PresetConfig-- server configuration presetsPerformanceSnapshot,InferenceRecord-- monitoring types
PyTorch-Style Auto-Configuration¶
On import, the package automatically:
- Detects and configures
LD_LIBRARY_PATHfor bundled shared libraries - Locates or downloads the
llama-serverbinary - Sets
LLAMA_SERVER_PATHin the environment - Creates model cache directories
This means users can start with import llamatelemetry and immediately begin
inference without manual path configuration.
Dependency Graph¶
llamatelemetry (core)
+-- numpy
+-- requests
+-- huggingface_hub
+-- tqdm
+-- opentelemetry-api (optional)
+-- opentelemetry-sdk (optional)
|
+-- telemetry/ --> opentelemetry-exporter-otlp (optional)
+-- graphistry/ --> pygraphistry, pandas (optional)
+-- graphistry/ --> cudf, cugraph (optional, RAPIDS)
+-- cuda/ --> triton (optional)
+-- unsloth/ --> torch, unsloth (optional)
+-- api/nccl.py --> libnccl.so (optional, via ctypes)
+-- api/client.py --> sseclient (optional, for streaming)
+-- kaggle/ --> ipywidgets (optional)
+-- telemetry/monitor.py --> pynvml (optional, GPU metrics)
Runtime Characteristics¶
- Kaggle-first: optimized for Tesla T4 x2 dual-GPU environments (SM 7.5, 30 GB total VRAM)
- Hybrid bootstrap: CUDA binaries are downloaded on first use from GitHub Releases
- Split-GPU architecture: GPU 0 runs LLM inference, GPU 1 runs Graphistry/RAPIDS analytics
- Small model focus: optimized for 1B--5B parameter GGUF models
- Python 3.11+: uses modern Python features and type hints throughout
Where to Go Next¶
- Project File Map -- every file in the package annotated
- Release Artifacts -- source and binary release contents
- Core API Reference -- InferenceEngine, ServerManager, LlamaCppClient
- Telemetry Guide -- OpenTelemetry setup and usage