Changelog¶

All notable changes to llamatelemetry are documented in this file.

For the authoritative source-level changelog see the GitHub repository: github.com/llamatelemetry/llamatelemetry/blob/main/CHANGELOG.md

v0.1.1 — 2026-02-02¶

Initial public release of the llamatelemetry Python SDK. This release establishes the full foundation: a CUDA-first OpenTelemetry SDK for LLM inference observability using llama.cpp as the backend.

Highlights¶

High-level InferenceEngine for zero-boilerplate GGUF inference on CUDA GPUs
Auto-download T4-optimized llama.cpp binary bundle (~961 MB) on first import
OpenTelemetry tracing and GPU metrics with 45 gen_ai.* semantic convention attributes
Kaggle T4 x2 presets for split-GPU inference workflows
Graphistry + RAPIDS cuGraph visualization integration
18 Jupyter tutorial notebooks spanning foundation to observability workflows
246 passing tests across 7 test files
MIT License

Core Package (`llamatelemetry/init.py`)¶

Added:

InferenceEngine — high-level interface for LLM inference lifecycle
load_model(model_name_or_path, ...) — load GGUF models from registry, local path, or HuggingFace
infer(prompt, max_tokens, temperature, ...) → InferResult — single-shot inference
generate(prompt, ...) — alias for infer()
infer_stream(prompt, ...) — streaming SSE inference with token callback
batch_infer(prompts, ...) — parallel batch processing
get_metrics() / reset_metrics() — wall-clock performance tracking
unload_model() — explicit model cleanup
Full context manager support (with InferenceEngine() as engine:)
Optional OpenTelemetry integration via enable_telemetry=True
InferResult — result dataclass with .success, .text, .tokens_generated, .latency_ms, .tokens_per_sec, .error_message
is_cuda_available() — check CUDA availability at runtime
get_cuda_device_count() — count available CUDA devices

Server Management (`server.py`)¶

Added:

ServerManager — complete lifecycle manager for the llama-server binary
Start, stop, health-check, and restart the HTTP server process
Configurable: host, port, gpu_layers, ctx_size, batch_size, ubatch_size
Multi-GPU: split_mode, main_gpu, tensor_split, n_parallel
Performance: flash_attn, cont_batching, mlock, no_mmap
Auto-download the CUDA binary bundle if not present (via _internal.bootstrap)
Context manager support for clean startup/shutdown
Prometheus /metrics endpoint polling

API Client (`api/client.py`)¶

Added:

LlamaCppClient — full HTTP client for the llama-server REST API
Lazy sub-clients via properties: .chat, .embeddings, .models, .slots, .lora
ChatCompletionsAPI — OpenAI-compatible POST /v1/chat/completions with streaming
EmbeddingsClientAPI — POST /v1/embeddings, batch support
ModelsClientAPI — GET /v1/models, model metadata
SlotsClientAPI — multi-slot KV cache management (GET /slots, POST /slots/:id/save, etc.)
LoraClientAPI — LoRA adapter hot-swap (GET /lora-adapters, POST /lora-adapters)
Native completion with 20+ sampling parameters: Mirostat, DRY, XTC, dynamic temperature, penalty ranges
Grammar-guided generation (grammar, json_schema)
SSE streaming with sseclient
Full response dataclasses: Message, Choice, Usage, Timings, CompletionResponse, TokenizeResponse, etc.

Multi-GPU (`api/multigpu.py`)¶

Added:

SplitMode enum — NONE, LAYER, ROW
GPUInfo dataclass — name, VRAM, compute capability per device
MultiGPUConfig dataclass — unified multi-GPU config for InferenceEngine and ServerManager
detect_gpus() — enumerate CUDA devices
gpu_count() — count available GPUs
get_cuda_version() — CUDA runtime version
get_total_vram() / get_free_vram() — VRAM accounting
estimate_model_vram() — VRAM estimate from model size and quantization
can_fit_model() — VRAM sufficiency check
recommend_quantization() — auto-select GGUF quant type for available VRAM
kaggle_t4_dual_config() — pre-tuned dual-T4 split-layer config
colab_t4_single_config() — single-T4 config
auto_config() — automatic environment-aware config

GGUF Utilities (`api/gguf.py`)¶

Added:

GGMLType — enum for 30+ GGUF quantization types (Q2_K through IQ4_XS, F16, F32, BF16)
GGUFValueType — metadata value type enum
GGUFMetadata, GGUFTensorInfo, GGUFModelInfo — parsed model metadata dataclasses
quantize(input_path, output_path, quant_type) — quantize an existing GGUF
convert_hf_to_gguf(model_dir, output_path) — convert HuggingFace checkpoint to GGUF
merge_lora(base_model, lora_path, output_path) — merge LoRA adapter into base GGUF
generate_imatrix(model_path, dataset_path) — generate importance matrix for imatrix quants
gguf_report(model_path) — human-readable GGUF model report
report_model_suitability(model_path) — Kaggle T4 suitability check

NCCL Integration (`api/nccl.py`)¶

Added:

NCCLDataType enum — maps Python types to ncclDataType_t
NCCLResult enum — NCCL return codes
NCCLCommunicator — context manager wrapping a NCCL communicator
all_reduce(), broadcast(), reduce(), reduce_scatter() collective ops
Wraps libnccl.so.2 via ctypes (no PyTorch dependency)
is_nccl_available() / get_nccl_version() — NCCL detection
get_nccl_info() / print_nccl_info() — diagnostic helpers
setup_nccl_environment() — set recommended NCCL env vars
kaggle_nccl_config() — PCIe T4 preset config

Telemetry (`telemetry/`)¶

Added:

setup_telemetry() — initialize TracerProvider + MeterProvider with OTLP exporters
InferenceTracerProvider — GPU-aware tracer provider with inference span processors
InferenceTracer — helper for auto-populating gen_ai.* span attributes
GpuMetricsCollector — poll GPU utilization, memory, temperature, power via pynvml
PerformanceMonitor + PerformanceSnapshot / PerformanceReport — high-level monitoring context manager
InstrumentedLLMClient — auto-traced wrapper around LlamaCppClient
LlamaCppClientInstrumentor — OTel-style monkey-patching instrumentor
GraphistryTraceExporter — OTel SpanExporter writing traces to Graphistry graphs
semconv.py — 45 gen_ai.* attribute helpers: set_gen_ai_attr(), set_gen_ai_provider(), attr_name(), metric_name()
5 Gen AI histogram metrics: gen_ai.client.operation.duration, gen_ai.client.token.usage, gen_ai.server.request.duration, gen_ai.server.time_to_first_token, gen_ai.server.time_per_output_token
setup_otlp_env_from_kaggle_secrets() — load OTLP credentials from Kaggle secrets
is_otel_available() / is_graphistry_available() — optional dependency checks

Kaggle Module (`kaggle/`)¶

Added:

KaggleEnvironment — zero-boilerplate environment detection and setup for Kaggle notebooks
is_kaggle() — detect Kaggle runtime
quick_setup(hf_token, graphistry_token) — one-call environment initialization
GPU detection, binary verification, storage path resolution
KaggleSecrets — load HuggingFace and Graphistry credentials from Kaggle user secrets
split_gpu_session() context manager — dedicate GPU 0 to LLM inference, GPU 1 to visualization
ServerPreset — named server configuration presets for common Kaggle model sizes
TensorSplitMode — enum for Kaggle-specific split strategies
KagglePipelineConfig — full pipeline configuration (server, NCCL, OTLP, Graphistry)
GPUContext — per-GPU context management for split workflows

Inference Optimization (`inference/`)¶

Added:

FlashAttentionConfig — configure FlashAttention v2/v3 (enabled via flash_attn=True in load_model())
KVCache / PagedKVCache — KV cache configuration and paged memory management
ContinuousBatching — continuous batching settings for multi-slot serving

CUDA Optimization (`cuda/`)¶

Added:

CUDAGraph / GraphPool — capture and replay CUDA graphs for inference acceleration
Triton kernel wrappers for custom attention and matrix ops (triton_kernels.py)
TensorCoreConfig — configure TensorCore (FP16/BF16) usage

Quantization (`quantization/`)¶

Added:

NF4Quantizer — NF4 (4-bit NormalFloat) quantization for weight compression
GGUF conversion utilities (quantization/gguf.py) — wrapping llama-quantize
DynamicQuantizer — dynamic post-training quantization

Graphistry & RAPIDS (`graphistry/`)¶

Added:

GraphistryConnector — authenticate and upload graphs to Graphistry hub
RAPIDS cuGraph integration for GPU-accelerated graph algorithms
GraphWorkload — graph workload management for split-GPU setups
Graph builders and visualization helpers

Louie AI (`louie/`)¶

Added:

LouieClient — client for the Louie.ai natural language graph analysis service
natural_query() — query knowledge graphs with natural language
KnowledgeExtractor — extract structured knowledge graphs from LLM output

Unsloth Integration (`unsloth/`)¶

Added:

Model loader wrapping unsloth for efficient 4-bit fine-tuning
LoRA adapter application and merging
GGUF export pipeline: fine-tuned model → quantized GGUF → llamatelemetry deployment

Model Registry (`_internal/registry.py`)¶

Added:

30+ curated GGUF models in MODEL_REGISTRY, including:

Family	Sizes
Gemma 3 (Google)	1B, 4B, 12B (Q4_K_M, Q8_0)
Llama 3.1 / 3.2 (Meta)	3B, 8B, 70B (Q4_K_M, Q5_K_M)
Phi-3.5 / Phi-4 (Microsoft)	3.8B, 14B (Q4_K_M)
Mistral / Mixtral	7B, 8×7B (Q4_K_M)
Qwen 2.5 (Alibaba)	7B, 14B (Q4_K_M)
DeepSeek R1 Distill	1.5B, 7B (Q4_K_M)

Bootstrap (`_internal/bootstrap.py`)¶

Added:

Auto-download T4-optimized binary bundle (~961 MB) from HuggingFace on first import
GitHub fallback mirror for reliability
SHA256 integrity verification
CUDA compute capability check (requires SM 7.5+)
Platform detection: Kaggle, Colab, local Linux

C++/CUDA Extension (`csrc/`, `llamatelemetry_cpp`)¶

Added:

llamatelemetry_cpp pybind11 module compiled for CUDA 12.x, SM 7.5 (Tesla T4)
Device class — get_device_count(), get_device_properties(), set_device(), synchronize(), get_free_memory(), get_total_memory()
Tensor class — RAII tensor with shape, strides, dtype, device; .to(device), .cpu(), Tensor.zeros(), Tensor.ones(), Tensor.from_ptr()
DType enum — Float32, Float16, BFloat16, Int32, Int64, UInt8
matmul() — cuBLAS SGEMM (FP32) and HGEMM (FP16), batched variants
Static linking: cudart_static, cublas_static, cublasLt_static, culibos

Notebooks (18 tutorials)¶

Added:

#	Notebook	Track
01	Quickstart — llamatelemetry v0.1.1	Foundation
02	llama-server setup	Foundation
03	Multi-GPU inference	Foundation
04	GGUF quantization	Foundation
05	Unsloth integration	Integration
06	Split-GPU + Graphistry	Integration
07	Knowledge graph extraction	Integration
08	Document network analysis	Integration
09	Large models on Kaggle	Advanced
10	Complete workflow	Advanced
11	GGUF neural network visualization	Advanced
12	Attention mechanism explorer	Advanced
13	Token embedding visualizer	Advanced
14	OpenTelemetry LLM observability	Observability
15	Real-time performance monitoring	Observability
16	Production observability	Observability
17	llamatelemetry + W&B on Kaggle	Observability
18	OTel + Graphistry trace glue	Observability

Tests¶

Added:

246 passing tests, 24 skipped (CUDA/GPU-only paths)
test_llamatelemetry.py (12,458 lines) — imports, platform detection, GPU compat, binary download, server/engine lifecycle, metrics
test_new_apis.py (7,496 lines) — quantization, Unsloth, CUDA, and inference API coverage
test_tensor_api.py (5,022 lines) — C++ extension: Device, Tensor, matmul, memory management
test_gguf_parser.py (9,438 lines) — GGUF format parser correctness
test_full_workflow.py (1,344 lines) — end-to-end with a real model binary
test_end_to_end.py (3,087 lines) — end-to-end inference test

Documentation¶

Added:

Full MkDocs Material documentation site: llamatelemetry.github.io
Get Started section: Overview, Installation, Quickstart, Kaggle Quickstart
14 Guides: Inference Engine, Server Management, Model Management, API Client, Telemetry, Kaggle, Examples, Graphistry/RAPIDS, Quantization, Unsloth, CUDA Optimizations, Jupyter Workflows, Louie Knowledge Graphs, Troubleshooting
13 API Reference pages covering all public APIs
Notebook Hub with 18 categorized tutorials
Project section: Architecture, File Map, Release Artifacts, FAQ, Contributing, Changelog

Dependencies (v0.1.1)¶

Core (always installed):

numpy>=1.24
requests>=2.28
huggingface_hub>=0.20
tqdm>=4.64
opentelemetry-api>=1.20.0
opentelemetry-sdk>=1.20.0

Optional (install extras as needed):

opentelemetry-exporter-otlp-proto-http  # OTLP HTTP trace/metrics export
opentelemetry-semantic-conventions      # Gen AI semconv attributes
pygraphistry                            # Graphistry visualization
pandas, matplotlib, scikit-learn        # Data analysis
ipywidgets                              # Jupyter chat widget
torch                                   # Unsloth integration
pynvml                                  # GPU metrics collection
sseclient                               # SSE streaming support

Changelog¶

v0.1.1 — 2026-02-02¶

Highlights¶

Core Package (llamatelemetry/__init__.py)¶

Server Management (server.py)¶

API Client (api/client.py)¶

Multi-GPU (api/multigpu.py)¶

GGUF Utilities (api/gguf.py)¶

NCCL Integration (api/nccl.py)¶

Telemetry (telemetry/)¶

Kaggle Module (kaggle/)¶

Inference Optimization (inference/)¶

CUDA Optimization (cuda/)¶

Quantization (quantization/)¶

Graphistry & RAPIDS (graphistry/)¶

Louie AI (louie/)¶

Unsloth Integration (unsloth/)¶

Model Registry (_internal/registry.py)¶

Bootstrap (_internal/bootstrap.py)¶

C++/CUDA Extension (csrc/, llamatelemetry_cpp)¶

Notebooks (18 tutorials)¶

Tests¶

Documentation¶

Dependencies (v0.1.1)¶

Core Package (`llamatelemetry/init.py`)¶

Server Management (`server.py`)¶

API Client (`api/client.py`)¶

Multi-GPU (`api/multigpu.py`)¶

GGUF Utilities (`api/gguf.py`)¶

NCCL Integration (`api/nccl.py`)¶

Telemetry (`telemetry/`)¶

Kaggle Module (`kaggle/`)¶

Inference Optimization (`inference/`)¶

CUDA Optimization (`cuda/`)¶

Quantization (`quantization/`)¶

Graphistry & RAPIDS (`graphistry/`)¶

Louie AI (`louie/`)¶

Unsloth Integration (`unsloth/`)¶

Model Registry (`_internal/registry.py`)¶

Bootstrap (`_internal/bootstrap.py`)¶

C++/CUDA Extension (`csrc/`, `llamatelemetry_cpp`)¶