# Changelog

All notable changes to llamatelemetry are documented in this file.
For the authoritative source-level changelog, see the GitHub repository: github.com/llamatelemetry/llamatelemetry/blob/main/CHANGELOG.md
## v0.1.1 — 2026-02-02

Initial public release of the llamatelemetry Python SDK. This release establishes the full foundation: a CUDA-first OpenTelemetry SDK for LLM inference observability, using llama.cpp as the backend.
### Highlights

- High-level `InferenceEngine` for zero-boilerplate GGUF inference on CUDA GPUs
- Auto-download of the T4-optimized llama.cpp binary bundle (~961 MB) on first import
- OpenTelemetry tracing and GPU metrics with 45 `gen_ai.*` semantic convention attributes
- Kaggle T4 x2 presets for split-GPU inference workflows
- Graphistry + RAPIDS cuGraph visualization integration
- 18 Jupyter tutorial notebooks spanning foundation to observability workflows
- 246 passing tests across 7 test files
- MIT License
### Core Package (`llamatelemetry/__init__.py`)

Added:

- `InferenceEngine` — high-level interface for the LLM inference lifecycle
- `load_model(model_name_or_path, ...)` — load GGUF models from registry, local path, or HuggingFace
- `infer(prompt, max_tokens, temperature, ...)` → `InferResult` — single-shot inference
- `generate(prompt, ...)` — alias for `infer()`
- `infer_stream(prompt, ...)` — streaming SSE inference with token callback
- `batch_infer(prompts, ...)` — parallel batch processing
- `get_metrics()` / `reset_metrics()` — wall-clock performance tracking
- `unload_model()` — explicit model cleanup
- Full context manager support (`with InferenceEngine() as engine:`)
- Optional OpenTelemetry integration via `enable_telemetry=True`
- `InferResult` — result dataclass with `.success`, `.text`, `.tokens_generated`, `.latency_ms`, `.tokens_per_sec`, `.error_message`
- `is_cuda_available()` — check CUDA availability at runtime
- `get_cuda_device_count()` — count available CUDA devices
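The relationship between the `InferResult` fields is easy to see in miniature: throughput is derived from the token count and wall-clock latency. The sketch below is a hypothetical re-implementation for illustration, not the SDK's actual dataclass.

```python
from dataclasses import dataclass


@dataclass
class InferResultSketch:
    """Hypothetical mirror of InferResult's core fields (illustration only)."""
    success: bool
    text: str
    tokens_generated: int
    latency_ms: float
    error_message: str = ""

    @property
    def tokens_per_sec(self) -> float:
        # Throughput = tokens / (latency in seconds); guard against zero latency.
        if self.latency_ms <= 0:
            return 0.0
        return self.tokens_generated * 1000.0 / self.latency_ms


r = InferResultSketch(success=True, text="hi", tokens_generated=64, latency_ms=800.0)
print(round(r.tokens_per_sec, 1))  # 80.0
```

The same arithmetic underlies any `get_metrics()`-style report: tokens/sec is not measured directly but computed from counters and timestamps.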
### Server Management (`server.py`)

Added:

- `ServerManager` — complete lifecycle manager for the `llama-server` binary
- Start, stop, health-check, and restart the HTTP server process
- Configurable: `host`, `port`, `gpu_layers`, `ctx_size`, `batch_size`, `ubatch_size`
- Multi-GPU: `split_mode`, `main_gpu`, `tensor_split`, `n_parallel`
- Performance: `flash_attn`, `cont_batching`, `mlock`, `no_mmap`
- Auto-download of the CUDA binary bundle if not present (via `_internal.bootstrap`)
- Context manager support for clean startup/shutdown
- Prometheus `/metrics` endpoint polling
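A server lifecycle manager like this typically starts the process, then polls a health endpoint until it responds. The generic pattern can be sketched as below; this is an illustrative poll loop, not `ServerManager`'s actual code, and `probe` stands in for an HTTP GET against llama-server's `/health` endpoint.

```python
import time


def wait_until_healthy(probe, timeout_s: float = 30.0, interval_s: float = 0.5) -> bool:
    """Poll `probe()` (e.g. an HTTP GET on /health) until it returns True
    or the timeout elapses. Returns True on success, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            if probe():
                return True
        except OSError:
            pass  # server process up but not accepting connections yet
        time.sleep(interval_s)
    return False


# Simulate a server that becomes healthy on the third probe.
calls = {"n": 0}

def fake_probe() -> bool:
    calls["n"] += 1
    return calls["n"] >= 3

print(wait_until_healthy(fake_probe, timeout_s=5.0, interval_s=0.01))  # True
```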
### API Client (`api/client.py`)

Added:

- `LlamaCppClient` — full HTTP client for the llama-server REST API
- Lazy sub-clients via properties: `.chat`, `.embeddings`, `.models`, `.slots`, `.lora`
- `ChatCompletionsAPI` — OpenAI-compatible `POST /v1/chat/completions` with streaming
- `EmbeddingsClientAPI` — `POST /v1/embeddings`, batch support
- `ModelsClientAPI` — `GET /v1/models`, model metadata
- `SlotsClientAPI` — multi-slot KV cache management (`GET /slots`, `POST /slots/:id/save`, etc.)
- `LoraClientAPI` — LoRA adapter hot-swap (`GET /lora-adapters`, `POST /lora-adapters`)
- Native completion with 20+ sampling parameters: Mirostat, DRY, XTC, dynamic temperature, penalty ranges
- Grammar-guided generation (`grammar`, `json_schema`)
- SSE streaming with `sseclient`
- Full response dataclasses: `Message`, `Choice`, `Usage`, `Timings`, `CompletionResponse`, `TokenizeResponse`, etc.
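Streaming responses arrive as Server-Sent Events: each chunk is a `data:` line carrying a JSON payload, with the stream terminated by `data: [DONE]` in the OpenAI-compatible mode. A minimal parser for that wire format looks roughly like this (a sketch of the protocol, not the client's internals):

```python
import json


def iter_sse_json(lines):
    """Yield parsed JSON payloads from an iterable of SSE lines,
    stopping at the OpenAI-style 'data: [DONE]' terminator."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alives, comments, and event names
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        yield json.loads(payload)


stream = [
    'data: {"content": "Hel"}',
    '',
    'data: {"content": "lo"}',
    'data: [DONE]',
]
text = "".join(chunk["content"] for chunk in iter_sse_json(stream))
print(text)  # Hello
```

In practice a library such as `sseclient` handles reconnection and multi-line events; the parsing core is the same.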
### Multi-GPU (`api/multigpu.py`)

Added:

- `SplitMode` enum — `NONE`, `LAYER`, `ROW`
- `GPUInfo` dataclass — name, VRAM, compute capability per device
- `MultiGPUConfig` dataclass — unified multi-GPU config for `InferenceEngine` and `ServerManager`
- `detect_gpus()` — enumerate CUDA devices
- `gpu_count()` — count available GPUs
- `get_cuda_version()` — CUDA runtime version
- `get_total_vram()` / `get_free_vram()` — VRAM accounting
- `estimate_model_vram()` — VRAM estimate from model size and quantization
- `can_fit_model()` — VRAM sufficiency check
- `recommend_quantization()` — auto-select GGUF quant type for available VRAM
- `kaggle_t4_dual_config()` — pre-tuned dual-T4 split-layer config
- `colab_t4_single_config()` — single-T4 config
- `auto_config()` — automatic environment-aware config
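A VRAM estimator of the `estimate_model_vram()` / `can_fit_model()` kind usually reduces to a rule of thumb: parameter count times quantized bits per weight, plus a margin for the KV cache and CUDA buffers. The function below is an illustrative version of that rule (the SDK's actual formula and overhead factor may differ).

```python
def estimate_vram_bytes(n_params: float, bits_per_weight: float,
                        overhead: float = 1.2) -> int:
    """Rule-of-thumb VRAM estimate: weights at the quantized bit width,
    plus ~20% headroom for KV cache and CUDA buffers. Illustrative only."""
    return int(n_params * bits_per_weight / 8 * overhead)


# An 8B model at Q4_K_M (~4.5 bits/weight) against a 16 GB T4:
need = estimate_vram_bytes(8e9, 4.5)
t4_vram = 16 * 1024**3
print(need < t4_vram)  # True
```

The same arithmetic, run in reverse, drives a `recommend_quantization()`-style helper: pick the largest bit width whose estimate fits the free VRAM.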
### GGUF Utilities (`api/gguf.py`)

Added:

- `GGMLType` — enum for 30+ GGUF quantization types (`Q2_K` through `IQ4_XS`, `F16`, `F32`, `BF16`)
- `GGUFValueType` — metadata value type enum
- `GGUFMetadata`, `GGUFTensorInfo`, `GGUFModelInfo` — parsed model metadata dataclasses
- `quantize(input_path, output_path, quant_type)` — quantize an existing GGUF
- `convert_hf_to_gguf(model_dir, output_path)` — convert HuggingFace checkpoint to GGUF
- `merge_lora(base_model, lora_path, output_path)` — merge LoRA adapter into base GGUF
- `generate_imatrix(model_path, dataset_path)` — generate importance matrix for imatrix quants
- `gguf_report(model_path)` — human-readable GGUF model report
- `report_model_suitability(model_path)` — Kaggle T4 suitability check
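A GGUF parser starts with the file's fixed preamble: the 4-byte magic `GGUF`, a little-endian `uint32` version, then `uint64` tensor and metadata-KV counts. The snippet below parses just that preamble from a byte string; it sketches the format, not the SDK's parser.

```python
import struct

GGUF_MAGIC = b"GGUF"


def read_gguf_header(blob: bytes):
    """Parse the fixed GGUF preamble: 4-byte magic, then little-endian
    uint32 version, uint64 tensor count, uint64 metadata KV count."""
    if blob[:4] != GGUF_MAGIC:
        raise ValueError("not a GGUF file")
    version, n_tensors, n_kv = struct.unpack_from("<IQQ", blob, 4)
    return version, n_tensors, n_kv


# Synthetic header: version 3, 291 tensors, 24 metadata keys.
header = GGUF_MAGIC + struct.pack("<IQQ", 3, 291, 24)
print(read_gguf_header(header))  # (3, 291, 24)
```

Everything after the preamble — metadata key/value pairs, then tensor descriptors — is what the `GGUFMetadata` and `GGUFTensorInfo` dataclasses capture.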
### NCCL Integration (`api/nccl.py`)

Added:

- `NCCLDataType` enum — maps Python types to `ncclDataType_t`
- `NCCLResult` enum — NCCL return codes
- `NCCLCommunicator` — context manager wrapping an NCCL communicator
- `all_reduce()`, `broadcast()`, `reduce()`, `reduce_scatter()` collective ops
- Wraps `libnccl.so.2` via ctypes (no PyTorch dependency)
- `is_nccl_available()` / `get_nccl_version()` — NCCL detection
- `get_nccl_info()` / `print_nccl_info()` — diagnostic helpers
- `setup_nccl_environment()` — set recommended NCCL env vars
- `kaggle_nccl_config()` — PCIe T4 preset config
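Detecting NCCL without PyTorch comes down to loading the shared library with `ctypes` and calling `ncclGetVersion`. The sketch below shows the pattern (an illustration of the approach, not the SDK's actual `is_nccl_available()` implementation); it returns `None` gracefully on machines without NCCL.

```python
import ctypes


def nccl_version_or_none():
    """Try to load libnccl.so.2 and query ncclGetVersion via ctypes.
    Returns the integer version code, or None if NCCL is absent."""
    try:
        lib = ctypes.CDLL("libnccl.so.2")
    except OSError:
        return None  # library not installed or not on the loader path
    version = ctypes.c_int(0)
    # ncclGetVersion(int* version) -> ncclResult_t (0 == ncclSuccess)
    if lib.ncclGetVersion(ctypes.byref(version)) != 0:
        return None
    return version.value


v = nccl_version_or_none()
print(v if v is not None else "NCCL not installed")
```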
### Telemetry (`telemetry/`)

Added:

- `setup_telemetry()` — initialize `TracerProvider` + `MeterProvider` with OTLP exporters
- `InferenceTracerProvider` — GPU-aware tracer provider with inference span processors
- `InferenceTracer` — helper for auto-populating `gen_ai.*` span attributes
- `GpuMetricsCollector` — poll GPU utilization, memory, temperature, power via `pynvml`
- `PerformanceMonitor` + `PerformanceSnapshot` / `PerformanceReport` — high-level monitoring context manager
- `InstrumentedLLMClient` — auto-traced wrapper around `LlamaCppClient`
- `LlamaCppClientInstrumentor` — OTel-style monkey-patching instrumentor
- `GraphistryTraceExporter` — OTel `SpanExporter` writing traces to Graphistry graphs
- `semconv.py` — 45 `gen_ai.*` attribute helpers: `set_gen_ai_attr()`, `set_gen_ai_provider()`, `attr_name()`, `metric_name()`
- 5 Gen AI histogram metrics: `gen_ai.client.operation.duration`, `gen_ai.client.token.usage`, `gen_ai.server.request.duration`, `gen_ai.server.time_to_first_token`, `gen_ai.server.time_per_output_token`
- `setup_otlp_env_from_kaggle_secrets()` — load OTLP credentials from Kaggle secrets
- `is_otel_available()` / `is_graphistry_available()` — optional dependency checks
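Helpers like `attr_name()` and `set_gen_ai_attr()` are, at heart, prefixing utilities over the `gen_ai.` namespace. A hypothetical version is shown below (the SDK's actual signatures may differ); a plain dict stands in for an OTel span's attribute mapping.

```python
GEN_AI_PREFIX = "gen_ai"


def attr_name(suffix: str) -> str:
    """Build a fully qualified gen_ai.* attribute name:
    'request.model' -> 'gen_ai.request.model'. Qualified names pass through."""
    if suffix.startswith(GEN_AI_PREFIX + "."):
        return suffix
    return f"{GEN_AI_PREFIX}.{suffix}"


def set_gen_ai_attr(span_attrs: dict, suffix: str, value) -> None:
    """Record a gen_ai.* attribute on a span's attribute mapping."""
    span_attrs[attr_name(suffix)] = value


attrs: dict = {}
set_gen_ai_attr(attrs, "request.model", "gemma-3-4b")
set_gen_ai_attr(attrs, "usage.output_tokens", 128)
print(attrs)  # {'gen_ai.request.model': 'gemma-3-4b', 'gen_ai.usage.output_tokens': 128}
```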
### Kaggle Module (`kaggle/`)

Added:

- `KaggleEnvironment` — zero-boilerplate environment detection and setup for Kaggle notebooks
- `is_kaggle()` — detect Kaggle runtime
- `quick_setup(hf_token, graphistry_token)` — one-call environment initialization
- GPU detection, binary verification, storage path resolution
- `KaggleSecrets` — load HuggingFace and Graphistry credentials from Kaggle user secrets
- `split_gpu_session()` context manager — dedicate GPU 0 to LLM inference, GPU 1 to visualization
- `ServerPreset` — named server configuration presets for common Kaggle model sizes
- `TensorSplitMode` — enum for Kaggle-specific split strategies
- `KagglePipelineConfig` — full pipeline configuration (server, NCCL, OTLP, Graphistry)
- `GPUContext` — per-GPU context management for split workflows
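Runtime detection of the `is_kaggle()` kind typically checks environment markers the Kaggle kernel sets. The heuristic below is one common approach (an assumption about the detection strategy, not the SDK's actual check): look for the `KAGGLE_KERNEL_RUN_TYPE` environment variable or the mounted `/kaggle/working` tree.

```python
import os


def looks_like_kaggle() -> bool:
    """Heuristic Kaggle detection: the kernel runtime sets
    KAGGLE_KERNEL_RUN_TYPE and mounts the /kaggle working tree."""
    return ("KAGGLE_KERNEL_RUN_TYPE" in os.environ
            or os.path.isdir("/kaggle/working"))


print(looks_like_kaggle())  # False outside Kaggle, True inside a kernel
```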
### Inference Optimization (`inference/`)

Added:

- `FlashAttentionConfig` — configure FlashAttention v2/v3 (enabled via `flash_attn=True` in `load_model()`)
- `KVCache` / `PagedKVCache` — KV cache configuration and paged memory management
- `ContinuousBatching` — continuous batching settings for multi-slot serving
### CUDA Optimization (`cuda/`)

Added:

- `CUDAGraph` / `GraphPool` — capture and replay CUDA graphs for inference acceleration
- Triton kernel wrappers for custom attention and matrix ops (`triton_kernels.py`)
- `TensorCoreConfig` — configure TensorCore (FP16/BF16) usage
### Quantization (`quantization/`)

Added:

- `NF4Quantizer` — NF4 (4-bit NormalFloat) quantization for weight compression
- GGUF conversion utilities (`quantization/gguf.py`) — wrapping `llama-quantize`
- `DynamicQuantizer` — dynamic post-training quantization
### Graphistry & RAPIDS (`graphistry/`)

Added:

- `GraphistryConnector` — authenticate and upload graphs to Graphistry hub
- RAPIDS `cuGraph` integration for GPU-accelerated graph algorithms
- `GraphWorkload` — graph workload management for split-GPU setups
- Graph builders and visualization helpers
### Louie AI (`louie/`)

Added:

- `LouieClient` — client for the Louie.ai natural language graph analysis service
- `natural_query()` — query knowledge graphs with natural language
- `KnowledgeExtractor` — extract structured knowledge graphs from LLM output
### Unsloth Integration (`unsloth/`)

Added:

- Model loader wrapping `unsloth` for efficient 4-bit fine-tuning
- LoRA adapter application and merging
- GGUF export pipeline: fine-tuned model → quantized GGUF → llamatelemetry deployment
### Model Registry (`_internal/registry.py`)

Added:

30+ curated GGUF models in `MODEL_REGISTRY`, including:
| Family | Sizes |
|---|---|
| Gemma 3 (Google) | 1B, 4B, 12B (Q4_K_M, Q8_0) |
| Llama 3.1 / 3.2 (Meta) | 3B, 8B, 70B (Q4_K_M, Q5_K_M) |
| Phi-3.5 / Phi-4 (Microsoft) | 3.8B, 14B (Q4_K_M) |
| Mistral / Mixtral | 7B, 8×7B (Q4_K_M) |
| Qwen 2.5 (Alibaba) | 7B, 14B (Q4_K_M) |
| DeepSeek R1 Distill | 1.5B, 7B (Q4_K_M) |
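A registry like this is, conceptually, a mapping from a short alias to a download location. The sketch below uses a hypothetical shape with placeholder repo names; the SDK's actual `MODEL_REGISTRY` layout, keys, and entries may differ.

```python
# Hypothetical registry shape: alias -> (HF repo, GGUF filename).
# Repo names below are placeholders, not the SDK's actual entries.
MODEL_REGISTRY_SKETCH = {
    "gemma-3-4b-q4": ("example-org/gemma-3-4b-GGUF", "gemma-3-4b-Q4_K_M.gguf"),
    "llama-3.2-3b-q4": ("example-org/Llama-3.2-3B-GGUF", "Llama-3.2-3B-Q4_K_M.gguf"),
}


def resolve(name: str):
    """Return (repo, filename) for a registry alias, raising on unknown names."""
    try:
        return MODEL_REGISTRY_SKETCH[name]
    except KeyError:
        raise KeyError(f"unknown model {name!r}; known: {sorted(MODEL_REGISTRY_SKETCH)}")


repo, fname = resolve("gemma-3-4b-q4")
print(fname)  # gemma-3-4b-Q4_K_M.gguf
```

With this shape, `load_model("gemma-3-4b-q4")` can fall back to a local path or HuggingFace repo ID when the alias is not found, which matches the three lookup sources listed for `load_model()` above.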
### Bootstrap (`_internal/bootstrap.py`)

Added:
- Auto-download T4-optimized binary bundle (~961 MB) from HuggingFace on first import
- GitHub fallback mirror for reliability
- SHA256 integrity verification
- CUDA compute capability check (requires SM 7.5+)
- Platform detection: Kaggle, Colab, local Linux
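The integrity-verification step is standard: hash the downloaded bundle and compare against a published digest before trusting it. A minimal sketch with `hashlib` (illustrating the check, not the bootstrap module's actual code):

```python
import hashlib


def sha256_ok(data: bytes, expected_hex: str, chunk_size: int = 1 << 20) -> bool:
    """Verify a downloaded blob against its published SHA256 digest,
    hashing in chunks as a downloader would for a ~961 MB bundle."""
    h = hashlib.sha256()
    for start in range(0, len(data), chunk_size):
        h.update(data[start:start + chunk_size])
    return h.hexdigest() == expected_hex


blob = b"llama-server binary bundle (stand-in bytes)"
digest = hashlib.sha256(blob).hexdigest()
print(sha256_ok(blob, digest))         # True
print(sha256_ok(blob + b"x", digest))  # False
```

Chunked hashing matters at this bundle size: it keeps memory flat and lets the check run while streaming from the fallback mirror.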
### C++/CUDA Extension (`csrc/`, `llamatelemetry_cpp`)

Added:

- `llamatelemetry_cpp` pybind11 module compiled for CUDA 12.x, SM 7.5 (Tesla T4)
- `Device` class — `get_device_count()`, `get_device_properties()`, `set_device()`, `synchronize()`, `get_free_memory()`, `get_total_memory()`
- `Tensor` class — RAII tensor with shape, strides, dtype, device; `.to(device)`, `.cpu()`, `Tensor.zeros()`, `Tensor.ones()`, `Tensor.from_ptr()`
- `DType` enum — `Float32`, `Float16`, `BFloat16`, `Int32`, `Int64`, `UInt8`
- `matmul()` — cuBLAS SGEMM (FP32) and HGEMM (FP16), batched variants
- Static linking: `cudart_static`, `cublas_static`, `cublasLt_static`, `culibos`
### Notebooks (18 tutorials)

Added:
| # | Notebook | Track |
|---|---|---|
| 01 | Quickstart — llamatelemetry v0.1.1 | Foundation |
| 02 | llama-server setup | Foundation |
| 03 | Multi-GPU inference | Foundation |
| 04 | GGUF quantization | Foundation |
| 05 | Unsloth integration | Integration |
| 06 | Split-GPU + Graphistry | Integration |
| 07 | Knowledge graph extraction | Integration |
| 08 | Document network analysis | Integration |
| 09 | Large models on Kaggle | Advanced |
| 10 | Complete workflow | Advanced |
| 11 | GGUF neural network visualization | Advanced |
| 12 | Attention mechanism explorer | Advanced |
| 13 | Token embedding visualizer | Advanced |
| 14 | OpenTelemetry LLM observability | Observability |
| 15 | Real-time performance monitoring | Observability |
| 16 | Production observability | Observability |
| 17 | llamatelemetry + W&B on Kaggle | Observability |
| 18 | OTel + Graphistry trace glue | Observability |
### Tests

Added:

- 246 passing tests, 24 skipped (CUDA/GPU-only paths)
- `test_llamatelemetry.py` (12,458 lines) — imports, platform detection, GPU compat, binary download, server/engine lifecycle, metrics
- `test_new_apis.py` (7,496 lines) — quantization, Unsloth, CUDA, and inference API coverage
- `test_tensor_api.py` (5,022 lines) — C++ extension: Device, Tensor, matmul, memory management
- `test_gguf_parser.py` (9,438 lines) — GGUF format parser correctness
- `test_full_workflow.py` (1,344 lines) — end-to-end with a real model binary
- `test_end_to_end.py` (3,087 lines) — end-to-end inference test
### Documentation

Added:
- Full MkDocs Material documentation site: llamatelemetry.github.io
- Get Started section: Overview, Installation, Quickstart, Kaggle Quickstart
- 14 Guides: Inference Engine, Server Management, Model Management, API Client, Telemetry, Kaggle, Examples, Graphistry/RAPIDS, Quantization, Unsloth, CUDA Optimizations, Jupyter Workflows, Louie Knowledge Graphs, Troubleshooting
- 13 API Reference pages covering all public APIs
- Notebook Hub with 18 categorized tutorials
- Project section: Architecture, File Map, Release Artifacts, FAQ, Contributing, Changelog
### Dependencies (v0.1.1)

Core (always installed):

- `numpy>=1.24`
- `requests>=2.28`
- `huggingface_hub>=0.20`
- `tqdm>=4.64`
- `opentelemetry-api>=1.20.0`
- `opentelemetry-sdk>=1.20.0`
Optional (install extras as needed):

- `opentelemetry-exporter-otlp-proto-http` — OTLP HTTP trace/metrics export
- `opentelemetry-semantic-conventions` — Gen AI semconv attributes
- `pygraphistry` — Graphistry visualization
- `pandas`, `matplotlib`, `scikit-learn` — data analysis
- `ipywidgets` — Jupyter chat widget
- `torch` — Unsloth integration
- `pynvml` — GPU metrics collection
- `sseclient` — SSE streaming support