API Reference¶
Complete API reference for the llamatelemetry Python SDK v0.1.1. This section documents
every public class, function, dataclass, and enum across all 17 modules.
Module Map¶
| Module | Description | Reference Page |
|---|---|---|
| llamatelemetry | High-level InferenceEngine, InferResult, and top-level helpers | Core API |
| llamatelemetry.server | ServerManager, ModelInfo, ModelManager, SmartModelDownloader | Server & Models |
| llamatelemetry.api.client | LlamaCppClient, sub-APIs, ChatEngine, ConversationManager, EmbeddingEngine, SemanticSearch | Client API |
| llamatelemetry.api.gguf | GGUF parsing, validation, quantization matrix, GGUFReader, enums | GGUF API |
| llamatelemetry.api.multigpu | MultiGPUConfig, SplitMode, GPU detection and presets | Multi-GPU & NCCL |
| llamatelemetry.api.nccl | NCCLConfig, NCCLCommunicator, collective operations | Multi-GPU & NCCL |
| llamatelemetry.telemetry | OpenTelemetry setup, InstrumentedLlamaCppClient, PerformanceMonitor, GpuMetricsCollector, GraphistryTraceExporter | Telemetry API |
| llamatelemetry.kaggle | KaggleEnvironment, ServerPreset, pipeline config, secrets | Kaggle API |
| llamatelemetry.graphistry | GraphistrySession, GraphistryBuilders, GraphistryViz, RAPIDSBackend, SplitGPUManager | Graphistry API |
| llamatelemetry.quantization | NF4Quantizer, GGUFConverter, DynamicQuantizer, strategies | Quantization & Unsloth |
| llamatelemetry.unsloth | UnslothModelLoader, UnslothExporter, LoRAAdapter | Quantization & Unsloth |
| llamatelemetry.cuda | CUDAGraph, TritonKernel, TensorCoreConfig | CUDA & Inference |
| llamatelemetry.inference | FlashAttentionConfig, KVCache, ContinuousBatching, BatchInferenceOptimizer | CUDA & Inference |
| llamatelemetry.jupyter | Jupyter widgets, streaming helpers, visualization | Jupyter, Chat & Embeddings |
| llamatelemetry.chat | ChatEngine, ConversationManager, Message | Jupyter, Chat & Embeddings |
| llamatelemetry.embeddings | EmbeddingEngine, SemanticSearch, TextClustering, similarity functions | Jupyter, Chat & Embeddings |
| llamatelemetry.louie | LouieClient, KnowledgeExtractor, KnowledgeGraph, entity/relation enums | Louie API |
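The modules above are imported as ordinary Python packages. Since llamatelemetry may not be installed in every environment, a guarded import that matches the graceful-degradation rule in the Conventions section below is the safe pattern; the specific class imported here is taken from the module map, but treating it this way is an illustrative sketch, not a documented requirement.

```python
# Guarded import of one module from the map above. If the SDK (or one of
# its optional extras) is missing, the feature is disabled rather than
# raising at import time.
try:
    from llamatelemetry.server import ServerManager  # name from the module map
except ImportError:
    ServerManager = None  # feature disabled when the package is absent

print("server management available:", ServerManager is not None)
```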
Quick Links by Category¶
Core¶
- InferenceEngine -- primary entry point for all inference
- InferResult -- result wrapper returned by all inference methods
- ServerManager -- llama-server process lifecycle
- LlamaCppClient -- low-level HTTP client for llama-server
Model Management¶
- ModelInfo -- GGUF metadata parser
- ModelManager -- scan and filter local models
- SmartModelDownloader -- VRAM-aware download
- GGUFReader -- memory-mapped GGUF file reader
Multi-GPU¶
- MultiGPUConfig -- GPU split configuration
- NCCLCommunicator -- NCCL collective operations
Observability¶
- setup_grafana_otlp -- one-call telemetry setup
- PerformanceMonitor -- real-time performance tracking
- GpuMetricsCollector -- GPU metric collection
- GraphistryTraceExporter -- trace visualization
Environments¶
- KaggleEnvironment -- Kaggle platform detection
- ServerPreset -- environment-specific presets
Graph & Knowledge¶
- GraphistrySession -- Graphistry connection
- GraphistryBuilders -- graph construction helpers
- LouieClient -- natural language graph queries
- KnowledgeExtractor -- entity/relationship extraction
Quantization & Fine-tuning¶
- NF4Quantizer -- NormalFloat4 quantization
- GGUFConverter -- model-to-GGUF conversion
- DynamicQuantizer -- VRAM-aware quantization
- UnslothModelLoader -- Unsloth model loading
- LoRAAdapter -- LoRA adapter management
CUDA & Inference¶
- CUDAGraph -- CUDA graph capture and replay
- FlashAttentionConfig -- flash attention setup
- KVCache -- KV cache management
- ContinuousBatching -- continuous batching
Interactive¶
- ChatEngine -- multi-turn conversation
- EmbeddingEngine -- text embeddings
- SemanticSearch -- vector similarity search
Conventions¶
- Classes that implement __enter__/__exit__ support Python context managers; this is noted in each class's documentation.
- Optional dependencies degrade gracefully: import errors are caught and the affected features are disabled.
- GPU operations require CUDA 12.x and a compute capability >= 7.5 device (e.g., Tesla T4).
- All timeout parameters are in seconds unless otherwise noted.
- All **kwargs are forwarded to the underlying llama-server HTTP API.
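The context-manager and **kwargs-forwarding conventions can be sketched with a stub class. StubClient and its complete method are illustrative stand-ins, not part of the llamatelemetry API; the sketch only shows the shape of the two conventions.

```python
# Minimal sketch of the conventions above, using a stub in place of the
# real SDK classes (names here are hypothetical, not the actual API).
class StubClient:
    """Mimics the context-manager and **kwargs-forwarding conventions."""

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.closed = True  # resources released on context exit
        return False

    def complete(self, prompt, **kwargs):
        # In the real SDK, extra keyword arguments would be forwarded to
        # the llama-server HTTP API as request-body fields.
        return {"prompt": prompt, **kwargs}


with StubClient() as client:
    payload = client.complete("hello", temperature=0.2, timeout=30)
print(payload)  # → {'prompt': 'hello', 'temperature': 0.2, 'timeout': 30}
```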
Version¶
This reference documents llamatelemetry v0.1.1 (released 2026-02-02).