API Reference¶
Complete API reference for the llamatelemetry Python SDK v0.1.1. This section documents
every public class, function, dataclass, and enum across all 17 modules.
Module Map¶
| Module | Description | Reference Page |
|---|---|---|
| llamatelemetry | High-level InferenceEngine, InferResult, and top-level helpers | Core API |
| llamatelemetry.server | ServerManager, ModelInfo, ModelManager, SmartModelDownloader | Server & Models |
| llamatelemetry.api.client | LlamaCppClient, sub-APIs, ChatEngine, ConversationManager, EmbeddingEngine, SemanticSearch | Client API |
| llamatelemetry.api.gguf | GGUF parsing, validation, quantization matrix, GGUFReader, enums | GGUF API |
| llamatelemetry.api.multigpu | MultiGPUConfig, SplitMode, GPU detection and presets | Multi-GPU & NCCL |
| llamatelemetry.api.nccl | NCCLConfig, NCCLCommunicator, collective operations | Multi-GPU & NCCL |
| llamatelemetry.telemetry | OpenTelemetry setup, InstrumentedLlamaCppClient, PerformanceMonitor, GpuMetricsCollector, GraphistryTraceExporter | Telemetry API |
| llamatelemetry.kaggle | KaggleEnvironment, ServerPreset, pipeline config, secrets | Kaggle API |
| llamatelemetry.graphistry | GraphistrySession, GraphistryBuilders, GraphistryViz, RAPIDSBackend, SplitGPUManager | Graphistry API |
| llamatelemetry.quantization | NF4Quantizer, GGUFConverter, DynamicQuantizer, strategies | Quantization & Unsloth |
| llamatelemetry.unsloth | UnslothModelLoader, UnslothExporter, LoRAAdapter | Quantization & Unsloth |
| llamatelemetry.cuda | CUDAGraph, TritonKernel, TensorCoreConfig | CUDA & Inference |
| llamatelemetry.inference | FlashAttentionConfig, KVCache, ContinuousBatching, BatchInferenceOptimizer | CUDA & Inference |
| llamatelemetry.jupyter | Jupyter widgets, streaming helpers, visualization | Jupyter, Chat & Embeddings |
| llamatelemetry.chat | ChatEngine, ConversationManager, Message | Jupyter, Chat & Embeddings |
| llamatelemetry.embeddings | EmbeddingEngine, SemanticSearch, TextClustering, similarity functions | Jupyter, Chat & Embeddings |
| llamatelemetry.louie | LouieClient, KnowledgeExtractor, KnowledgeGraph, entity/relation enums | Louie API |
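The modules above are imported as ordinary Python packages. Since llamatelemetry may not be installed in every environment, a guarded import that matches the graceful-degradation rule in the Conventions section below is the safe pattern; the specific class imported here is taken from the module map, but treating it this way is an illustrative sketch, not a documented requirement.

```python
# Guarded import of one module from the map above. If the SDK (or one of
# its optional extras) is missing, the feature is disabled rather than
# raising at import time.
try:
    from llamatelemetry.server import ServerManager  # name from the module map
except ImportError:
    ServerManager = None  # feature disabled when the package is absent

print("server management available:", ServerManager is not None)
```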
Quick Links by Category¶
Core¶
- InferenceEngine -- primary entry point for all inference
- InferResult -- result wrapper returned by all inference methods
- ServerManager -- llama-server process lifecycle
- LlamaCppClient -- low-level HTTP client for llama-server
Model Management¶
- ModelInfo -- GGUF metadata parser
- ModelManager -- scan and filter local models
- SmartModelDownloader -- VRAM-aware download
- GGUFReader -- memory-mapped GGUF file reader
Multi-GPU¶
- MultiGPUConfig -- GPU split configuration
- NCCLCommunicator -- NCCL collective operations
Observability¶
- setup_grafana_otlp -- one-call telemetry setup
- PerformanceMonitor -- real-time performance tracking
- GpuMetricsCollector -- GPU metric collection
- GraphistryTraceExporter -- trace visualization
Environments¶
- KaggleEnvironment -- Kaggle platform detection
- ServerPreset -- environment-specific presets
Graph & Knowledge¶
- GraphistrySession -- Graphistry connection
- GraphistryBuilders -- graph construction helpers
- LouieClient -- natural language graph queries
- KnowledgeExtractor -- entity/relationship extraction
Quantization & Fine-tuning¶
- NF4Quantizer -- NormalFloat4 quantization
- GGUFConverter -- model-to-GGUF conversion
- DynamicQuantizer -- VRAM-aware quantization
- UnslothModelLoader -- Unsloth model loading
- LoRAAdapter -- LoRA adapter management
CUDA & Inference¶
- CUDAGraph -- CUDA graph capture and replay
- FlashAttentionConfig -- flash attention setup
- KVCache -- KV cache management
- ContinuousBatching -- continuous batching
Interactive¶
- ChatEngine -- multi-turn conversation
- EmbeddingEngine -- text embeddings
- SemanticSearch -- vector similarity search
Conventions¶
- Classes that implement __enter__/__exit__ support Python context managers; this is noted in each class's documentation.
- Optional dependencies degrade gracefully: import errors are caught and the affected features are disabled.
- GPU operations require CUDA 12.x and a compute capability >= 7.5 device (e.g., Tesla T4).
- All timeout parameters are in seconds unless otherwise noted.
- All **kwargs are forwarded to the underlying llama-server HTTP API.
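The context-manager and **kwargs-forwarding conventions can be sketched with a stub class. StubClient and its complete method are illustrative stand-ins, not part of the llamatelemetry API; the sketch only shows the shape of the two conventions.

```python
# Minimal sketch of the conventions above, using a stub in place of the
# real SDK classes (names here are hypothetical, not the actual API).
class StubClient:
    """Mimics the context-manager and **kwargs-forwarding conventions."""

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.closed = True  # resources released on context exit
        return False

    def complete(self, prompt, **kwargs):
        # In the real SDK, extra keyword arguments would be forwarded to
        # the llama-server HTTP API as request-body fields.
        return {"prompt": prompt, **kwargs}


with StubClient() as client:
    payload = client.complete("hello", temperature=0.2, timeout=30)
print(payload)  # → {'prompt': 'hello', 'temperature': 0.2, 'timeout': 30}
```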
Version¶
This reference documents llamatelemetry v0.1.1 (released 2026-02-02).