API Reference

Complete API reference for the llamatelemetry Python SDK v0.1.1. This section documents every public class, function, dataclass, and enum across all 17 modules.

Module Map

| Module | Description | Reference Page |
| --- | --- | --- |
| llamatelemetry | High-level InferenceEngine, InferResult, and top-level helpers | Core API |
| llamatelemetry.server | ServerManager, ModelInfo, ModelManager, SmartModelDownloader | Server & Models |
| llamatelemetry.api.client | LlamaCppClient, sub-APIs, ChatEngine, ConversationManager, EmbeddingEngine, SemanticSearch | Client API |
| llamatelemetry.api.gguf | GGUF parsing, validation, quantization matrix, GGUFReader, enums | GGUF API |
| llamatelemetry.api.multigpu | MultiGPUConfig, SplitMode, GPU detection and presets | Multi-GPU & NCCL |
| llamatelemetry.api.nccl | NCCLConfig, NCCLCommunicator, collective operations | Multi-GPU & NCCL |
| llamatelemetry.telemetry | OpenTelemetry setup, InstrumentedLlamaCppClient, PerformanceMonitor, GpuMetricsCollector, GraphistryTraceExporter | Telemetry API |
| llamatelemetry.kaggle | KaggleEnvironment, ServerPreset, pipeline config, secrets | Kaggle API |
| llamatelemetry.graphistry | GraphistrySession, GraphistryBuilders, GraphistryViz, RAPIDSBackend, SplitGPUManager | Graphistry API |
| llamatelemetry.quantization | NF4Quantizer, GGUFConverter, DynamicQuantizer, strategies | Quantization & Unsloth |
| llamatelemetry.unsloth | UnslothModelLoader, UnslothExporter, LoRAAdapter | Quantization & Unsloth |
| llamatelemetry.cuda | CUDAGraph, TritonKernel, TensorCoreConfig | CUDA & Inference |
| llamatelemetry.inference | FlashAttentionConfig, KVCache, ContinuousBatching, BatchInferenceOptimizer | CUDA & Inference |
| llamatelemetry.jupyter | Jupyter widgets, streaming helpers, visualization | Jupyter, Chat & Embeddings |
| llamatelemetry.chat | ChatEngine, ConversationManager, Message | Jupyter, Chat & Embeddings |
| llamatelemetry.embeddings | EmbeddingEngine, SemanticSearch, TextClustering, similarity functions | Jupyter, Chat & Embeddings |
| llamatelemetry.louie | LouieClient, KnowledgeExtractor, KnowledgeGraph, entity/relation enums | Louie API |

Core

Model Management

Multi-GPU

Observability

Environments

Graph & Knowledge

Quantization & Fine-tuning

CUDA & Inference

Interactive

Conventions

  • Classes noted as supporting context managers implement __enter__ / __exit__ and can be used in with statements.
  • Optional dependencies degrade gracefully: import errors are caught at load time and the corresponding features are disabled rather than raising.
  • GPU operations require CUDA 12.x and a compute capability >= 7.5 device (e.g., Tesla T4).
  • All timeout parameters are in seconds unless otherwise noted.
  • All **kwargs are forwarded to the underlying llama-server HTTP API.
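The first two conventions can be sketched in plain Python. Note that EngineSketch and HAS_GRAPHISTRY below are illustrative names, not part of the SDK's actual API; this is a minimal sketch of the patterns, assuming the context-manager and optional-dependency conventions described above.

```python
# Graceful degradation (hypothetical flag name): the import error is caught
# once at module load, and dependent features check the flag instead of
# importing again.
try:
    import graphistry  # optional dependency
    HAS_GRAPHISTRY = True
except ImportError:
    HAS_GRAPHISTRY = False


class EngineSketch:
    """Illustrates the __enter__/__exit__ convention (not the real class)."""

    def __init__(self, timeout: float = 30.0):
        # Per the conventions, timeouts are in seconds.
        self.timeout = timeout
        self.started = False

    def __enter__(self):
        self.started = True  # acquire resources here
        return self

    def __exit__(self, exc_type, exc, tb):
        self.started = False  # release resources here
        return False  # do not swallow exceptions


# Usage: resources are released even if the body raises.
with EngineSketch(timeout=10.0) as engine:
    assert engine.started
```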

Version

This reference documents llamatelemetry v0.1.1 (released 2026-02-02).