Get Started¶
Welcome to llamatelemetry, a CUDA-first OpenTelemetry SDK for LLM inference observability. This section walks you through installation, your first inference call, and the Kaggle-optimized workflow, so you can move from zero to running GPU-accelerated language models in minutes.
What is llamatelemetry?¶
llamatelemetry is a Python orchestration layer built on top of
llama.cpp. It automates binary
bootstrapping, model discovery, server lifecycle management, inference requests,
and OpenTelemetry-based observability---all from a single pip install.
The SDK targets NVIDIA GPUs with CUDA 12.x and is production-tested on Tesla T4
hardware (SM 7.5), including Kaggle dual-T4 notebook environments. It exposes a
high-level InferenceEngine API for quick prototyping and a lower-level
LlamaCppClient for full control over the OpenAI-compatible llama-server REST
interface.
Who is it for?¶
- ML engineers who need fast, quantized LLM inference on consumer or cloud GPUs without writing C++ code.
- Platform teams who want standardized OpenTelemetry traces and metrics (following the Gen AI semantic conventions) across inference workloads.
- Kaggle competitors and notebook authors who need reproducible, GPU-aware pipelines with minimal boilerplate.
- Researchers exploring quantization, fine-tuning (Unsloth/LoRA), or graph-based knowledge extraction workflows.
SDK architecture at a glance¶
llamatelemetry ships approximately 40 Python source files and 7 C++/CUDA files, organized into 10 modules. Each module is self-contained and imports only what it needs, so you can use the inference engine without ever touching the graph visualization layer, and vice versa.
| Module | Purpose |
|---|---|
| api | LlamaCppClient (OpenAI-compatible + native llama.cpp API), MultiGPUConfig (split modes, GPU detection, presets), GGUF utilities, NCCL collective operations |
| telemetry | OpenTelemetry instrumentation: 45 gen_ai.* span attributes, 5 metrics instruments, auto-instrumentation hooks, OTLP exporters |
| kaggle | KaggleEnvironment detection, ServerPreset enum, split_gpu_session context manager, Kaggle secrets integration |
| inference | FlashAttention configuration, KV cache management, continuous batching helpers |
| cuda | CUDAGraph capture/replay, Triton kernel launchers, TensorCore utilities |
| quantization | NF4 quantization, GGUF format conversion, dynamic quantization policies |
| graphistry | Graph visualization with Graphistry and RAPIDS cuGraph integration |
| louie | AI-driven graph analysis, knowledge extraction pipelines |
| unsloth | Fine-tuning orchestration, LoRA adapter management, GGUF export |
| _internal | Bootstrap logic (auto-downloads ~961 MB of binaries on first import), MODEL_REGISTRY with 22+ curated model entries |
C++/CUDA extension¶
The llamatelemetry_cpp pybind11 module provides device operations, a Tensor
RAII wrapper supporting 6 data types, and cuBLAS matrix multiplication kernels
(SGEMM and HGEMM). It links against cudart_static, cublas_static, and
cublasLt_static.
System requirements¶
Minimum¶
| Component | Requirement |
|---|---|
| Python | >= 3.11 |
| OS | Linux (Ubuntu 20.04+ recommended) |
| NVIDIA driver | >= 525.x |
| CUDA toolkit | 12.x |
| GPU | Any NVIDIA GPU with compute capability >= 6.1 |
| RAM | 8 GB system memory |
| Disk | 2 GB free (binaries + one small model) |
Recommended¶
| Component | Recommendation |
|---|---|
| GPU | Tesla T4 (16 GB VRAM, SM 7.5) or better |
| VRAM | 16 GB per GPU for 7B-parameter models at Q4 quantization |
| Disk | 10 GB+ free for multiple model downloads |
| Network | Broadband for first-run bootstrap and model downloads |
Kaggle¶
Kaggle notebooks with the GPU T4 x2 accelerator are the primary tested
environment. Each T4 provides 16 GB VRAM and SM 7.5 compute capability. The SDK
includes presets (KAGGLE_DUAL_T4, KAGGLE_SINGLE_T4) that automatically
configure server parameters, GPU layer counts, and context sizes for these
machines.
The shortest path from install to inference¶
The complete workflow has five steps:
- Install the SDK from GitHub with
pip. - Verify that CUDA and your GPU are visible to the runtime.
- Create an
InferenceEngineinstance. - Load a model from the built-in registry (or a local GGUF file).
- Run inference and inspect the
InferResult.
#Step 1: Install llamatelemetry
!pip install -q --no-cache-dir --force-reinstall git+https://github.com/llamatelemetry/llamatelemetry.git@v0.1.1
import llamatelemetry as lt
print("\nllamatelemetry version:", llamatelemetry.__version__)
# Step 2: verify CUDA
cuda_info = lt.detect_cuda()
print(f"CUDA available: {cuda_info['available']}")
for gpu in cuda_info["gpus"]:
print(f" {gpu['name']} - {gpu['memory']} MB")
# Step 3: create engine
engine = lt.InferenceEngine(enable_telemetry=False)
# Step 4: load model (downloads on first run)
engine.load_model("gemma-3-1b-Q4_K_M", auto_start=True)
# Step 5: run inference
result = engine.infer("Explain GPU tensor cores in two sentences.")
print(result.text)
print(f"Tokens/sec: {result.tokens_per_sec:.1f}")
After you have seen a successful result, continue to the detailed guides for model management, telemetry configuration, multi-GPU inference, and more.
Core dependencies¶
llamatelemetry keeps its required dependency footprint small:
| Package | Role |
|---|---|
numpy |
Numerical operations and tensor utilities |
requests |
HTTP communication with llama-server |
huggingface_hub |
Model downloads from Hugging Face |
tqdm |
Progress bars for downloads and batch operations |
opentelemetry-api |
Telemetry span and metric API |
opentelemetry-sdk |
Telemetry SDK (exporters, processors) |
Optional dependency groups (installable via extras) add support for OTLP exporters, Graphistry visualization, pandas DataFrames, Jupyter widgets, PyTorch, pynvml GPU monitoring, SSE streaming, and Weights & Biases logging.
Choose your path¶
Depending on your environment and goals, start with the page that matches best:
Local workstation or cloud VM¶
Follow the standard installation and quickstart:
- Installation -- system prerequisites, pip install, environment variables, optional extras, and troubleshooting.
- Quickstart -- end-to-end tutorial from GPU verification through batch inference, streaming, the low-level client API, chat completions, embeddings, and cleanup.
Kaggle notebook (T4 x2)¶
Jump directly to the Kaggle-specific guide:
- Kaggle Quickstart -- notebook setup,
ServerPresetconfiguration,split_gpu_session, OTLP secrets, the full Kaggle pipeline, and recommended models for T4 VRAM budgets.
Deep dives after getting started¶
Once you are running inference, explore these next:
- Inference Engine Guide -- advanced
InferenceEngineusage, context management, error handling. - Server Management --
ServerManagerlifecycle, health checks, port configuration. - Model Management -- the model registry, Hugging Face downloads, GGUF format details.
- Telemetry and Observability -- OpenTelemetry setup, 45 Gen AI attributes, Grafana dashboards.
- API Client Reference --
LlamaCppClientfor chat completions, embeddings, tokenization. - Kaggle Environment Guide -- advanced Kaggle patterns, secrets management, GPU splitting.
- Notebook Hub -- 18 production-tested Kaggle notebooks covering every major workflow.
Getting help¶
- FAQ: Frequently Asked Questions
- Troubleshooting: Common issues and fixes
- GitHub Issues: github.com/llamatelemetry/llamatelemetry/issues
- Changelog: Release history