Model Management

llamatelemetry provides a complete model management system for discovering, downloading, inspecting, and selecting GGUF models. The system includes a curated registry of 22 models, smart downloading from HuggingFace Hub, GGUF metadata inspection, and VRAM-aware recommendations.

Overview

The model management layer consists of:

MODEL_REGISTRY -- a built-in catalog of 22 pre-tested GGUF models
SmartModelDownloader -- handles downloads from HuggingFace with caching
ModelInfo -- parses GGUF metadata and recommends GPU settings
ModelManager -- manages local model storage and lookup
load_model_smart() -- unified loader that resolves registry names, local paths, and HuggingFace references

The Model Registry

The built-in MODEL_REGISTRY contains 22 curated models that have been tested with llamatelemetry:

from llamatelemetry._internal.bootstrap import MODEL_REGISTRY

for name, info in MODEL_REGISTRY.items():
    print(f"{name}: {info.get('file_size_mb', '?')} MB")

Registry Model Categories

Category	Models	Size Range
Tiny (< 1B)	TinyLlama, SmolLM	500 MB - 1 GB
Small (1-3B)	Gemma-3-1B, Llama-3.2-1B, Phi-4-mini, Qwen-2.5-1.5B	1 - 2 GB
Medium (3-8B)	Llama-3.1-8B, Mistral-7B, Gemma-2-9B	4 - 6 GB
Large (8B+)	Llama-3.3-70B, DeepSeek, Mixtral	6 - 40+ GB

Recommended Models for Tesla T4 (16 GB VRAM)

Model	Quantization	Size	Use Case
`gemma-3-1b-Q4_K_M`	Q4_K_M	~0.8 GB	Fast prototyping, testing
`llama-3.2-1b-Q4_K_M`	Q4_K_M	~0.9 GB	General purpose, lightweight
`phi-4-mini-Q4_K_M`	Q4_K_M	~2.3 GB	Coding tasks, reasoning
`llama-3.1-8b-Q4_K_M`	Q4_K_M	~4.9 GB	High-quality general purpose
`mistral-7b-Q4_K_M`	Q4_K_M	~4.4 GB	Instruction following

Smart Model Loading

The load_model_smart() function is the primary way to resolve and download models:

from llamatelemetry.models import load_model_smart

# From registry -- looks up MODEL_REGISTRY, downloads if needed
model_path = load_model_smart("gemma-3-1b-Q4_K_M")

# From local filesystem
model_path = load_model_smart("/home/user/models/custom-model.gguf")

# From HuggingFace -- repo:filename syntax
model_path = load_model_smart(
    "bartowski/gemma-2-2b-it-GGUF:gemma-2-2b-it-Q4_K_M.gguf"
)

Resolution Order

1. Local path -- if the string is a valid filesystem path to an existing .gguf file, use it directly
2. Registry name -- if it matches a key in MODEL_REGISTRY, download from the registered HuggingFace URL
3. HuggingFace reference -- if it contains a colon (repo:filename), download from HuggingFace Hub
4. Error -- raise an exception with suggestions
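
The resolution order can be sketched as a small classifier. This is an illustration only, not the library's implementation: `classify_model_ref` and its `registry` argument are hypothetical names, and the real load_model_smart() may apply additional checks.

```python
import os


def classify_model_ref(ref, registry):
    """Classify a model reference following the documented resolution order.

    Hypothetical helper for illustration; not part of llamatelemetry.
    Returns a (kind, detail) tuple.
    """
    # 1. Local path: an existing file ending in .gguf wins.
    if ref.lower().endswith(".gguf") and os.path.isfile(ref):
        return ("local", ref)
    # 2. Registry name: exact match against the registry keys.
    if ref in registry:
        return ("registry", ref)
    # 3. HuggingFace reference: "repo:filename", split at the first colon.
    if ":" in ref:
        repo_id, filename = ref.split(":", 1)
        if repo_id and filename:
            return ("huggingface", (repo_id, filename))
    # 4. Nothing matched: fail with a hint about the accepted forms.
    raise ValueError(
        f"Cannot resolve {ref!r}: expected a local .gguf path, "
        "a registry name, or a 'repo:filename' reference"
    )
```

Checking the local path first means an existing file is used even if its name also looks like a registry key or contains a colon.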

Download Caching

Downloaded models are cached in ~/.cache/llamatelemetry/models/. Subsequent calls to load_model_smart() with the same model name return the cached path immediately.
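
The cache-hit behavior can be pictured with the following sketch. It is not the library's code: `get_or_fetch` and the `fetch` callback are hypothetical, and the only detail taken from this page is the default cache directory.

```python
from pathlib import Path


def get_or_fetch(filename, fetch, cache_dir="~/.cache/llamatelemetry/models"):
    """Return the cached file if present; otherwise call fetch(target) once.

    Hypothetical helper: `fetch` stands in for whatever performs the download
    and is expected to write the file at `target`.
    """
    target = Path(cache_dir).expanduser() / filename
    if target.is_file():
        return target  # cache hit: no download needed
    target.parent.mkdir(parents=True, exist_ok=True)
    fetch(target)  # cache miss: download into the cache directory
    return target
```

A second call with the same filename returns without invoking `fetch`, which is the behavior described for repeated load_model_smart() calls.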

ModelInfo -- GGUF Metadata Inspection

ModelInfo reads GGUF file headers to extract model metadata:

from llamatelemetry.models import ModelInfo

info = ModelInfo.from_file("/path/to/model.gguf")

print(f"Architecture: {info.architecture}")
print(f"Parameters: {info.parameters}")
print(f"Context length: {info.context_length}")
print(f"Quantization: {info.quantization}")
print(f"File size: {info.file_size_mb:.1f} MB")
print(f"Embedding size: {info.embedding_size}")
print(f"Layers: {info.n_layers}")
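
For orientation, the fixed-size part of a GGUF header can be read with the standard library alone. The sketch below assumes a little-endian file using GGUF version 2 or later (where both counts are 64-bit); `read_gguf_header` is a hypothetical helper, and how ModelInfo parses the file internally is not shown on this page.

```python
import struct


def read_gguf_header(path):
    """Read the GGUF magic, version, tensor count and metadata key/value count.

    Assumes little-endian GGUF v2+; hypothetical helper, not ModelInfo itself.
    """
    with open(path, "rb") as f:
        raw = f.read(24)  # 4-byte magic + uint32 version + two uint64 counts
    if len(raw) < 24 or raw[:4] != b"GGUF":
        raise ValueError(f"{path} does not look like a GGUF file")
    version, tensor_count, kv_count = struct.unpack("<IQQ", raw[4:24])
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }
```

Fields such as architecture and context length live in the metadata key/value section that follows this header, so reading them requires walking those entries as well.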

VRAM Recommendations

# `info` is the ModelInfo instance created in the previous example
recommendations = info.get_recommended_settings(vram_gb=16)
print(f"GPU layers: {recommendations['gpu_layers']}")
print(f"Context size: {recommendations['ctx_size']}")
print(f"Estimated VRAM: {recommendations['estimated_vram_gb']:.1f} GB")

The recommendation engine accounts for:

Model size (parameters and quantization)
KV cache memory (scales with context size and layers)
CUDA runtime overhead (~500 MB)
A safety margin to avoid OOM
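
A rough estimate built from these four factors might look like the sketch below. The formula and defaults are assumptions made for illustration (an FP16 KV cache, a 10% safety margin, weights approximated by the file size); only the ~500 MB runtime overhead comes from this page, and the actual recommendation engine may compute things differently.

```python
def estimate_vram_gb(file_size_mb, n_layers, ctx_size, kv_dim,
                     kv_bytes_per_element=2, overhead_mb=500,
                     safety_margin=0.10):
    """Estimate VRAM in GB for a fully offloaded model (illustrative only).

    kv_dim is the per-layer width of the K (or V) vectors, i.e.
    number of KV heads * head dimension. All defaults are assumptions.
    """
    # Weights: approximated by the size of the quantized GGUF file.
    weights_mb = file_size_mb
    # KV cache: K and V each hold ctx_size * kv_dim elements per layer.
    kv_cache_mb = (2 * n_layers * ctx_size * kv_dim
                   * kv_bytes_per_element) / (1024 ** 2)
    subtotal_mb = weights_mb + kv_cache_mb + overhead_mb
    # Safety margin applied on top of everything to reduce the risk of OOM.
    return subtotal_mb * (1 + safety_margin) / 1024


# Llama-3.1-8B at Q4_K_M (~4.9 GB file), 32 layers, 8 KV heads * 128 dims,
# 8192-token context: about 6.9 GB under these assumptions.
print(f"{estimate_vram_gb(4900, 32, 8192, 1024):.1f} GB")
```

The KV cache term grows linearly with context size, so doubling the context in this example adds roughly 1 GB before the margin is applied.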

ModelManager -- Collection Management

ModelManager provides a higher-level interface for managing multiple models:

from llamatelemetry.models import ModelManager

mm = ModelManager()

# List all locally available models
local_models = mm.list_local_models()
for m in local_models:
    print(f"{m['name']}: {m['path']}")

# Download a model
path = mm.download("gemma-3-1b-Q4_K_M")

# Remove a cached model
mm.remove("gemma-3-1b-Q4_K_M")
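
To see what a local listing involves, the following sketch scans a directory for .gguf files using only the standard library. `scan_gguf_files` is a hypothetical helper; the dictionary keys mirror the ones used in the example above, but the real ModelManager may return additional fields or search other locations.

```python
from pathlib import Path


def scan_gguf_files(directory):
    """Return one dict per .gguf file found under `directory` (recursively).

    Hypothetical stand-in for a local model listing; not ModelManager itself.
    """
    root = Path(directory).expanduser()
    if not root.is_dir():
        return []  # a missing cache directory simply means no local models
    entries = []
    for path in sorted(root.rglob("*.gguf")):
        entries.append({
            "name": path.stem,
            "path": str(path),
            "file_size_mb": path.stat().st_size / (1024 * 1024),
        })
    return entries
```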

SmartModelDownloader

For fine-grained download control:

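import os

from llamatelemetry.models import SmartModelDownloader

downloader = SmartModelDownloader()

# Download a specific file from a HuggingFace repo into the cache directory.
# expanduser() resolves "~" explicitly rather than relying on the library to do it.
path = downloader.download(
    repo_id="bartowski/gemma-2-2b-it-GGUF",
    filename="gemma-2-2b-it-Q4_K_M.gguf",
    cache_dir=os.path.expanduser("~/.cache/llamatelemetry/models"),
)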

The downloader uses huggingface_hub under the hood and supports:

Resume of interrupted downloads
Progress bar display via tqdm
Token-based authentication for gated models
Automatic cache management

Gated Model Authentication

Some HuggingFace models require authentication:

import os

from llamatelemetry.models import load_model_smart

# Set the token before downloading so the request is authenticated
os.environ["HF_TOKEN"] = "hf_your_token_here"

path = load_model_smart("meta-llama/Llama-3.1-8B-GGUF:model-Q4_K_M.gguf")
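
Checking for a token up front can give a clearer error than a failed download. The sketch below is a hypothetical pre-flight helper, not part of llamatelemetry; it looks at HF_TOKEN and, as a fallback, the older HUGGING_FACE_HUB_TOKEN variable that huggingface_hub also recognizes.

```python
import os


def resolve_hf_token(env=None):
    """Return the first non-empty HuggingFace token found, or None.

    Hypothetical helper; `env` defaults to the process environment.
    """
    env = os.environ if env is None else env
    for var in ("HF_TOKEN", "HUGGING_FACE_HUB_TOKEN"):
        token = env.get(var, "").strip()
        if token:
            return token
    return None


if resolve_hf_token() is None:
    print("No HuggingFace token set; gated models will fail to download.")
```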

Listing Available Models

from llamatelemetry.models import list_models, list_registry_models

# List all locally cached models
local = list_models()
for m in local:
    print(f"{m['filename']}: {m['file_size_mb']:.0f} MB")

# List all models in the registry
registry = list_registry_models()
for name in registry:
    print(name)

Model Recommendations

Get recommendations based on available VRAM:

from llamatelemetry.models import get_model_recommendations

# Get models that fit in 16 GB VRAM
recs = get_model_recommendations(vram_gb=16)
for rec in recs:
    print(f"{rec['name']}: {rec['estimated_vram_gb']:.1f} GB VRAM")
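
Conceptually, a VRAM-based recommendation is a filter over candidates with known memory estimates. The sketch below assumes a list of dicts shaped like the results above; `pick_models_for_vram` and the `headroom_gb` parameter are hypothetical, and the real function may rank or filter differently.

```python
def pick_models_for_vram(candidates, vram_gb, headroom_gb=1.0):
    """Keep candidates that fit in vram_gb minus headroom, largest first.

    Hypothetical helper; each candidate needs 'name' and 'estimated_vram_gb'.
    """
    budget = vram_gb - headroom_gb
    fitting = [c for c in candidates if c["estimated_vram_gb"] <= budget]
    # Largest fitting model first, on the assumption that it is the most capable.
    return sorted(fitting, key=lambda c: c["estimated_vram_gb"], reverse=True)
```

Reserving some headroom leaves space for other processes on the GPU and for estimates that turn out slightly low.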

Bootstrap System

On first import, llamatelemetry checks for the llama-server binary and essential models. The bootstrap process:

1. Downloads the llama-server release bundle (~961 MB) if missing
2. Extracts binaries and CUDA shared libraries to ~/.cache/llamatelemetry/
3. Configures LD_LIBRARY_PATH for the bundled CUDA libraries
4. Caches model files in ~/.cache/llamatelemetry/models/

from llamatelemetry._internal.bootstrap import bootstrap

# Force re-bootstrap
bootstrap(force=True)
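
The LD_LIBRARY_PATH step matters because the dynamic linker reads that variable when a process starts, so it has to be in place in the environment handed to llama-server. The sketch below shows one way to build such an environment; `with_library_path` is a hypothetical helper, and the directory that bootstrap actually adds is not specified on this page.

```python
import os


def with_library_path(lib_dir, env=None):
    """Return a copy of env with lib_dir at the front of LD_LIBRARY_PATH.

    Hypothetical helper; the input mapping is left unmodified.
    """
    base = dict(os.environ if env is None else env)
    lib_dir = os.path.expanduser(lib_dir)
    existing = [p for p in base.get("LD_LIBRARY_PATH", "").split(os.pathsep) if p]
    if lib_dir not in existing:
        existing.insert(0, lib_dir)  # bundled libraries take precedence
    base["LD_LIBRARY_PATH"] = os.pathsep.join(existing)
    return base
```

The returned mapping can be passed as the `env` argument of `subprocess.Popen` when launching a server binary.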

Best Practices

Start with small models (1B parameters) for development and testing on T4 GPUs.
Use registry names for reproducibility -- they pin specific quantization variants.
Check VRAM before loading using ModelInfo.get_recommended_settings().
Cache models locally to avoid repeated downloads on Kaggle.
Use Q4_K_M quantization as the default -- it is generally a good balance between output quality and file size.
Set HF_TOKEN for gated models before calling load_model_smart().

Complete Example

from llamatelemetry.models import (
    load_model_smart,
    ModelInfo,
    list_registry_models,
    get_model_recommendations,
)

# 1. Check what models fit on our GPU
recs = get_model_recommendations(vram_gb=16)
print("Recommended models for 16 GB VRAM:")
for r in recs[:5]:
    print(f"  {r['name']}")

# 2. List registry models
print("\nRegistry models:")
for name in list_registry_models():
    print(f"  {name}")

# 3. Load a model
model_path = load_model_smart("gemma-3-1b-Q4_K_M")
print(f"\nModel path: {model_path}")

# 4. Inspect metadata
info = ModelInfo.from_file(model_path)
print(f"Architecture: {info.architecture}")
print(f"Quantization: {info.quantization}")
print(f"Context length: {info.context_length}")

# 5. Get optimal settings for our GPU
settings = info.get_recommended_settings(vram_gb=16)
print(f"Recommended GPU layers: {settings['gpu_layers']}")
print(f"Recommended context size: {settings['ctx_size']}")

See also:

Inference Engine -- uses model management internally
Quantization -- GGUF quantization details
GGUF API Reference
Server and Models Reference