09 Large Models on Kaggle¶

Source: notebooks/09-large-models-kaggle-llamatelemetry-e3.ipynb

Notebook focus¶

This page is a cell-by-cell walkthrough of the notebook, explaining the intent of each step and showing the exact code executed.

Cell-by-cell walkthrough¶

Cell 1 (Markdown)¶

09 Large Models on Kaggle¶

Use server presets and suitability checks to run large GGUF models on Kaggle's dual-T4 environment.

What you will learn:
- Check model suitability before loading
- Use ServerPreset for optimized server configuration
- Load large models with preset-derived kwargs

Requirements: a Kaggle notebook with the T4 x2 accelerator enabled and a large GGUF model attached as a dataset.

Cell 2 (Markdown)¶

1) Install¶

Cell 3 (Code)¶

Summary: Installs llamatelemetry from GitHub, pinned to the v0.1.1 tag.

!pip -q install git+https://github.com/llamatelemetry/llamatelemetry.git@v0.1.1

Cell 4 (Markdown)¶

2) Check GPU resources¶

Cell 5 (Code)¶

Summary: Calls detect_cuda() from llamatelemetry, then prints the number of GPUs found and the details of each one.

from llamatelemetry import detect_cuda

cuda_info = detect_cuda()
print(f"GPUs: {len(cuda_info.get('gpus', []))}")
for gpu in cuda_info.get('gpus', []):
    print(f"  {gpu}")
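As an independent cross-check of the GPU list, the same information can be read from nvidia-smi. The sketch below is not part of the notebook; the helper names `parse_gpu_csv` and `query_gpus` are hypothetical, and it assumes nvidia-smi is on the PATH (it returns an empty list otherwise).

```python
import shutil
import subprocess


def parse_gpu_csv(text):
    """Parse 'name, memory.total' rows (csv, noheader, nounits) into dicts."""
    gpus = []
    for line in text.strip().splitlines():
        if not line.strip():
            continue
        # Split on the last comma so GPU names containing commas survive.
        name, mem_mib = [part.strip() for part in line.rsplit(",", 1)]
        gpus.append({"name": name, "memory_mib": int(mem_mib)})
    return gpus


def query_gpus():
    """Return GPUs reported by nvidia-smi, or [] if it is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    proc = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=False,
    )
    if proc.returncode != 0:
        return []
    return parse_gpu_csv(proc.stdout)


for gpu in query_gpus():
    print(f"{gpu['name']}: {gpu['memory_mib']} MiB")
```

On a dual-T4 session this should list two devices; if it lists one, the accelerator setting is probably not T4 x2.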

Cell 6 (Markdown)¶

3) Run suitability check¶

Before loading a large model, verify it fits in dual-T4 VRAM.

Cell 7 (Code)¶

Summary: Runs report_model_suitability on the GGUF file for an 8192-token context on dual T4s and prints the resulting report as formatted JSON.

import json
from llamatelemetry.api.gguf import report_model_suitability

model_path = "/kaggle/input/your-large-model/model.gguf"

suitability = report_model_suitability(model_path, ctx_size=8192, dual_t4=True)
print(json.dumps(suitability, indent=2, default=str))
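For intuition about what such a check has to weigh, here is a rough back-of-the-envelope estimate. It is a sketch under stated assumptions, not how report_model_suitability works: weights are approximated by the GGUF file size, the KV cache uses the standard formula for a transformer with grouped-query attention, and the 15 GiB usable per T4 and 10% headroom are assumed values. The function names and the example architecture (32 layers, 8 KV heads, head dimension 128) are hypothetical.

```python
GIB = 1024 ** 3


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_size, bytes_per_elem=2):
    """KV cache size: K and V each hold ctx_size * n_kv_heads * head_dim per layer."""
    return 2 * n_layers * ctx_size * n_kv_heads * head_dim * bytes_per_elem


def estimate_fit(weights_bytes, kv_bytes, n_gpus=2, per_gpu_gib=15.0, headroom=0.10):
    """Compare weights + KV cache against total VRAM minus a safety margin."""
    budget = n_gpus * per_gpu_gib * GIB * (1.0 - headroom)
    needed = weights_bytes + kv_bytes
    return {
        "needed_gib": round(needed / GIB, 2),
        "budget_gib": round(budget / GIB, 2),
        "fits": needed <= budget,
    }


# Hypothetical model: 32 layers, 8 KV heads, head_dim 128, fp16 cache, ctx 8192.
kv = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, ctx_size=8192)
# In a real session, weights_bytes could come from os.path.getsize(model_path).
print(estimate_fit(weights_bytes=20 * GIB, kv_bytes=kv))
```

This ignores compute buffers and uneven layer splits across the two GPUs, so treat a narrow "fits" as a warning rather than a guarantee.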

Cell 8 (Markdown)¶

4) Available presets¶

Preset	Target
`KAGGLE_DUAL_T4`	Kaggle dual T4 (32 GB total)
`KAGGLE_SINGLE_T4`	Kaggle single T4 (16 GB)
`COLAB_T4`	Colab T4
`COLAB_A100`	Colab A100
`LOCAL_3090`	Local RTX 3090
`LOCAL_4090`	Local RTX 4090
`CPU_ONLY`	CPU-only fallback

Cell 9 (Code)¶

Summary: Looks up the KAGGLE_DUAL_T4 preset with get_preset_config and prints the server kwargs and load kwargs derived from it.

from llamatelemetry.kaggle import ServerPreset, get_preset_config

preset = get_preset_config(ServerPreset.KAGGLE_DUAL_T4)
print(f"Preset: {ServerPreset.KAGGLE_DUAL_T4.value}")
print(f"Server kwargs: {preset.to_server_kwargs()}")
print(f"Load kwargs:   {preset.to_load_kwargs()}")
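The idea of one preset producing two kwargs dictionaries can be illustrated with a small standalone dataclass. This is a hypothetical sketch of the pattern only: `SketchPreset`, its field names, and its default values are invented for illustration and do not reflect llamatelemetry's actual preset fields or output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchPreset:
    """Hypothetical preset; fields are illustrative, not llamatelemetry's."""
    gpu_layers: int = 99              # assumed: offload all layers
    ctx_size: int = 8192              # matches the context size used above
    tensor_split: tuple = (0.5, 0.5)  # assumed: even split across two GPUs
    batch_size: int = 512

    def server_kwargs(self):
        """Settings a server process might need at startup."""
        return {
            "ctx_size": self.ctx_size,
            "batch_size": self.batch_size,
            "tensor_split": ",".join(str(x) for x in self.tensor_split),
        }

    def load_kwargs(self):
        """Settings a model-loading call might need."""
        return {"gpu_layers": self.gpu_layers, "ctx_size": self.ctx_size}


preset_sketch = SketchPreset()
print(preset_sketch.server_kwargs())
print(preset_sketch.load_kwargs())
```

Keeping the values in one frozen object and projecting them into call-specific dictionaries avoids the two call sites drifting apart. Inspect the real preset's printed kwargs to see which keys llamatelemetry actually uses.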

Cell 10 (Markdown)¶

5) Load the model with preset config¶

Cell 11 (Code)¶

Summary: Creates an InferenceEngine with telemetry disabled, loads the GGUF model from its local path using the preset's load kwargs (with auto_start=True), then runs a short generation and prints the throughput and the generated text.

import llamatelemetry as lt

engine = lt.InferenceEngine(enable_telemetry=False)
engine.load_model(
    model_path,
    auto_start=True,
    **preset.to_load_kwargs(),
)

result = engine.generate("Summarize the model setup", max_tokens=64)
print(f"Tokens/sec: {result.tokens_per_sec:.1f}")
print(result.text)
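To sanity-check the reported throughput, wall-clock time can be measured around any call. The helpers below are a generic sketch with hypothetical names (`timed`, `tokens_per_second`); how llamatelemetry computes result.tokens_per_sec is not stated in the notebook, so a wall-clock figure that includes prompt processing may differ from it.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def tokens_per_second(n_tokens, elapsed_s):
    """Throughput as tokens divided by seconds; 0.0 if no time elapsed."""
    return n_tokens / elapsed_s if elapsed_s > 0 else 0.0


# Stand-in workload so the sketch runs without a model. In the notebook this
# could instead wrap engine.generate(...) and use the number of generated tokens.
_, elapsed = timed(sum, range(1_000_000))
print(f"Elapsed: {elapsed:.4f} s")
```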

Cell 12 (Markdown)¶

6) Cleanup¶

Cell 13 (Code)¶

Summary: Cleans up or shuts down running resources.

engine.unload_model()
print("Done.")
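If generation raises an exception, the unload call above is never reached and the model stays in VRAM. One way to guard against that is a small context manager. This is a sketch, not part of llamatelemetry: `loaded_model` is a hypothetical helper that relies only on the load_model and unload_model methods used in this notebook, and `StubEngine` is a stand-in so the example runs without a GPU.

```python
from contextlib import contextmanager


@contextmanager
def loaded_model(engine, model_path, **load_kwargs):
    """Load a model, yield the engine, and always unload on exit."""
    engine.load_model(model_path, **load_kwargs)
    try:
        yield engine
    finally:
        # Runs on normal exit and when the body raises.
        engine.unload_model()


class StubEngine:
    """Stand-in used only to demonstrate the pattern without a GPU."""

    def __init__(self):
        self.loaded = False

    def load_model(self, model_path, **kwargs):
        self.loaded = True

    def unload_model(self):
        self.loaded = False


stub = StubEngine()
try:
    with loaded_model(stub, "model.gguf"):
        raise RuntimeError("simulated failure during generation")
except RuntimeError:
    pass
print(f"Still loaded after failure: {stub.loaded}")
```

With the real engine, the same wrapper would be used as `with loaded_model(engine, model_path, auto_start=True, **preset.to_load_kwargs()) as eng:`.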