Kaggle Quickstart
This guide is written for Kaggle notebooks that use the GPU T4 x2 accelerator. llamatelemetry includes Kaggle-specific presets, pipeline helpers, secrets integration, and a split-GPU context manager that dedicates one T4 to inference and the other to analytics workloads such as Graphistry visualization.
Every code block below is designed to run as a Kaggle notebook cell.
1. Enable GPU accelerator
Before running any code, configure the Kaggle notebook:
- Open your notebook in Kaggle.
- Click Settings (right sidebar) or the three-dot menu.
- Under Accelerator, select GPU T4 x2.
- Save and restart the session.
Each Tesla T4 provides 16 GB of VRAM (reported as 15360 MB) and has compute capability 7.5 (SM 7.5).
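Before installing anything, it can help to confirm that the session actually received both GPUs. A minimal sketch using nvidia-smi, which is independent of llamatelemetry; parse_gpu_csv and list_gpus are hypothetical helper names introduced here for illustration:

```python
import shutil
import subprocess


def parse_gpu_csv(text):
    """Parse `name, memory.total` CSV rows into (name, memory_mib) tuples."""
    gpus = []
    for line in text.strip().splitlines():
        # Split on the last comma so GPU names containing commas stay intact.
        name, memory = (field.strip() for field in line.rsplit(",", 1))
        gpus.append((name, int(memory.split()[0])))  # "15360 MiB" -> 15360
    return gpus


def list_gpus():
    """Return visible GPUs, or an empty list when nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_gpu_csv(out.stdout) if out.returncode == 0 else []


gpus = list_gpus()
print(f"{len(gpus)} GPU(s) visible: {gpus}")
```

If this prints fewer than two GPUs, recheck the Accelerator setting and restart the session.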
2. Install llamatelemetry
Run this as the first code cell. The -q flag suppresses pip output to keep the
notebook clean:
!pip -q install --no-cache-dir --force-reinstall \
git+https://github.com/llamatelemetry/llamatelemetry.git@v0.1.1
For optional dependencies needed by specific workflows:
# Telemetry export to Grafana Cloud or other OTLP backends
!pip -q install opentelemetry-exporter-otlp-proto-http
# Graphistry visualization (requires Graphistry API key)
!pip -q install pygraphistry pandas
# Weights & Biases logging
!pip -q install wandb
# SSE streaming support
!pip -q install sseclient-py
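Because these dependencies are optional, a notebook can check which ones are present before running a workflow that needs them. A minimal sketch; the import names in OPTIONAL_MODULES are assumptions (pip names and import names differ for some packages), and is_importable and missing_features are hypothetical helpers:

```python
import importlib.util

# Assumed import names for the optional packages installed above.
OPTIONAL_MODULES = {
    "OTLP export": "opentelemetry.exporter.otlp.proto.http",
    "Graphistry": "graphistry",
    "Weights & Biases": "wandb",
    "SSE streaming": "sseclient",
}


def is_importable(module_name):
    """Return True if the module can be located without importing it fully."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # find_spec raises when a parent package of a dotted name is missing.
        return False


def missing_features(modules=OPTIONAL_MODULES):
    """List the features whose backing module is not installed."""
    return [name for name, module in modules.items() if not is_importable(module)]


print("Missing optional features:", missing_features() or "none")
```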
3. Verify CUDA and GPU visibility
Confirm both T4 GPUs are visible to the SDK:
import llamatelemetry as lt
cuda_info = lt.detect_cuda()
print(f"CUDA available: {cuda_info['available']}")
print(f"CUDA version: {cuda_info['version']}")
print(f"GPU count: {len(cuda_info['gpus'])}")
for i, gpu in enumerate(cuda_info["gpus"]):
    print(f"\n GPU {i}: {gpu['name']}")
    print(f" VRAM: {gpu['memory']} MB")
    print(f" Driver: {gpu['driver_version']}")
    print(f" Compute capability: {gpu['compute_capability']}")
Expected output:
CUDA available: True
CUDA version: 12.2
GPU count: 2
GPU 0: Tesla T4
VRAM: 15360 MB
Driver: 535.104.05
Compute capability: 7.5
GPU 1: Tesla T4
VRAM: 15360 MB
Driver: 535.104.05
Compute capability: 7.5
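The per-GPU memory figures can be summed to get the VRAM budget available to a layer-split model. A minimal sketch that works on a dictionary shaped like the output above; total_vram_gb is a hypothetical helper, and the sample dictionary stands in for the value returned by lt.detect_cuda():

```python
def total_vram_gb(cuda_info):
    """Sum per-GPU memory (reported in MB) and convert to GB."""
    return sum(gpu["memory"] for gpu in cuda_info["gpus"]) / 1024


# Stand-in for lt.detect_cuda(), using the values from the expected output.
sample_info = {
    "available": True,
    "version": "12.2",
    "gpus": [
        {"name": "Tesla T4", "memory": 15360},
        {"name": "Tesla T4", "memory": 15360},
    ],
}

print(f"Combined VRAM: {total_vram_gb(sample_info):.1f} GB")
```

With two T4s at 15360 MB each, the combined budget is 30 GB before any runtime overhead.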
4. ServerPreset for Kaggle
The ServerPreset enum provides pre-tuned configurations for common GPU
environments. Each preset sets appropriate values for GPU layers, context size,
number of parallel slots, and split mode:
from llamatelemetry.kaggle import ServerPreset
# Available presets
print(ServerPreset.KAGGLE_DUAL_T4) # Dual T4 with layer splitting
print(ServerPreset.KAGGLE_SINGLE_T4) # Single T4, all layers on GPU 0
print(ServerPreset.COLAB_T4) # Google Colab T4 environment
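The four settings a preset bundles map naturally onto server launch options. A minimal sketch, assuming the underlying server accepts llama.cpp-style llama-server flags; PresetValues and every number below are hypothetical and are not the values llamatelemetry ships:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PresetValues:
    """Hypothetical container for the settings a preset controls."""

    n_gpu_layers: int  # how many layers to offload to the GPU(s)
    ctx_size: int      # context window in tokens
    parallel: int      # number of parallel request slots
    split_mode: str    # "none" for one GPU, "layer" to split across GPUs
    main_gpu: int = 0

    def to_args(self):
        """Render the settings as llama-server style command-line flags."""
        return [
            "--n-gpu-layers", str(self.n_gpu_layers),
            "--ctx-size", str(self.ctx_size),
            "--parallel", str(self.parallel),
            "--split-mode", self.split_mode,
            "--main-gpu", str(self.main_gpu),
        ]


# Illustrative numbers only.
DUAL_T4 = PresetValues(n_gpu_layers=99, ctx_size=8192, parallel=4, split_mode="layer")
SINGLE_T4 = PresetValues(n_gpu_layers=99, ctx_size=4096, parallel=2, split_mode="none")

print(" ".join(DUAL_T4.to_args()))
```

Printing the real ServerPreset members, as in the cell above, remains the authoritative way to see what each preset configures.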
Start a server from a preset
The start_server_from_preset() function creates a ServerManager, applies
the preset configuration, loads the specified model, and starts the server in
one call:
from llamatelemetry.kaggle import start_server_from_preset, ServerPreset
server = start_server_from_preset(
    preset=ServerPreset.KAGGLE_DUAL_T4,
    model_name="gemma-3-1b-Q4_K_M",
)
print(f"Server running at: {server.server_url}")
print(f"Server PID: {server.pid}")
The dual-T4 preset splits model layers across both GPUs, effectively doubling
the available VRAM for larger models. For models that fit on a single GPU, use
KAGGLE_SINGLE_T4 to keep GPU 1 free for analytics.
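Model loading can take a while, so it is often worth waiting until the server answers before sending requests. A minimal sketch, assuming the server exposes a llama.cpp-style /health endpoint that returns HTTP 200 once the model is ready; wait_for_server is a hypothetical helper, not part of llamatelemetry:

```python
import time
import urllib.error
import urllib.request


def wait_for_server(base_url, timeout_s=60.0, interval_s=1.0):
    """Poll <base_url>/health until it returns 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=interval_s) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Connection refused, timeouts, and non-2xx responses all mean "not ready yet".
            pass
        time.sleep(interval_s)
    return False
```

For example, calling wait_for_server(server.server_url) right after start_server_from_preset() returns True once the endpoint responds, or False if the timeout passes first.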
5. Basic inference on Kaggle
Once the server is running, create an InferenceEngine and run inference:
engine = lt.InferenceEngine(
    server_url="http://127.0.0.1:8080",
    enable_telemetry=False,
)
# The server from step 4 is already running, so do not start another one
engine.load_model("gemma-3-1b-Q4_K_M", auto_start=False)
result = engine.infer(
    "Summarize the benefits of running LLMs on Kaggle T4 GPUs.",
    max_tokens=128,
    temperature=0.7,
)
print(result.text)
print(f"\nTokens generated: {result.tokens_generated}")
print(f"Latency: {result.latency_ms:.0f} ms")
print(f"Throughput: {result.tokens_per_sec:.1f} tokens/sec")
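The three metrics printed above are related by simple arithmetic, which is handy for sanity-checking results. A minimal sketch, assuming latency_ms covers the whole request; whether llamatelemetry computes tokens_per_sec in exactly this way is not stated in this guide:

```python
def tokens_per_second(tokens_generated, latency_ms):
    """Derive throughput from a token count and a latency in milliseconds."""
    if latency_ms <= 0:
        raise ValueError("latency_ms must be positive")
    return tokens_generated / (latency_ms / 1000.0)


# 128 tokens generated in 3.2 seconds
print(f"{tokens_per_second(128, 3200):.1f} tokens/sec")  # 40.0 tokens/sec
```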
6. Split-GPU session
The split_gpu_session context manager assigns GPUs to different workloads.
This is the recommended pattern when you need inference on one GPU and
analytics (Graphistry, cuGraph, or other CUDA workloads) on the other:
from llamatelemetry.kaggle import split_gpu_session
with split_gpu_session(llm_gpu=0, graph_gpu=1):
    # GPU 0: inference workload
    engine = lt.InferenceEngine(enable_telemetry=False)
    engine.load_model("gemma-3-1b-Q4_K_M", auto_start=True)
    result = engine.infer("What is GPU memory splitting?", max_tokens=96)
    print(result.text)
    # GPU 1 is available for Graphistry, cuGraph, or other CUDA work
    # Example: pygraphistry visualization would use GPU 1 here
    engine.unload_model()
Inside the context, CUDA_VISIBLE_DEVICES is configured so that the LLM
server sees only llm_gpu and analytics libraries see only graph_gpu. When
the context exits, the original GPU visibility is restored.
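The save-and-restore behaviour described above can be illustrated with a small context manager. This is a sketch of the general pattern, not llamatelemetry's implementation; visible_gpus is a hypothetical name:

```python
import os
from contextlib import contextmanager


@contextmanager
def visible_gpus(*indices):
    """Temporarily restrict CUDA_VISIBLE_DEVICES, then restore the old value."""
    key = "CUDA_VISIBLE_DEVICES"
    previous = os.environ.get(key)
    os.environ[key] = ",".join(str(i) for i in indices)
    try:
        yield
    finally:
        # Restore the exact prior state, including "variable was unset".
        if previous is None:
            os.environ.pop(key, None)
        else:
            os.environ[key] = previous


with visible_gpus(0):
    # Child processes started here inherit CUDA_VISIBLE_DEVICES=0.
    print("Inside:", os.environ["CUDA_VISIBLE_DEVICES"])
print("Restored:", os.environ.get("CUDA_VISIBLE_DEVICES"))
```

The variable only affects processes that initialize CUDA after it is set, which is why a pattern like this is applied around the point where the server process is launched.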
7. KagglePipelineConfig
For more complex setups, KagglePipelineConfig bundles all configuration into a
single object:
from llamatelemetry.kaggle import KagglePipelineConfig, ServerPreset
config = KagglePipelineConfig(
    model_name="gemma-3-1b-Q4_K_M",
    preset=ServerPreset.KAGGLE_DUAL_T4,
    enable_telemetry=True,
    service_name="kaggle-llm-experiment",
    llm_gpu=0,
    graph_gpu=1,
)
print(f"Model: {config.model_name}")
print(f"Preset: {config.preset}")
print(f"Telemetry: {config.enable_telemetry}")
print(f"LLM GPU: {config.llm_gpu}")
print(f"Graph GPU: {config.graph_gpu}")
Use the config to drive the full pipeline:
from llamatelemetry.kaggle import start_server_from_preset
server = start_server_from_preset(
    preset=config.preset,
    model_name=config.model_name,
)
engine = lt.InferenceEngine(
    enable_telemetry=config.enable_telemetry,
    telemetry_config={
        "service_name": config.service_name,
        "enable_llama_metrics": True,
    },
)
engine.load_model(config.model_name, auto_start=False) # server already running
result = engine.infer("Explain KV cache eviction.", max_tokens=96)
print(result.text)
8. OTLP telemetry from Kaggle secrets
Kaggle provides a secrets store for API keys and credentials. The
load_grafana_otlp_env_from_kaggle() function reads OTLP endpoint and
authentication credentials from Kaggle secrets and sets the corresponding
environment variables:
from llamatelemetry.kaggle import load_grafana_otlp_env_from_kaggle
# Reads from Kaggle secrets:
# GRAFANA_OTLP_ENDPOINT
# GRAFANA_OTLP_TOKEN
# and sets:
# OTEL_EXPORTER_OTLP_ENDPOINT
# OTEL_EXPORTER_OTLP_HEADERS
load_grafana_otlp_env_from_kaggle()
Setting up Kaggle secrets
Before using this function, add the following secrets in your Kaggle notebook settings:
- GRAFANA_OTLP_ENDPOINT -- Your Grafana Cloud OTLP endpoint (e.g., https://otlp-gateway-prod-us-central-0.grafana.net/otlp).
- GRAFANA_OTLP_TOKEN -- A base64-encoded instance_id:api_key string for Grafana Cloud authentication.
To add secrets: Settings (right sidebar) > Secrets > Add a new secret.
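The token secret is the base64 encoding of instance_id:api_key, which can be produced locally before pasting it into Kaggle. A minimal sketch; encode_basic_token and otlp_env are hypothetical helpers, and the Basic authorization header format is an assumption about what the environment variables end up containing, since this guide does not show the exact value load_grafana_otlp_env_from_kaggle() writes:

```python
import base64


def encode_basic_token(instance_id, api_key):
    """Base64-encode 'instance_id:api_key' for use as a Basic auth credential."""
    raw = f"{instance_id}:{api_key}".encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


def otlp_env(endpoint, token):
    """Build the two OTLP environment variables from an endpoint and token."""
    return {
        "OTEL_EXPORTER_OTLP_ENDPOINT": endpoint,
        # The space is percent-encoded because header values in this variable
        # are URL-decoded by OpenTelemetry SDKs (assumed format, see above).
        "OTEL_EXPORTER_OTLP_HEADERS": f"Authorization=Basic%20{token}",
    }


# Placeholder credentials for illustration only.
token = encode_basic_token("123456", "example-api-key")
for name, value in otlp_env("https://example.invalid/otlp", token).items():
    print(f"{name}={value}")
```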
9. Full OTEL + client setup
The setup_otel_and_client() helper combines telemetry initialization and
client creation into a single call:
from llamatelemetry.kaggle import (
    load_grafana_otlp_env_from_kaggle,
    setup_otel_and_client,
    start_server_from_preset,
    ServerPreset,
)
# Step 1: Load OTLP credentials from Kaggle secrets
load_grafana_otlp_env_from_kaggle()
# Step 2: Start the server
server = start_server_from_preset(
    preset=ServerPreset.KAGGLE_DUAL_T4,
    model_name="gemma-3-1b-Q4_K_M",
)
# Step 3: Initialize OTEL and create an instrumented client
otel_client = setup_otel_and_client(
    service_name="kaggle-observability-demo",
    server_url="http://127.0.0.1:8080",
)
# Now all requests through otel_client emit traces and metrics
result = otel_client.chat_completions(
    messages=[{"role": "user", "content": "What is OpenTelemetry?"}],
    max_tokens=128,
)
print(result["choices"][0]["message"]["content"])
Traces and metrics are automatically exported to the configured OTLP endpoint. View them in Grafana Cloud dashboards or any OTLP-compatible backend.
10. Weights & Biases integration
For experiment tracking, combine llamatelemetry with Weights & Biases:
import wandb
# Initialize W&B (expects WANDB_API_KEY to be set in the environment)
wandb.init(
    project="llamatelemetry-kaggle",
    config={
        "model": "gemma-3-1b-Q4_K_M",
        "max_tokens": 128,
        "temperature": 0.7,
        "gpu": "T4 x2",
    },
)
engine = lt.InferenceEngine(enable_telemetry=False)
engine.load_model("gemma-3-1b-Q4_K_M", auto_start=True)
prompts = [
    "Explain flash attention.",
    "What is model quantization?",
    "How does continuous batching work?",
]
for prompt in prompts:
    result = engine.infer(prompt, max_tokens=128, temperature=0.7)
    # Log each inference result to W&B
    wandb.log({
        "prompt": prompt,
        "tokens_generated": result.tokens_generated,
        "latency_ms": result.latency_ms,
        "tokens_per_sec": result.tokens_per_sec,
        "success": result.success,
    })
    print(f"{result.tokens_per_sec:.1f} tok/s | {prompt[:50]}")
engine.unload_model()
wandb.finish()
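Store WANDB_API_KEY in Kaggle secrets and copy it into the WANDB_API_KEY environment variable before calling wandb.init(); Kaggle does not expose secrets as environment variables automatically.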
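Kaggle secrets are read through the kaggle_secrets module rather than through environment variables, so the key has to be copied across before wandb.init() runs. A minimal sketch, assuming the secret is stored under the label WANDB_API_KEY; load_secret_into_env is a hypothetical helper, not part of llamatelemetry:

```python
import os


def load_secret_into_env(name, env=os.environ):
    """Copy a Kaggle secret into an environment mapping; return True on success."""
    if env.get(name):
        return True  # already set, nothing to do
    try:
        # Only importable inside a Kaggle notebook session.
        from kaggle_secrets import UserSecretsClient
    except ImportError:
        return False
    try:
        env[name] = UserSecretsClient().get_secret(name)
    except Exception:
        # Secret missing or not attached to this notebook.
        return False
    return True


if load_secret_into_env("WANDB_API_KEY"):
    print("WANDB_API_KEY is set; wandb.init() can authenticate.")
else:
    print("WANDB_API_KEY not found; add it under Kaggle secrets.")
```

Run this cell before the wandb.init() call so the key is in place when W&B looks for it.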
11. Recommended models for T4 VRAM
Each Tesla T4 has 16 GB of VRAM. The following table shows recommended models from the built-in registry and their approximate VRAM requirements at Q4_K_M quantization:
| Model | Parameters | VRAM (approx.) | Fits on | Notes |
|---|---|---|---|---|
| gemma-3-1b-Q4_K_M | 1B | ~1.5 GB | Single T4 | Fast prototyping, default small model |
| qwen2.5-1.5b-Q4_K_M | 1.5B | ~2 GB | Single T4 | Good instruction following |
| gemma-2-2b-Q4_K_M | 2B | ~2.5 GB | Single T4 | Balanced quality/speed |
| phi-3-mini-Q4_K_M | 3.8B | ~3 GB | Single T4 | Strong reasoning for size |
| llama-3.2-3b-Q4_K_M | 3B | ~2.5 GB | Single T4 | Meta Llama family |
| mistral-7b-Q4_K_M | 7B | ~5 GB | Single T4 | Strong general-purpose |
| llama-3.1-8b-Q4_K_M | 8B | ~6 GB | Single T4 | Excellent quality |
| gemma-2-9b-Q4_K_M | 9B | ~7 GB | Single T4 | High quality, fits T4 |
| llama-3.1-70b-Q4_K_M | 70B | ~40 GB | Dual T4 (split) with CPU offload | Exceeds the 32 GB of combined VRAM; needs layer splitting plus partial CPU offload or a lower-bit quantization |
For models that do not fit in a single T4's 16 GB, use ServerPreset.KAGGLE_DUAL_T4 to split layers
across both GPUs. At roughly 40 GB, the 70B model at Q4_K_M is larger than the 32 GB the two T4s
provide together, so it cannot run entirely in VRAM: expect to offload some layers to CPU or to use
a lower-bit quantization, and keep the context size small.
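The placement decision is a matter of comparing a model's footprint against the VRAM on one or both cards. A minimal sketch of that arithmetic; placement is a hypothetical helper, the 15 GB usable figure comes from the 15360 MB each T4 reports, and the 1.5 GB allowance for KV cache and runtime overhead is an illustrative assumption that grows with context size:

```python
def placement(model_gb, overhead_gb=1.5, per_gpu_gb=15.0, gpu_count=2):
    """Classify where a model of the given size can be held in VRAM."""
    needed = model_gb + overhead_gb
    if needed <= per_gpu_gb:
        return "single T4"
    if needed <= per_gpu_gb * gpu_count:
        return "dual T4 (split)"
    return "exceeds VRAM"


for name, size_gb in [("1B", 1.5), ("9B", 7.0), ("70B", 40.0)]:
    print(f"{name}: {placement(size_gb)}")
```

Under these assumptions a 40 GB model exceeds the combined budget of two T4s, so it would need CPU offload or a smaller quantization.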
Choosing a model
- Quick experiments: gemma-3-1b-Q4_K_M (fastest load, minimal VRAM).
- Quality inference: llama-3.1-8b-Q4_K_M or gemma-2-9b-Q4_K_M.
- Split-GPU with analytics: Use a single-T4 model and keep GPU 1 free for Graphistry or cuGraph.
- Maximum model size: llama-3.1-70b-Q4_K_M with dual-T4 splitting (exceeds combined VRAM at Q4_K_M, so reduce ctx_size and expect CPU offload or a lower-bit quantization).
12. Complete Kaggle notebook example
This cell-by-cell example combines the steps above into a single end-to-end Kaggle workflow:
Cell 1: Install
!pip -q install --no-cache-dir --force-reinstall \
git+https://github.com/llamatelemetry/llamatelemetry.git@v0.1.1
!pip -q install opentelemetry-exporter-otlp-proto-http
Cell 2: Verify environment
import llamatelemetry as lt
cuda_info = lt.detect_cuda()
assert cuda_info["available"], "CUDA not available"
assert len(cuda_info["gpus"]) >= 2, "Need 2 GPUs for dual-T4 preset"
print(f"Ready: {len(cuda_info['gpus'])} x {cuda_info['gpus'][0]['name']}")
lt.setup_environment()
Cell 3: Configure pipeline
from llamatelemetry.kaggle import (
    KagglePipelineConfig,
    ServerPreset,
    load_grafana_otlp_env_from_kaggle,
    split_gpu_session,
)
# Load telemetry credentials (optional, skip if not using Grafana)
# load_grafana_otlp_env_from_kaggle()
config = KagglePipelineConfig(
    model_name="gemma-3-1b-Q4_K_M",
    preset=ServerPreset.KAGGLE_DUAL_T4,
    enable_telemetry=False,
    llm_gpu=0,
    graph_gpu=1,
)
Cell 4: Run inference
with split_gpu_session(llm_gpu=config.llm_gpu, graph_gpu=config.graph_gpu):
    engine = lt.InferenceEngine(enable_telemetry=config.enable_telemetry)
    engine.load_model(config.model_name, auto_start=True)
    # Single inference
    result = engine.infer(
        "What are the key differences between FP16 and INT8 quantization?",
        max_tokens=128,
        temperature=0.7,
    )
    print(result.text)
    print(f"\n{result.tokens_per_sec:.1f} tokens/sec | "
          f"{result.latency_ms:.0f} ms | "
          f"{result.tokens_generated} tokens")
    # Batch inference
    prompts = [
        "Explain KV cache in one paragraph.",
        "What is flash attention?",
        "How does GGUF quantization preserve model quality?",
    ]
    results = engine.batch_generate(prompts, max_tokens=96)
    for i, r in enumerate(results):
        print(f"\n--- Result {i + 1} ({r.tokens_per_sec:.1f} tok/s) ---")
        print(r.text)
    engine.unload_model()
Cell 5: Review metrics
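One way to review the metrics from Cell 4 is to aggregate the per-result fields used above (tokens_generated, latency_ms, tokens_per_sec). A minimal sketch; Result is a stand-in dataclass so the cell runs on its own, and summarize is a hypothetical helper rather than an llamatelemetry function. In the notebook, pass the results list from Cell 4 instead of the sample data:

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Result:
    """Stand-in with the same metric fields as the inference results above."""

    tokens_generated: int
    latency_ms: float
    tokens_per_sec: float


def summarize(results):
    """Aggregate request count, token totals, latency, and throughput."""
    if not results:
        return {}
    return {
        "requests": len(results),
        "total_tokens": sum(r.tokens_generated for r in results),
        "mean_latency_ms": mean(r.latency_ms for r in results),
        "max_latency_ms": max(r.latency_ms for r in results),
        "mean_tokens_per_sec": mean(r.tokens_per_sec for r in results),
    }


# Sample data for illustration only, not measured numbers.
sample_results = [Result(96, 2000.0, 48.0), Result(96, 3000.0, 32.0)]
for key, value in summarize(sample_results).items():
    print(f"{key}: {value}")
```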
Recommended notebooks
For deeper Kaggle workflows, explore these production-tested notebooks:
- 01 Quickstart -- Foundation setup and first inference.
- 03 Multi-GPU Inference -- Dual-T4 layer splitting and multi-GPU configurations.
- 06 Split-GPU Graphistry -- LLM on GPU 0, graph visualization on GPU 1.
- 09 Large Models Kaggle -- Running 70B models on dual T4 with aggressive quantization.
- 14 OpenTelemetry Observability -- Full OTLP pipeline on Kaggle.
- 17 W&B Integration -- Weights & Biases experiment tracking.
Next steps
- Quickstart -- general-purpose quickstart for local workstations.
- Kaggle Environment Guide -- advanced Kaggle patterns, secrets management, and preset customization.
- Telemetry and Observability -- OpenTelemetry configuration, Gen AI attributes, and dashboards.
- Graphistry and RAPIDS -- GPU-accelerated graph visualization on the second T4.