01: Quick Start¶

Overview¶

Phase: Foundation Difficulty: Beginner Duration: 10 min

Learn quick start for LLM inference on Kaggle dual Tesla T4 GPUs.

Learning Objectives¶

✅ Verify Kaggle dual T4 environment
✅ Install llamatelemetry v0.1.0
✅ Download and load a GGUF model
✅ Start llama-server
✅ Run first inference
✅ Use streaming responses

Topics Covered¶

📚 Basic inference
📚 Model loading
📚 First steps

Prerequisites¶

llamatelemetry v0.1.0 installed
Kaggle dual Tesla T4 environment (30GB VRAM)
Basic Python knowledge

Quick Start¶

# Install llamatelemetry
!pip install -q --no-cache-dir git+https://github.com/llamatelemetry/llamatelemetry.git@v0.1.0

# Verify GPU environment
!nvidia-smi --query-gpu=index,name,memory.total --format=csv

Key Concepts¶

Server Configuration¶

llamatelemetry uses llama.cpp's llama-server for high-performance GGUF model inference.

from llamatelemetry.server import ServerManager

server = ServerManager()
server.start_server(
    model_path=model_path,
    gpu_layers=99,
    ctx_size=4096,
    tensor_split="0.5,0.5"  # Dual GPU split
)

Step-by-Step Guide¶

Step 1: Environment Setup¶

Verify your Kaggle environment has dual Tesla T4 GPUs.

Step 2: Installation¶

Install llamatelemetry v0.1.0 with all dependencies.

Step 3: Configuration¶

Configure the server for optimal performance on Kaggle T4.

Step 4: Implementation¶

Implement the tutorial objectives step by step.

Step 5: Verification¶

Test and verify the implementation works correctly.

Expected Output¶

After completing this tutorial, you should be able to:

✅ Verify Kaggle dual T4 environment
✅ Install llamatelemetry v0.1.0
✅ Download and load a GGUF model
✅ Start llama-server
✅ Run first inference
✅ Use streaming responses

Common Issues¶

Issue: Server Fails to Start¶

Solution: Check GPU memory and ensure no other processes are using the GPUs.

nvidia-smi

Issue: Out of Memory¶

Solution: Reduce context size or use lower quantization.

ctx_size=2048  # Instead of 4096

Performance Benchmarks¶

Expected performance on Kaggle dual Tesla T4:

Model	Quantization	Speed	VRAM
Gemma-3 1B	Q4_K_M	~85 tok/s	~1 GB
Gemma-3 4B	Q4_K_M	~42 tok/s	~2.5 GB
Llama-3.1 8B	Q4_K_M	~25 tok/s	~5 GB

Next Steps¶

Continue to Tutorial 02
Explore the API Reference
Read the Architecture Guide
Check the Troubleshooting Guide

Resources¶

Full Notebook¶

View and run the complete notebook on Kaggle:

Tutorial {num}/{len(NOTEBOOKS)} | llamatelemetry v0.1.0 | Back to Tutorials