# 04: GGUF Quantization

## Overview

Phase: Foundation | Difficulty: Beginner | Duration: 20 min

Learn GGUF quantization for LLM inference on Kaggle's dual Tesla T4 GPUs.
## Learning Objectives
- ✅ Understand quantization types (K-quants, I-quants)
- ✅ Calculate VRAM requirements
- ✅ Select optimal quantization for Kaggle T4
- ✅ Benchmark different quantizations
- ✅ Learn GGUF file structure
## Topics Covered
- 📚 29 quantization types
- 📚 VRAM estimation
- 📚 Quality vs size
## Prerequisites
- llamatelemetry v0.1.0 installed
- Kaggle dual Tesla T4 environment (30GB VRAM)
- Basic Python knowledge
- Completed Tutorial 01
## Quick Start

```python
# Install llamatelemetry
!pip install -q --no-cache-dir git+https://github.com/llamatelemetry/llamatelemetry.git@v0.1.0

# Verify GPU environment
!nvidia-smi --query-gpu=index,name,memory.total --format=csv
```
## Key Concepts

### Quantization Types

K-Quants (Recommended):

- Q4_K_M - Best balance (4.85 bits/weight)
- Q5_K_M - Higher quality (5.69 bits/weight)
- Q6_K - Near lossless (6.59 bits/weight)

I-Quants (Large Models):

- IQ3_XS - For 70B models on dual T4 (3.30 bits/weight)
- IQ4_XS - Better quality (4.25 bits/weight)
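The bits-per-weight figures above translate directly into weight size. A minimal back-of-the-envelope sketch (real VRAM usage adds the KV cache, CUDA context, and compute buffers on top of the weights):

```python
# Rough weight-size estimate from bits/weight; a sketch, not the exact
# loader math. Actual VRAM adds KV cache, CUDA context, and buffers.
def weight_vram_gb(params_billions: float, bits_per_weight: float) -> float:
    # params (billions) * bits per weight / 8 bits per byte ~= GB of weights
    return params_billions * bits_per_weight / 8

# Llama-3.1 8B at Q4_K_M (4.85 bits/weight) -> ~4.9 GB of weights
print(f"8B  @ Q4_K_M: {weight_vram_gb(8, 4.85):.2f} GB")
# 70B at IQ3_XS (3.30 bits/weight) -> ~28.9 GB, tight on a 30 GB dual T4
print(f"70B @ IQ3_XS: {weight_vram_gb(70, 3.30):.2f} GB")
```

The 8B figure lines up with the ~5 GB entry in the benchmark table below once cache and overhead are included, and the 70B figure shows why IQ3_XS is the practical choice for 70B models on dual T4s.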
## Step-by-Step Guide

### Step 1: Environment Setup

Verify that your Kaggle session has both Tesla T4 GPUs visible, as in the check below.
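A quick check, assuming PyTorch is available (it is preinstalled on Kaggle GPU images):

```python
# Confirm both Tesla T4s are visible before loading any model.
import torch

assert torch.cuda.is_available(), "No CUDA device visible"
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GiB")
# Expected: two "Tesla T4" entries at ~15 GiB each (~30 GB total).
```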
### Step 2: Installation

Install llamatelemetry v0.1.0 with all dependencies (the Quick Start command above), then confirm the package imports.
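A minimal import sanity check; the module name and `__version__` attribute are assumptions about the package layout, not confirmed API:

```python
# Verify the install; module name and __version__ are assumed, not confirmed.
import llamatelemetry

print(getattr(llamatelemetry, "__version__", "version attribute not found"))
```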
### Step 3: Configuration

Configure the server to split the model across both T4s for optimal performance on Kaggle; a sketch follows.
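llamatelemetry's own launch API is not shown here; as an illustration, a GGUF model can be served across both T4s with llama.cpp's `llama-server`, using its standard flags. The binary location and model path below are hypothetical placeholders:

```python
# Launch a llama.cpp server split across both T4s. The model path is a
# hypothetical placeholder; the flags are standard llama-server options.
import subprocess

server = subprocess.Popen([
    "llama-server",
    "-m", "/kaggle/working/model-Q4_K_M.gguf",  # hypothetical GGUF path
    "--n-gpu-layers", "99",    # offload all layers to GPU
    "--split-mode", "layer",   # distribute layers across GPUs
    "--tensor-split", "1,1",   # balance VRAM 50/50 between the two T4s
    "--ctx-size", "4096",      # lower this first if you run out of memory
    "--port", "8080",
])
```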
### Step 4: Implementation

Work through the tutorial objectives; the snippet below benchmarks generation speed so you can compare quantizations head to head.
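A sketch of a throughput benchmark against the server started above, assuming it exposes llama.cpp's OpenAI-compatible `/v1/chat/completions` endpoint on port 8080. Run it once per quantization and compare the tok/s numbers:

```python
# Measure end-to-end generation speed (tokens/second) via the
# OpenAI-compatible chat endpoint; host/port assume the server above.
import time
import requests

def bench(prompt: str, max_tokens: int = 128) -> float:
    start = time.perf_counter()
    r = requests.post(
        "http://localhost:8080/v1/chat/completions",
        json={
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": max_tokens,
        },
        timeout=300,
    )
    r.raise_for_status()
    elapsed = time.perf_counter() - start
    generated = r.json()["usage"]["completion_tokens"]
    return generated / elapsed

print(f"{bench('Explain GGUF quantization in one paragraph.'):.1f} tok/s")
```

This timing includes prompt processing, so it slightly understates pure generation speed, but it is consistent enough for comparing quantizations against each other.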
### Step 5: Verification

Test that the server is up and the model is fully loaded before relying on any benchmark numbers.
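A smoke test, assuming the llama.cpp-style `/health` endpoint, which returns 200 once the model has finished loading:

```python
# Verify the server is up and the model is loaded before benchmarking.
import requests

resp = requests.get("http://localhost:8080/health", timeout=5)
assert resp.status_code == 200, f"Server not ready: HTTP {resp.status_code}"
print("Server healthy, model loaded")
```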
## Expected Output

After completing this tutorial, you should be able to:

- ✅ Explain the K-quant and I-quant families
- ✅ Estimate VRAM requirements for a model/quantization pair
- ✅ Select an optimal quantization for Kaggle's dual T4s
- ✅ Benchmark different quantizations
- ✅ Navigate the GGUF file structure (see the sketch after this list)
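For the last item, a minimal way to inspect a GGUF file's structure is the `gguf` Python package published from llama.cpp's gguf-py (`pip install gguf`); the model path below is a hypothetical placeholder:

```python
# Dump GGUF metadata keys and the first few tensors with their
# quantization types; assumes `pip install gguf` and a local model file.
from gguf import GGUFReader

reader = GGUFReader("/kaggle/working/model-Q4_K_M.gguf")  # hypothetical path

# Key-value metadata: architecture, context length, tokenizer, etc.
for name in list(reader.fields)[:10]:
    print("field :", name)

# Tensor directory: per-tensor name, shape, and quantization type
for t in reader.tensors[:5]:
    print("tensor:", t.name, list(t.shape), t.tensor_type.name)
```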
## Common Issues

### Issue: Server Fails to Start

Solution: Check free GPU memory with nvidia-smi and make sure no other processes are holding the GPUs.
### Issue: Out of Memory

Solution: Reduce the context size (e.g., --ctx-size 2048) or switch to a smaller quantization, such as Q4_K_M instead of Q6_K.
## Performance Benchmarks
Expected performance on Kaggle dual Tesla T4:
| Model | Quantization | Speed | VRAM |
|---|---|---|---|
| Gemma-3 1B | Q4_K_M | ~85 tok/s | ~1 GB |
| Gemma-3 4B | Q4_K_M | ~42 tok/s | ~2.5 GB |
| Llama-3.1 8B | Q4_K_M | ~25 tok/s | ~5 GB |
## Next Steps
- Continue to Tutorial 05
- Explore the API Reference
- Read the Architecture Guide
- Check the Troubleshooting Guide
## Resources

### Full Notebook

View and run the complete notebook on Kaggle.
Tutorial 04 | llamatelemetry v0.1.0 | Back to Tutorials