Troubleshooting¶
This guide covers common issues you may encounter when installing, configuring, and using llamatelemetry. Each section describes symptoms, causes, and fixes.
Installation Issues¶
Package Fails to Install¶
Symptoms: `pip install llamatelemetry` fails with build errors.
Fixes:
- Ensure you are using Python 3.11 or later: `python3 --version`
- Upgrade pip and setuptools: `pip install --upgrade pip setuptools`
- If the CMake build fails for the C++/CUDA extension, install CMake: `pip install cmake`
- On Kaggle, install in a single cell at the top of the notebook to avoid dependency conflicts
C++/CUDA Extension Build Failure¶
Symptoms: Build errors mentioning `nvcc`, `cudart`, or `cublas` during installation.
Fixes:
- Verify the CUDA toolkit is installed: `nvcc --version`
- Ensure `CUDA_HOME` or `CUDA_PATH` is set
- For Tesla T4, confirm CUDA 12.x is installed (SM 7.5 support is required)
- If building from source, ensure CMake can find CUDA: `cmake -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc ..`
Import Errors After Install¶
Symptoms: `import llamatelemetry` fails with `ModuleNotFoundError` for dependencies.
Fixes:
- Install core dependencies: `pip install numpy requests huggingface_hub tqdm opentelemetry-api opentelemetry-sdk`
- For optional modules, install their specific dependencies
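Which extras you need depends on the modules you use. As an illustration, the optional packages named elsewhere in this guide (in the error table and the telemetry section) can be installed like this; this is not an exhaustive list:

```shell
# Optional fine-tuning/export dependencies referenced in this guide:
pip install unsloth triton
# Optional OTLP exporters for telemetry:
pip install opentelemetry-exporter-otlp-proto-grpc opentelemetry-exporter-otlp-proto-http
```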
CUDA Detection Issues¶
CUDA Not Detected¶
Symptoms:
- `detect_cuda()` returns `available: False`
- `torch.cuda.is_available()` returns `False`
- `llama-server` fails to start with GPU support
Fixes:
- Verify the NVIDIA driver is installed and GPUs are visible by running `nvidia-smi`. If this command fails, install or update the NVIDIA drivers.
- Check that the CUDA toolkit version (`nvcc --version`) matches what your driver supports.
- Verify you are not in a CPU-only container or virtual environment.
- On Kaggle, ensure the notebook accelerator is set to GPU (T4 x2) in Settings.
- If using PyTorch, ensure you have the CUDA-enabled build.
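A sketch of the driver/toolkit checks and the PyTorch reinstall. The cu121 wheel index shown is one option; pick the variant that matches your CUDA 12.x toolkit:

```shell
# 1. Driver check -- should list every GPU and the driver's supported CUDA version:
nvidia-smi
# 2. Toolkit check -- the release must not exceed what the driver supports:
nvcc --version
# 3. Reinstall PyTorch from a CUDA wheel index (CUDA 12.1 shown):
pip install torch --index-url https://download.pytorch.org/whl/cu121
```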
Wrong GPU Selected¶
Symptoms: Operations run on GPU 0 when you expect GPU 1, or vice versa.
Fixes:
- Set the `CUDA_VISIBLE_DEVICES` environment variable before launching
- In Python, specify the device index explicitly
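For example, restricting visibility from the shell. Note that `CUDA_VISIBLE_DEVICES` remaps devices, so the chosen GPU appears as device 0 inside the process:

```shell
# Expose only physical GPU 1 to the process; it is then addressed as device 0.
export CUDA_VISIBLE_DEVICES=1
```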
Server Startup Problems¶
llama-server Not Found¶
Symptoms:
- `ServerManager.find_llama_server()` returns `None`
- `InferenceEngine.load_model()` raises a runtime error about a missing server binary
Fixes:
- Set the path to your llama-server binary via the `LLAMA_SERVER_PATH` environment variable
- If you built llama.cpp manually, point the package at your build directory
- Reinstall the package to trigger the bootstrap binary download
- Verify the binary is executable
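As a sketch, pointing the package at an existing binary (`LLAMA_SERVER_PATH` is the variable named in this guide; the path is a placeholder):

```shell
# Tell llamatelemetry where the binary lives:
export LLAMA_SERVER_PATH=/path/to/llama-server
# If the lookup still fails, ensure the file carries the executable bit:
# chmod +x /path/to/llama-server
```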
Server Fails to Start¶
Symptoms: Server process starts but immediately exits or hangs.
Fixes:
- Check the server log output for error messages
- Ensure the model file exists and is readable
- Verify sufficient VRAM is available with `nvidia-smi`
- Try starting the server with minimal options
- Check whether another process is using the default port (8080); use a different port if needed
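The file, VRAM, and port checks can be run as shell one-liners (the model path is a placeholder; `lsof` may require elevated privileges):

```shell
ls -lh /path/to/model.gguf                                    # model exists and is readable
nvidia-smi --query-gpu=memory.used,memory.total --format=csv  # free VRAM per GPU
lsof -i :8080                                                 # who owns the default port
```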
Server Responds with Errors¶
Symptoms: Server starts but returns HTTP 500 or empty responses.
Fixes:
- Wait for model loading to complete before sending requests (the engine handles this automatically, but manual server use requires patience)
- Check server health via the server's `/health` endpoint
- Ensure the model format is correct (must be GGUF)
- Try reducing the context size if the model is too large for available VRAM
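For instance, assuming the default port of 8080, the health probe looks like:

```shell
# Returns an OK status once the model has finished loading; a connection
# refusal or error means the server is still loading or has exited.
curl http://localhost:8080/health
```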
Out of Memory (OOM) Errors¶
GPU OOM During Inference¶
Symptoms: CUDA out of memory errors during model loading or inference.
Fixes:
- Use a more aggressively quantized model (e.g. Q4_K_M instead of Q8_0)
- Reduce the context length
- Reduce the batch size
- Offload some layers to the CPU
- Free unused GPU memory
- Monitor VRAM usage with `nvidia-smi`
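Assuming llamatelemetry forwards options to llama-server, the context, batch, and offload knobs map onto standard llama.cpp server flags, sketched here with illustrative values:

```shell
# Smaller context, smaller batch, and partial CPU offload (llama.cpp flag names):
llama-server -m model-Q4_K_M.gguf --ctx-size 2048 --batch-size 256 --n-gpu-layers 20
# Watch VRAM while the server runs:
nvidia-smi --query-gpu=memory.used,memory.total --format=csv
```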
CPU OOM During Export¶
Symptoms: Process killed or MemoryError during GGUF export.
Fixes:
- LoRA adapter merging temporarily doubles memory usage. Use `load_in_4bit=True` to reduce the base footprint.
- Close other memory-intensive applications.
- On Kaggle, the notebook runtime has limited RAM. Consider exporting on a local machine with more memory.
Tesla T4 VRAM Guidelines¶
| Model Size | Quantization | Approx. VRAM | Fits on T4 (16 GB)? |
|---|---|---|---|
| 1B | Q4_K_M | ~0.8 GB | Yes |
| 3B | Q4_K_M | ~2.2 GB | Yes |
| 7B | Q4_K_M | ~4.1 GB | Yes |
| 7B | Q8_0 | ~7.2 GB | Yes |
| 13B | Q4_K_M | ~7.4 GB | Yes (tight) |
| 13B | Q8_0 | ~14 GB | Barely |
| 70B | Q4_K_M | ~38 GB | No (need multi-GPU) |
Multi-GPU Issues¶
GPUs Not Detected¶
Symptoms: `MultiGPUConfig` shows fewer GPUs than expected.
Fixes:
- Verify all GPUs are visible: `nvidia-smi --list-gpus`
- Check that `CUDA_VISIBLE_DEVICES` is not restricting visibility
- Ensure all GPUs have compatible drivers
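To rule out an accidental restriction, inspect and clear the variable (a quick sketch):

```shell
# Show the current restriction, if any:
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES-<unset>}"
# Remove the restriction for this shell session:
unset CUDA_VISIBLE_DEVICES
```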
Split-Mode Errors¶
Symptoms: Multi-GPU inference fails or produces garbage output.
Fixes:
- Ensure both GPUs have the same architecture (e.g., both Tesla T4)
- Check that NCCL can communicate between the GPUs
- Try layer split mode first (it is simpler than tensor split)
- On Kaggle dual-T4, use the recommended split-GPU session
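If llamatelemetry forwards llama.cpp's server flags (an assumption here), the two split modes look like this:

```shell
# Layer split: whole layers are assigned to each GPU (simpler, try this first):
llama-server -m model.gguf --split-mode layer
# Tensor split: each tensor is divided across GPUs, here 50/50 over two T4s:
llama-server -m model.gguf --split-mode row --tensor-split 1,1
```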
NCCL Communication Failures¶
Symptoms: Hangs or errors mentioning NCCL during multi-GPU operations.
Fixes:
- Set NCCL environment variables for debugging
- Ensure `LD_LIBRARY_PATH` includes the NCCL library directory
- Try disabling specific transports to isolate the issue
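These are standard NCCL environment variables (not specific to llamatelemetry); a typical debugging setup:

```shell
# Verbose NCCL logging:
export NCCL_DEBUG=INFO
# Disable transports one at a time to isolate the failing path:
export NCCL_P2P_DISABLE=1   # rule out GPU peer-to-peer
export NCCL_IB_DISABLE=1    # rule out InfiniBand
```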
Telemetry and OpenTelemetry Issues¶
OpenTelemetry Not Available¶
Symptoms:
- `setup_telemetry()` returns `(None, None)`
- Telemetry spans are not being collected
Fixes:
Install the OpenTelemetry packages:

```shell
pip install opentelemetry-api opentelemetry-sdk
pip install opentelemetry-exporter-otlp-proto-grpc
pip install opentelemetry-exporter-otlp-proto-http
```
OTLP Exporter Connection Refused¶
Symptoms: Telemetry data is not reaching your collector. Errors mentioning connection refused or timeout.
Fixes:
- Verify the collector is running and accessible
- Check the endpoint configuration
- If using Jaeger or Grafana, ensure their OTLP receivers are enabled
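For example, using the standard OTLP ports (4317 for gRPC, 4318 for HTTP) and the standard OpenTelemetry endpoint variable; the collector hostname is a placeholder:

```shell
# Is anything listening on the OTLP/HTTP port?
curl -sv http://localhost:4318/v1/traces -o /dev/null
# Point the SDK at a non-default collector:
export OTEL_EXPORTER_OTLP_ENDPOINT=http://my-collector:4318
```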
Missing Telemetry Attributes¶
Symptoms: Spans are created but `gen_ai.*` attributes are missing.
Fixes:
- Ensure you are using the instrumented inference methods (not raw HTTP calls)
- Check that the telemetry module was properly initialized before making inference calls
- Verify the OpenTelemetry SDK version is compatible (0.40+ recommended)
Model Download Issues¶
Registry Download Fails¶
Symptoms: Model download hangs, times out, or produces corrupted files.
Fixes:
- Verify internet connectivity in your runtime environment
- Provide a local GGUF path instead of a model name
- For gated models, set your HuggingFace token
- If the download was interrupted, delete the partial file and retry
- Check available disk space -- GGUF files can be several gigabytes
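The token can be supplied either through the standard `HF_TOKEN` environment variable read by huggingface_hub, or via an interactive login (the token value is a placeholder):

```shell
# Option 1: environment variable:
export HF_TOKEN=hf_your_token_here
# Option 2: interactive login, which stores the token on disk:
huggingface-cli login
```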
Bootstrap Binary Download Fails¶
Symptoms: The first `import llamatelemetry` fails to download the llama-server binary (~961 MB).
Fixes:
- Ensure you have a stable internet connection and sufficient disk space
- Check if a proxy or firewall is blocking the download
- Set a custom download location
- Download the binary manually and set `LLAMA_SERVER_PATH`
Missing Shared Libraries¶
Symptoms: `llama-server` or the C++ extension fails with errors about missing `.so` files (e.g., `libnccl.so`, `libcublas.so`).
Fixes:
- Ensure `LD_LIBRARY_PATH` includes the llamatelemetry lib directory
- Re-import llamatelemetry to re-run the bootstrap
- Verify the CUDA libraries are on the library path
- On Kaggle, the CUDA libraries are in `/usr/local/cuda/lib64` -- ensure this directory is on your path
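A sketch of the library-path fix (the Kaggle CUDA location is the one named above; the llamatelemetry lib directory depends on where the package is installed):

```shell
# Prepend the CUDA library directory to the loader path:
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:${LD_LIBRARY_PATH}
# Confirm the loader can resolve the CUDA runtime libraries:
ldconfig -p | grep -E 'libcublas|libcudart' || echo "CUDA libraries not found"
```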
Kaggle-Specific Issues¶
Accelerator Not Set¶
Symptoms: No GPU detected in the Kaggle notebook.
Fix: Go to Settings (right sidebar) and set Accelerator to GPU T4 x2.
Package Installation Order¶
Symptoms: Import errors or version conflicts after installing packages.
Best Practice: Install all packages in a single cell at the top of the notebook:
Avoid running pip install in multiple cells, as this can cause dependency resolution issues.
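A single install cell might look like this (package set drawn from this guide; pin versions if you need reproducibility):

```shell
pip install llamatelemetry numpy requests huggingface_hub tqdm \
    opentelemetry-api opentelemetry-sdk \
    opentelemetry-exporter-otlp-proto-grpc
```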
Disk Space Limits¶
Symptoms: Downloads fail or the kernel crashes due to insufficient disk space.
Fixes:
- Kaggle provides ~20 GB of disk space; large models and binaries can exhaust this
- Use smaller quantized models (Q4_K_M instead of Q8_0)
- Clean up temporary files
- Add model files as Kaggle datasets rather than downloading them each run
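For example (the cache location is huggingface_hub's default; check sizes with `du` before deleting anything):

```shell
df -h /                                        # how much space is left
du -sh ~/.cache/huggingface 2>/dev/null || true  # size of the Hugging Face cache
rm -rf ~/.cache/huggingface/hub                # drop cached downloads you no longer need
```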
Session Timeout¶
Symptoms: Long-running export or inference tasks are interrupted.
Fix: Kaggle sessions time out after extended idle periods. Keep the notebook active or break long tasks into smaller cells with intermediate saves.
Common Error Messages¶
| Error | Cause | Fix |
|---|---|---|
| `RuntimeError: CUDA out of memory` | Model too large for VRAM | Use a smaller model or more aggressive quantization |
| `ConnectionRefusedError: [Errno 111]` | Server not running | Start the server or check the port |
| `FileNotFoundError: llama-server` | Binary not found | Set `LLAMA_SERVER_PATH` |
| `ImportError: No module named 'unsloth'` | Unsloth not installed | `pip install unsloth` |
| `ImportError: No module named 'triton'` | Triton not installed | `pip install triton` |
| `json.JSONDecodeError` | Malformed server response | Check server health, restart if needed |
| `TimeoutError` | Server not responding | Increase the timeout or check server load |
| `Warning: CUDA not available` | No GPU or wrong PyTorch build | Install CUDA-enabled PyTorch |
Diagnostic Checklist¶
When reporting an issue, gather the following information:
```python
import sys
import torch

print(f"Python: {sys.version}")
print(f"PyTorch: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"GPU: {torch.cuda.get_device_name()}")
    print(f"GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

import llamatelemetry
print(f"llamatelemetry version: {llamatelemetry.__version__}")
```
Getting Help¶
If the troubleshooting steps above do not resolve your issue:
- Check the API Reference for detailed parameter documentation
- Review the Notebook Hub for working examples
- Inspect the `tests/` directory in the source repository for runnable verification patterns
- Search existing issues on GitHub
- File a new issue with the diagnostic checklist output above