Troubleshooting¶
This guide covers common issues you may encounter when installing, configuring, and using llamatelemetry. Each section describes symptoms, causes, and fixes.
Installation Issues¶
Package Fails to Install¶
Symptoms: `pip install llamatelemetry` fails with build errors.
Fixes:
- Ensure you are using Python 3.11 or later: `python3 --version`
- Upgrade pip and setuptools: `pip install --upgrade pip setuptools`
- If the CMake build fails for the C++/CUDA extension, install CMake: `pip install cmake`
- On Kaggle, install in a single cell at the top of the notebook to avoid dependency conflicts
C++/CUDA Extension Build Failure¶
Symptoms: Build errors mentioning `nvcc`, `cudart`, or `cublas` during installation.
Fixes:
- Verify the CUDA toolkit is installed: `nvcc --version`
- Ensure `CUDA_HOME` or `CUDA_PATH` is set
- For Tesla T4, confirm CUDA 12.x is installed (SM 7.5 support is required)
- If building from source, ensure CMake can find CUDA: `cmake -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc ..`
Import Errors After Install¶
Symptoms: `import llamatelemetry` fails with `ModuleNotFoundError` for dependencies.
Fixes:
- Install core dependencies: `pip install numpy requests huggingface_hub tqdm opentelemetry-api opentelemetry-sdk`
- For optional modules, install their specific dependencies
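Which extras you need depends on the modules you use. As an illustration, the optional packages named elsewhere in this guide (in the error table and the telemetry section) can be installed like this; this is not an exhaustive list:

```shell
# Optional fine-tuning/export dependencies referenced in this guide:
pip install unsloth triton
# Optional OTLP exporters for telemetry:
pip install opentelemetry-exporter-otlp-proto-grpc opentelemetry-exporter-otlp-proto-http
```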
CUDA Detection Issues¶
CUDA Not Detected¶
Symptoms:
- `detect_cuda()` returns `available: False`
- `torch.cuda.is_available()` returns `False`
- `llama-server` fails to start with GPU support
Fixes:
- Verify the NVIDIA driver is installed and GPUs are visible by running `nvidia-smi`. If this command fails, install or update the NVIDIA drivers.
- Check that the CUDA toolkit version (`nvcc --version`) matches what your driver supports.
- Verify you are not in a CPU-only container or virtual environment.
- On Kaggle, ensure the notebook accelerator is set to GPU (T4 x2) in Settings.
- If using PyTorch, ensure you have the CUDA-enabled build.
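A sketch of the driver/toolkit checks and the PyTorch reinstall. The cu121 wheel index shown is one option; pick the variant that matches your CUDA 12.x toolkit:

```shell
# 1. Driver check -- should list every GPU and the driver's supported CUDA version:
nvidia-smi
# 2. Toolkit check -- the release must not exceed what the driver supports:
nvcc --version
# 3. Reinstall PyTorch from a CUDA wheel index (CUDA 12.1 shown):
pip install torch --index-url https://download.pytorch.org/whl/cu121
```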
Wrong GPU Selected¶
Symptoms: Operations run on GPU 0 when you expect GPU 1, or vice versa.
Fixes:
- Set the `CUDA_VISIBLE_DEVICES` environment variable before launching
- In Python, specify the device index explicitly
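For example, restricting visibility from the shell. Note that `CUDA_VISIBLE_DEVICES` remaps devices, so the chosen GPU appears as device 0 inside the process:

```shell
# Expose only physical GPU 1 to the process; it is then addressed as device 0.
export CUDA_VISIBLE_DEVICES=1
```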
Server Startup Problems¶
llama-server Not Found¶
Symptoms:
- `ServerManager.find_llama_server()` returns `None`
- `InferenceEngine.load_model()` raises a runtime error about a missing server binary
Fixes:
- Set the path to your llama-server binary via the `LLAMA_SERVER_PATH` environment variable
- If you built llama.cpp manually, point the package at your build directory
- Reinstall the package to trigger the bootstrap binary download
- Verify the binary is executable
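As a sketch, pointing the package at an existing binary (`LLAMA_SERVER_PATH` is the variable named in this guide; the path is a placeholder):

```shell
# Tell llamatelemetry where the binary lives:
export LLAMA_SERVER_PATH=/path/to/llama-server
# If the lookup still fails, ensure the file carries the executable bit:
# chmod +x /path/to/llama-server
```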
Server Fails to Start¶
Symptoms: Server process starts but immediately exits or hangs.
Fixes:
- Check the server log output for error messages
- Ensure the model file exists and is readable
- Verify sufficient VRAM is available with `nvidia-smi`
- Try starting the server with minimal options
- Check whether another process is using the default port (8080); use a different port if needed
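The file, VRAM, and port checks can be run as shell one-liners (the model path is a placeholder; `lsof` may require elevated privileges):

```shell
ls -lh /path/to/model.gguf                                    # model exists and is readable
nvidia-smi --query-gpu=memory.used,memory.total --format=csv  # free VRAM per GPU
lsof -i :8080                                                 # who owns the default port
```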
Server Responds with Errors¶
Symptoms: Server starts but returns HTTP 500 or empty responses.
Fixes:
- Wait for model loading to complete before sending requests (the engine handles this automatically, but manual server use requires patience)
- Check server health via the server's `/health` endpoint
- Ensure the model format is correct (must be GGUF)
- Try reducing the context size if the model is too large for available VRAM
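For instance, assuming the default port of 8080, the health probe looks like:

```shell
# Returns an OK status once the model has finished loading; a connection
# refusal or error means the server is still loading or has exited.
curl http://localhost:8080/health
```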
Out of Memory (OOM) Errors¶
GPU OOM During Inference¶
Symptoms: CUDA out of memory errors during model loading or inference.
Fixes:
- Use a more aggressively quantized model (e.g. Q4_K_M instead of Q8_0)
- Reduce the context length
- Reduce the batch size
- Offload some layers to the CPU
- Free unused GPU memory
- Monitor VRAM usage with `nvidia-smi`
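Assuming llamatelemetry forwards options to llama-server, the context, batch, and offload knobs map onto standard llama.cpp server flags, sketched here with illustrative values:

```shell
# Smaller context, smaller batch, and partial CPU offload (llama.cpp flag names):
llama-server -m model-Q4_K_M.gguf --ctx-size 2048 --batch-size 256 --n-gpu-layers 20
# Watch VRAM while the server runs:
nvidia-smi --query-gpu=memory.used,memory.total --format=csv
```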
CPU OOM During Export¶
Symptoms: Process killed or MemoryError during GGUF export.
Fixes:
- LoRA adapter merging temporarily doubles memory usage. Use `load_in_4bit=True` to reduce the base footprint.
- Close other memory-intensive applications.
- On Kaggle, the notebook runtime has limited RAM. Consider exporting on a local machine with more memory.
Tesla T4 VRAM Guidelines¶
| Model Size | Quantization | Approx. VRAM | Fits on T4 (16 GB)? |
|---|---|---|---|
| 1B | Q4_K_M | ~0.8 GB | Yes |
| 3B | Q4_K_M | ~2.2 GB | Yes |
| 7B | Q4_K_M | ~4.1 GB | Yes |
| 7B | Q8_0 | ~7.2 GB | Yes |
| 13B | Q4_K_M | ~7.4 GB | Yes (tight) |
| 13B | Q8_0 | ~14 GB | Barely |
| 70B | Q4_K_M | ~38 GB | No (need multi-GPU) |
Multi-GPU Issues¶
GPUs Not Detected¶
Symptoms: `MultiGPUConfig` shows fewer GPUs than expected.
Fixes:
- Verify all GPUs are visible: `nvidia-smi --list-gpus`
- Check that `CUDA_VISIBLE_DEVICES` is not restricting visibility
- Ensure all GPUs have compatible drivers
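To rule out an accidental restriction, inspect and clear the variable (a quick sketch):

```shell
# Show the current restriction, if any:
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES-<unset>}"
# Remove the restriction for this shell session:
unset CUDA_VISIBLE_DEVICES
```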
Split-Mode Errors¶
Symptoms: Multi-GPU inference fails or produces garbage output.
Fixes:
- Ensure both GPUs have the same architecture (e.g., both Tesla T4)
- Check that NCCL can communicate between the GPUs
- Try layer split mode first (it is simpler than tensor split)
- On Kaggle dual-T4, use the recommended split-GPU session
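If llamatelemetry forwards llama.cpp's server flags (an assumption here), the two split modes look like this:

```shell
# Layer split: whole layers are assigned to each GPU (simpler, try this first):
llama-server -m model.gguf --split-mode layer
# Tensor split: each tensor is divided across GPUs, here 50/50 over two T4s:
llama-server -m model.gguf --split-mode row --tensor-split 1,1
```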
NCCL Communication Failures¶
Symptoms: Hangs or errors mentioning NCCL during multi-GPU operations.
Fixes:
- Set NCCL environment variables for debugging
- Ensure `LD_LIBRARY_PATH` includes the NCCL library directory
- Try disabling specific transports to isolate the issue
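These are standard NCCL environment variables (not specific to llamatelemetry); a typical debugging setup:

```shell
# Verbose NCCL logging:
export NCCL_DEBUG=INFO
# Disable transports one at a time to isolate the failing path:
export NCCL_P2P_DISABLE=1   # rule out GPU peer-to-peer
export NCCL_IB_DISABLE=1    # rule out InfiniBand
```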
Telemetry and OpenTelemetry Issues¶
OpenTelemetry Not Available¶
Symptoms:
- `setup_telemetry()` returns `(None, None)`
- Telemetry spans are not being collected
Fixes:
Install the OpenTelemetry packages:

```shell
pip install opentelemetry-api opentelemetry-sdk
pip install opentelemetry-exporter-otlp-proto-grpc
pip install opentelemetry-exporter-otlp-proto-http
```
OTLP Exporter Connection Refused¶
Symptoms: Telemetry data is not reaching your collector. Errors mentioning connection refused or timeout.
Fixes:
- Verify the collector is running and accessible
- Check the endpoint configuration
- If using Jaeger or Grafana, ensure their OTLP receivers are enabled
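For example, using the standard OTLP ports (4317 for gRPC, 4318 for HTTP) and the standard OpenTelemetry endpoint variable; the collector hostname is a placeholder:

```shell
# Is anything listening on the OTLP/HTTP port?
curl -sv http://localhost:4318/v1/traces -o /dev/null
# Point the SDK at a non-default collector:
export OTEL_EXPORTER_OTLP_ENDPOINT=http://my-collector:4318
```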
Missing Telemetry Attributes¶
Symptoms: Spans are created but `gen_ai.*` attributes are missing.
Fixes:
- Ensure you are using the instrumented inference methods (not raw HTTP calls)
- Check that the telemetry module was properly initialized before making inference calls
- Verify the OpenTelemetry SDK version is compatible (0.40+ recommended)
Model Download Issues¶
Registry Download Fails¶
Symptoms: Model download hangs, times out, or produces corrupted files.
Fixes:
- Verify internet connectivity in your runtime environment
- Provide a local GGUF path instead of a model name
- For gated models, set your HuggingFace token
- If the download was interrupted, delete the partial file and retry
- Check available disk space -- GGUF files can be several gigabytes
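The token can be supplied either through the standard `HF_TOKEN` environment variable read by huggingface_hub, or via an interactive login (the token value is a placeholder):

```shell
# Option 1: environment variable:
export HF_TOKEN=hf_your_token_here
# Option 2: interactive login, which stores the token on disk:
huggingface-cli login
```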
Bootstrap Binary Download Fails¶
Symptoms: The first `import llamatelemetry` fails to download the llama-server binary (~961 MB).
Fixes:
- Ensure you have a stable internet connection and sufficient disk space
- Check if a proxy or firewall is blocking the download
- Set a custom download location
- Download the binary manually and set `LLAMA_SERVER_PATH`
Missing Shared Libraries¶
Symptoms: `llama-server` or the C++ extension fails with errors about missing `.so` files (e.g., `libnccl.so`, `libcublas.so`).
Fixes:
- Ensure `LD_LIBRARY_PATH` includes the llamatelemetry lib directory
- Re-import llamatelemetry to re-run the bootstrap
- Verify the CUDA libraries are on the library path
- On Kaggle, the CUDA libraries are in `/usr/local/cuda/lib64` -- ensure this directory is on your path
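A sketch of the library-path fix (the Kaggle CUDA location is the one named above; the llamatelemetry lib directory depends on where the package is installed):

```shell
# Prepend the CUDA library directory to the loader path:
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:${LD_LIBRARY_PATH}
# Confirm the loader can resolve the CUDA runtime libraries:
ldconfig -p | grep -E 'libcublas|libcudart' || echo "CUDA libraries not found"
```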
Kaggle-Specific Issues¶
Accelerator Not Set¶
Symptoms: No GPU detected in the Kaggle notebook.
Fix: Go to Settings (right sidebar) and set Accelerator to GPU T4 x2.
Package Installation Order¶
Symptoms: Import errors or version conflicts after installing packages.
Best Practice: Install all packages in a single cell at the top of the notebook:
Avoid running pip install in multiple cells, as this can cause dependency resolution issues.
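A single install cell might look like this (package set drawn from this guide; pin versions if you need reproducibility):

```shell
pip install llamatelemetry numpy requests huggingface_hub tqdm \
    opentelemetry-api opentelemetry-sdk \
    opentelemetry-exporter-otlp-proto-grpc
```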
Disk Space Limits¶
Symptoms: Downloads fail or the kernel crashes due to insufficient disk space.
Fixes:
- Kaggle provides ~20 GB of disk space; large models and binaries can exhaust this
- Use smaller quantized models (Q4_K_M instead of Q8_0)
- Clean up temporary files
- Add model files as Kaggle datasets rather than downloading them each run
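For example (the cache location is huggingface_hub's default; check sizes with `du` before deleting anything):

```shell
df -h /                                        # how much space is left
du -sh ~/.cache/huggingface 2>/dev/null || true  # size of the Hugging Face cache
rm -rf ~/.cache/huggingface/hub                # drop cached downloads you no longer need
```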
Session Timeout¶
Symptoms: Long-running export or inference tasks are interrupted.
Fix: Kaggle sessions time out after extended idle periods. Keep the notebook active or break long tasks into smaller cells with intermediate saves.
Common Error Messages¶
| Error | Cause | Fix |
|---|---|---|
| `RuntimeError: CUDA out of memory` | Model too large for VRAM | Use a smaller model or more aggressive quantization |
| `ConnectionRefusedError: [Errno 111]` | Server not running | Start the server or check the port |
| `FileNotFoundError: llama-server` | Binary not found | Set `LLAMA_SERVER_PATH` |
| `ImportError: No module named 'unsloth'` | Unsloth not installed | `pip install unsloth` |
| `ImportError: No module named 'triton'` | Triton not installed | `pip install triton` |
| `json.JSONDecodeError` | Malformed server response | Check server health, restart if needed |
| `TimeoutError` | Server not responding | Increase the timeout or check server load |
| `Warning: CUDA not available` | No GPU or wrong PyTorch build | Install CUDA-enabled PyTorch |
Diagnostic Checklist¶
When reporting an issue, gather the following information:
```python
import sys
import torch

print(f"Python: {sys.version}")
print(f"PyTorch: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"GPU: {torch.cuda.get_device_name()}")
    print(f"GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

import llamatelemetry
print(f"llamatelemetry version: {llamatelemetry.__version__}")
```
Getting Help¶
If the troubleshooting steps above do not resolve your issue:
- Check the API Reference for detailed parameter documentation
- Review the Notebook Hub for working examples
- Inspect the `tests/` directory in the source repository for runnable verification patterns
- Search existing issues on GitHub
- File a new issue with the diagnostic checklist output above