Troubleshooting

Common issues and solutions for llamatelemetry v0.1.0.


Installation Issues

Binary Download Fails

Symptom: import llamatelemetry raises an error, or the binary download times out

Solutions:

  1. Clear cache and retry:

    import shutil
    shutil.rmtree("/root/.cache/llamatelemetry/", ignore_errors=True)
    
    !pip install -q --no-cache-dir --force-reinstall \
        git+https://github.com/llamatelemetry/llamatelemetry.git@v0.1.0
    

  2. Check internet connection:

    !ping -c 3 huggingface.co
    

  3. Download the binary manually from the HuggingFace Hub (a sketch follows below)
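
If the automatic download keeps failing, a manual fallback is to fetch the binary yourself and place it in the cache directory used above. This is only a sketch: the repo_id and filename below are hypothetical placeholders, not confirmed release assets for llamatelemetry v0.1.0.

from huggingface_hub import hf_hub_download

# Hypothetical repo_id/filename -- substitute the actual release assets
binary_path = hf_hub_download(
    repo_id="llamatelemetry/binaries",
    filename="llama-server-cuda.tar.gz",
    local_dir="/root/.cache/llamatelemetry",
)
print(binary_path)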

CUDA Not Available

Symptom: check_cuda_available() returns False

Solutions:

  1. Verify CUDA installation:

    !nvcc --version
    !nvidia-smi
    

  2. Check Kaggle accelerator settings:

     • Settings → Accelerator → GPU T4 x2
     • Restart the notebook session

  3. Verify GPU detection:

    from llamatelemetry.api.multigpu import gpu_count
    print(f"GPUs: {gpu_count()}")
    

Version Conflicts

Symptom: ImportError or version mismatch warnings

Solution: Force reinstall with no cache:

!pip install -q --no-cache-dir --force-reinstall \
    git+https://github.com/llamatelemetry/llamatelemetry.git@v0.1.0

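After reinstalling, it can help to confirm which build is actually imported (this guide targets v0.1.0):

import llamatelemetry
print(llamatelemetry.__version__)  # expected: 0.1.0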

Server Issues

Server Won't Start

Symptom: start_server() fails or hangs

Solutions:

  1. Check if port 8080 is in use:

    import socket
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    result = sock.connect_ex(('127.0.0.1', 8080))
    print("Port in use" if result == 0 else "Port free")
    sock.close()
    

  2. Use different port:

    server.start_server(
        model_path=model_path,
        port=8081,  # Different port
    )
    

  3. Check model file exists:

    import os
    print(os.path.exists(model_path))
    

Out of Memory (OOM)

Symptom: CUDA out of memory error

Solutions:

  1. Use smaller model:

    # Instead of the 4B model, load the 1B variant
    from huggingface_hub import hf_hub_download

    model_path = hf_hub_download(
        repo_id="unsloth/gemma-3-1b-it-GGUF",  # 1B model
        filename="gemma-3-1b-it-Q4_K_M.gguf",
    )
    

  2. Reduce context size:

    server.start_server(
        model_path=model_path,
        ctx_size=4096,  # Reduce from 8192
    )
    

  3. Check VRAM usage:

    !nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv
    

Slow Inference

Symptom: Low tokens/sec throughput

Solutions:

  1. Enable FlashAttention:

    server.start_server(
        model_path=model_path,
        flash_attn=1,  # Enable
    )
    

  2. Use GPU 0 only:

    server.start_server(
        model_path=model_path,
        tensor_split="1.0,0.0",  # GPU 0 only
    )
    

  3. Verify GPU utilization:

    !nvidia-smi --query-gpu=index,utilization.gpu --format=csv
    
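To quantify the problem before and after these changes, a rough throughput check is sketched below. It assumes the client from the earlier sections and an OpenAI-style usage object on the response; adjust the field names to whatever the client actually returns.

import time

t0 = time.time()
response = client.chat.create(
    messages=[{"role": "user", "content": "Write one sentence about GPUs."}],
    max_tokens=100,
)
elapsed = time.time() - t0

# Assumption: the response exposes OpenAI-style token counts; adapt if it differs.
generated = response.usage.completion_tokens
print(f"{generated / elapsed:.1f} tokens/sec over {elapsed:.1f}s")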


OpenTelemetry Issues

No Spans Captured

Symptom: get_finished_spans() returns empty list

Solutions:

  1. Attach exporter BEFORE making requests:

    # Set up telemetry FIRST
    from llamatelemetry.telemetry import setup_telemetry
    tracer, meter = setup_telemetry(service_name="llm")
    
    # THEN make requests
    with tracer.start_as_current_span("request"):
        response = client.chat.create(...)
    

  2. Use SimpleSpanProcessor for testing:

    from opentelemetry import trace
    from opentelemetry.sdk.trace.export import SimpleSpanProcessor
    from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter

    memory_exporter = InMemorySpanExporter()
    processor = SimpleSpanProcessor(memory_exporter)
    trace.get_tracer_provider().add_span_processor(processor)
    

  3. Check spans were actually created:

    spans = memory_exporter.get_finished_spans()
    print(f"Captured {len(spans)} spans")
    

OTLP Export Fails with 404

Symptom: OTLP exporter returns 404 error

Solution: A 404 is an HTTP status, so it usually means the OTLP/HTTP exporter is missing the /v1/traces path (or is pointed at the wrong port). Use the HTTP exporter with an explicit path:

from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

exporter = OTLPSpanExporter(
    endpoint="http://localhost:4318/v1/traces",  # OTLP/HTTP port with explicit /v1/traces
)

If you use the gRPC exporter instead, point it at port 4317 with no URL path (e.g. endpoint="http://localhost:4317").

Metrics Not Recorded

Symptom: GPU metrics not appearing

Solutions:

  1. Verify PyNVML is working:

    import pynvml
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    info = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"VRAM: {info.used / 1024**2:.0f} MB")
    

  2. Check meter is properly configured:

    from llamatelemetry.telemetry import setup_telemetry
    
    tracer, meter = setup_telemetry(service_name="llm")
    print(f"Meter: {meter}")
    


Graphistry Issues

Connection Fails

Symptom: Graphistry plot fails to render

Solutions:

  1. Verify registration:

    import graphistry
    graphistry.register(
        api=3,
        protocol="https",
        server="hub.graphistry.com",
        token="YOUR_TOKEN"  # Get from graphistry.com
    )
    

  2. Check internet connection:

    !ping -c 3 hub.graphistry.com
    

  3. Use GPU 1 explicitly:

    import os
    os.environ['CUDA_VISIBLE_DEVICES'] = '1'
    
    # Then import graphistry
    import graphistry
    

No Edges Error

Symptom: "DataFrame has no edges" error

Solution: Verify spans have parent-child relationships:

spans = memory_exporter.get_finished_spans()
edges = [(s.parent.span_id, s.context.span_id) for s in spans if s.parent]
print(f"Found {len(edges)} edges")

if len(edges) == 0:
    print("No parent-child relationships found!")

Plot Doesn't Render

Symptom: Graphistry returns URL but doesn't display

Solutions:

  1. Open URL manually:

    g = TracesGraphistry(spans=spans)
    url = g.plot(render=False)
    print(f"Open: {url}")
    

  2. Check whether Kaggle allows iframes:

     • Graphistry plots may not render inline in Kaggle notebooks
     • Copy the URL and open it in a new browser tab (or display a clickable link, as in the sketch below)
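
If printing the raw URL is inconvenient, a small sketch that renders a clickable link with IPython's display utilities (available in Kaggle notebooks):

from IPython.display import HTML, display

g = TracesGraphistry(spans=spans)
url = g.plot(render=False)
display(HTML(f'<a href="{url}" target="_blank">Open Graphistry visualization</a>'))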

Model Issues

Model Download Fails

Symptom: HuggingFace download timeout or error

Solutions:

  1. Check internet connection:

    !ping -c 3 huggingface.co
    

  2. Use explicit token:

    from huggingface_hub import login
    login(token="YOUR_HF_TOKEN")
    

  3. Try different model repo:

    # If the unsloth repo fails, try bartowski
    from huggingface_hub import hf_hub_download

    model_path = hf_hub_download(
        repo_id="bartowski/gemma-3-4b-it-GGUF",
        filename="gemma-3-4b-it-Q4_K_M.gguf",
    )
    

Wrong Quantization

Symptom: Model quality is poor

Solution: Check quantization level:

  • Q2/Q3: Very low quality, not recommended
  • Q4: Good balance (recommended)
  • Q5/Q6: Better quality, slower
  • Q8: Best quality, much slower

Use Q4_K_M for best balance:

filename="model-Q4_K_M.gguf"  # Recommended
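
If you are unsure which quantization files a repository actually ships, you can list them before downloading; the sketch below uses the bartowski repo mentioned above and the standard huggingface_hub API:

from huggingface_hub import list_repo_files

files = list_repo_files("bartowski/gemma-3-4b-it-GGUF")
print([f for f in files if "Q4_K_M" in f])  # available Q4_K_M variants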


Performance Issues

Low Tokens/sec

Symptom: Inference is slower than expected

Solutions:

  1. Enable FlashAttention:

    server.start_server(model_path=path, flash_attn=1)
    

  2. Use dedicated GPU (tensor_split):

    server.start_server(model_path=path, tensor_split="1.0,0.0")
    

  3. Reduce batch size for streaming:

    server.start_server(
        model_path=path,
        batch_size=512,
        ubatch_size=128,  # Smaller for streaming
    )
    

  4. Check GPU utilization:

    # In a terminal (continuous monitoring):
    #   watch -n 1 nvidia-smi
    # Or from a notebook cell:
    !nvidia-smi --query-gpu=index,utilization.gpu --format=csv
    

High Memory Usage

Symptom: Running out of VRAM

Solutions:

  1. Reduce context size:

    server.start_server(model_path=path, ctx_size=4096)
    

  2. Use more aggressive quantization:

    filename="model-Q4_0.gguf"  # Instead of Q5_K_M
    

  3. Offload some layers to CPU:

    server.start_server(
        model_path=path,
        gpu_layers=20,  # Only 20 layers on GPU
    )
    


Client Issues

Connection Refused

Symptom: Client can't connect to server

Solutions:

  1. Verify server is running:

    import requests
    response = requests.get("http://127.0.0.1:8080/health")
    print(response.status_code)  # Should be 200
    

  2. Check correct port:

    client = LlamaCppClient(base_url="http://127.0.0.1:8080")  # Default port
    

  3. Wait for server to be ready:

    import time

    server.start_server(model_path=path)
    time.sleep(10)  # Wait for server initialization (see the polling sketch below for a more robust check)
    
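A fixed sleep can be flaky on slow model loads. A more robust sketch polls the /health endpoint from step 1 until the server answers (the timing values are arbitrary):

import time
import requests

server.start_server(model_path=path)

for _ in range(60):  # wait up to ~60 seconds
    try:
        if requests.get("http://127.0.0.1:8080/health", timeout=1).status_code == 200:
            print("Server is ready")
            break
    except requests.ConnectionError:
        pass  # server not accepting connections yet
    time.sleep(1)
else:
    raise RuntimeError("Server did not become ready in time")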

Timeout Errors

Symptom: Request timeout

Solutions:

  1. Increase timeout:

    response = client.chat.create(
        messages=[...],
        max_tokens=200,
        timeout=120,  # Longer timeout
    )
    

  2. Reduce max_tokens:

    response = client.chat.create(
        messages=[...],
        max_tokens=50,  # Fewer tokens
    )
    


Getting Help

If you're still stuck:

  1. Check logs:

     • Server logs in the notebook output
     • Python tracebacks

  2. Gather information:

    # System info
    !nvidia-smi
    !python --version

    import llamatelemetry
    print(llamatelemetry.__version__)

  3. Search existing issues:

     • GitHub Issues

  4. Ask for help:

     • GitHub Discussions
     • Include: error message, code snippet, system info

  5. Report a bug:

     • New Issue
     • Use the bug report template