Quickstart

This quickstart keeps to the most reliable path in the current SDK:

  1. import the package
  2. verify CUDA visibility
  3. create an InferenceEngine
  4. load a GGUF model
  5. run inference
  6. inspect basic metrics

For Kaggle-specific setup, use the separate Kaggle Quickstart.

1. Import and inspect the environment

import llamatelemetry as lt

print("version:", lt.__version__)
print("server path:", lt.get_llama_cpp_cuda_path())
print("cuda info:", lt.detect_cuda())

The exact printed output does not matter here. The goal is to confirm that the package imports, sees your runtime, and can find the llama-server path it expects to use.
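
If you want to fail fast instead of eyeballing the output, a small guard works. This sketch assumes get_llama_cpp_cuda_path() returns a filesystem path, or None when the binary cannot be found:

import os

import llamatelemetry as lt

# Assumption: get_llama_cpp_cuda_path() returns a path string, or None
# when the bundled llama-server binary is missing.
server_path = lt.get_llama_cpp_cuda_path()
if not server_path or not os.path.exists(server_path):
    raise RuntimeError("llama-server binary not found; check the install")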

2. Create an inference engine

import llamatelemetry as lt

engine = lt.InferenceEngine(
    server_url="http://127.0.0.1:8080",
    enable_telemetry=False,
)

InferenceEngine is the highest-level API in the current package.

3. Load a model

The SDK supports three practical model-loading patterns.

Option A: built-in registry name

engine.load_model(
    "gemma-3-1b-Q4_K_M",
    auto_start=True,
    auto_configure=True,
    verbose=True,
)

Option B: Hugging Face repo plus file

engine.load_model(
    "bartowski/gemma-2-2b-it-GGUF:gemma-2-2b-it-Q4_K_M.gguf",
    auto_start=True,
)

Option C: local GGUF file

engine.load_model(
    "/path/to/model.gguf",
    auto_start=True,
    gpu_layers=99,
    ctx_size=4096,
)

A safe progression: start with small or moderate GGUF models, confirm your runtime is stable, and only then move to larger models.
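
If you want to automate that choice, one option is to check free GPU memory before picking a model. A sketch using nvidia-smi; the 8000 MiB threshold is illustrative, not an SDK recommendation:

import subprocess

# Query free VRAM (MiB) on the first GPU via nvidia-smi.
out = subprocess.check_output(
    ["nvidia-smi", "--query-gpu=memory.free", "--format=csv,noheader,nounits"],
    text=True,
)
free_mib = int(out.splitlines()[0])

# Stay on the small registry model unless there is comfortable headroom.
if free_mib < 8000:
    engine.load_model("gemma-3-1b-Q4_K_M", auto_start=True)
else:
    engine.load_model(
        "bartowski/gemma-2-2b-it-GGUF:gemma-2-2b-it-Q4_K_M.gguf",
        auto_start=True,
    )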

4. Run one inference request

result = engine.infer(
    prompt="Explain what GGUF is in two sentences.",
    max_tokens=128,
    temperature=0.7,
    top_p=0.9,
    top_k=40,
)

print("success:", result.success)
print("text:", result.text)
print("tokens:", result.tokens_generated)
print("latency_ms:", result.latency_ms)
print("tokens_per_sec:", result.tokens_per_sec)

The InferResult object is one of the strongest parts of the current public API: it bundles the generated text and simple performance signals in one place.
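
In practice, guard on success before trusting the other fields. This uses only the fields shown above:

if result.success:
    print(result.text)
    print(f"{result.tokens_generated} tokens in {result.latency_ms:.0f} ms")
else:
    # On failure, re-running or checking server health (see step 7)
    # are reasonable first steps.
    print("inference failed")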

5. Run batch inference

The current SDK exposes batch_infer().

prompts = [
    "Define CUDA in one sentence.",
    "Define quantization in one sentence.",
    "Define observability in one sentence.",
]

results = engine.batch_infer(prompts, max_tokens=64)

for i, r in enumerate(results):
    print("---", i)
    print("success:", r.success)
    print("text:", r.text)

6. Inspect engine metrics

metrics = engine.get_metrics()
print(metrics)

This is the simplest way to see the in-process aggregate counters the engine has collected during your session.
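
The exact return type is not pinned down here; if it is a plain dict of counters, you can scan it like this:

# Assumption: get_metrics() returned a flat dict of counters.
if isinstance(metrics, dict):
    for name, value in sorted(metrics.items()):
        print(f"{name}: {value}")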

7. Check server-side endpoints when needed

If the underlying llama-server is running, ServerManager and the lower-level client APIs can query its health and metrics endpoints. A simple pattern is:

from llamatelemetry import ServerManager

manager = ServerManager(server_url="http://127.0.0.1:8080")
print(manager.check_server_health())
print(manager.get_health())
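
When the server is still starting up, a short polling loop is often more useful than a single check. A sketch that assumes check_server_health() returns a truthy value once the server is reachable:

import time

# Poll for up to ~30 seconds while llama-server comes up.
for _ in range(30):
    if manager.check_server_health():  # assumed truthy when healthy
        print("server is healthy")
        break
    time.sleep(1.0)
else:
    raise RuntimeError("llama-server did not become healthy in time")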

You can also query the Prometheus-style metrics text:

print(manager.get_metrics())
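
Assuming the output follows the standard Prometheus exposition format, plain string handling is enough to pick out individual samples:

text = manager.get_metrics()

# Prometheus exposition format: '# HELP' / '# TYPE' comment lines,
# then 'metric_name{labels} value' sample lines.
for line in text.splitlines():
    if line and not line.startswith("#"):
        name, _, value = line.partition(" ")
        print(name, "=", value)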

8. Clean up

engine.unload_model()

Or use a context manager:

import llamatelemetry as lt

with lt.InferenceEngine(enable_telemetry=False) as engine:
    engine.load_model("gemma-3-1b-Q4_K_M", auto_start=True)
    result = engine.infer("Hello from llamatelemetry.")
    print(result.text)

A realistic first workflow

For a first successful run, this sequence is usually enough:

import llamatelemetry as lt

engine = lt.InferenceEngine(enable_telemetry=False)
engine.load_model("gemma-3-1b-Q4_K_M", auto_start=True)
result = engine.infer("What does this SDK do?", max_tokens=96)
print(result.text)
print(engine.get_metrics())
engine.unload_model()
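
If you want the model unloaded even when inference raises, wrap the same sequence in try/finally (or use the context-manager form from step 8):

import llamatelemetry as lt

engine = lt.InferenceEngine(enable_telemetry=False)
engine.load_model("gemma-3-1b-Q4_K_M", auto_start=True)
try:
    result = engine.infer("What does this SDK do?", max_tokens=96)
    print(result.text)
finally:
    # Always release the model, even if infer() raised.
    engine.unload_model()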

What this page intentionally does not claim

This page avoids a few claims that were too broad in the earlier docs:

  • it does not claim token streaming as a stable top-level InferenceEngine API, because the current snapshot of the package does not expose stream_infer() on that class
  • it does not claim every advanced integration is equally validated
  • it does not assume all local machines behave like Kaggle dual-T4 notebooks

For telemetry-specific setup, continue to the Telemetry and Observability Guide.