Quickstart¶
This quickstart keeps to the most reliable path in the current SDK:
- import the package
- verify CUDA visibility
- create an InferenceEngine
- load a GGUF model
- run inference
- inspect basic metrics
For Kaggle-specific setup, use the separate Kaggle Quickstart.
1. Import and inspect the environment¶
```python
import llamatelemetry as lt

print("version:", lt.__version__)
print("server path:", lt.get_llama_cpp_cuda_path())
print("cuda info:", lt.detect_cuda())
```
The point here is not to match an exact printed format. The goal is simply to
confirm that the package imports, sees your runtime, and can find the
llama-server path it expects to use.
2. Create an inference engine¶
```python
import llamatelemetry as lt

engine = lt.InferenceEngine(
    server_url="http://127.0.0.1:8080",
    enable_telemetry=False,
)
```
InferenceEngine is the highest-level API in the current package.
3. Load a model¶
The SDK supports three practical model-loading patterns.
Option A: built-in registry name¶
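Pass a name from the built-in registry straight to load_model(); this is the same call used in the workflows later on this page:

```python
# "gemma-3-1b-Q4_K_M" is a name from the built-in model registry.
engine.load_model("gemma-3-1b-Q4_K_M", auto_start=True)
```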
Option B: Hugging Face repo plus file¶
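A minimal sketch, assuming load_model() accepts Hugging Face coordinates through repo_id and filename keyword arguments; these parameter names and the repo below are illustrative, so check your installed version's signature:

```python
# Hypothetical keyword arguments; verify against your SDK version.
engine.load_model(
    repo_id="ggml-org/gemma-3-1b-it-GGUF",  # example Hugging Face repo
    filename="gemma-3-1b-it-Q4_K_M.gguf",   # example GGUF file in that repo
    auto_start=True,
)
```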
Option C: local GGUF file¶
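A minimal sketch, assuming load_model() also accepts a local filesystem path to a .gguf file; the path below is a placeholder:

```python
# Placeholder path; point this at a GGUF file you have already downloaded.
engine.load_model("/models/gemma-3-1b-it-Q4_K_M.gguf", auto_start=True)
```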
Whichever option you choose, start with small or moderate GGUF models, confirm your runtime is stable, and only then move to larger models.
4. Run one inference request¶
```python
result = engine.infer(
    prompt="Explain what GGUF is in two sentences.",
    max_tokens=128,
    temperature=0.7,
    top_p=0.9,
    top_k=40,
)

print("success:", result.success)
print("text:", result.text)
print("tokens:", result.tokens_generated)
print("latency_ms:", result.latency_ms)
print("tokens_per_sec:", result.tokens_per_sec)
```
The InferResult object is one of the strongest parts of the current public API
because it gives you the generated text plus simple performance signals in one
place.
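In practice that means you can guard on result.success before reading the other fields. A small usage sketch, using the field names from the snippet above and assuming the metric fields are plain numbers:

```python
if result.success:
    print(f"{result.tokens_generated} tokens in {result.latency_ms:.0f} ms "
          f"({result.tokens_per_sec:.1f} tokens/sec)")
else:
    print("inference failed; check the engine and server logs")
```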
5. Run batch inference¶
The current SDK exposes batch_infer().
```python
prompts = [
    "Define CUDA in one sentence.",
    "Define quantization in one sentence.",
    "Define observability in one sentence.",
]

results = engine.batch_infer(prompts, max_tokens=64)

for i, r in enumerate(results):
    print("---", i)
    print("success:", r.success)
    print("text:", r.text)
```
6. Inspect engine metrics¶
Calling get_metrics() is the simplest way to see the in-process aggregate counters the engine has collected during your session:
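```python
# Aggregate in-process counters collected since the engine was created.
print(engine.get_metrics())
```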
7. Check server-side endpoints when needed¶
If the underlying llama-server is running, ServerManager and the lower-level
client APIs can query its health and metrics endpoints. A simple pattern is:
```python
from llamatelemetry import ServerManager

manager = ServerManager(server_url="http://127.0.0.1:8080")
print(manager.check_server_health())
print(manager.get_health())
```
You can also query the Prometheus-style metrics text directly. A minimal sketch, assuming the underlying llama-server was started with its /metrics endpoint enabled:
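```python
import requests

# Assumes llama-server is running locally with metrics enabled;
# /metrics follows llama.cpp's Prometheus text format.
resp = requests.get("http://127.0.0.1:8080/metrics", timeout=5)
print(resp.text)
```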
8. Clean up¶
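When you are finished, unload the model to free GPU memory; unload_model() is the same call used in the workflow at the end of this page:

```python
engine.unload_model()
```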
Or use a context manager:
```python
import llamatelemetry as lt

with lt.InferenceEngine(enable_telemetry=False) as engine:
    engine.load_model("gemma-3-1b-Q4_K_M", auto_start=True)
    result = engine.infer("Hello from llamatelemetry.")
    print(result.text)
```
A realistic first workflow¶
For a first successful run, this sequence is usually enough:
```python
import llamatelemetry as lt

engine = lt.InferenceEngine(enable_telemetry=False)
engine.load_model("gemma-3-1b-Q4_K_M", auto_start=True)

result = engine.infer("What does this SDK do?", max_tokens=96)
print(result.text)
print(engine.get_metrics())

engine.unload_model()
```
What this page intentionally does not claim¶
This page avoids a few claims that were too broad in the earlier docs:
- it does not claim token streaming as a stable top-level InferenceEngine API, because the uploaded snapshot does not expose stream_infer() on that class
- it does not claim every advanced integration is equally validated
- it does not assume all local machines behave like Kaggle dual-T4 notebooks
For telemetry-specific setup, continue to the Telemetry and Observability Guide.