Skip to content

Jupyter, Chat & Embeddings API Reference

llamatelemetry.jupyter provides JupyterLab-optimized features for interactive inference. llamatelemetry.chat provides OpenAI-compatible chat completion with conversation management. llamatelemetry.embeddings provides text embedding generation with caching, similarity search, and clustering.

from llamatelemetry.jupyter import (
    is_jupyter_available, check_dependencies,
    stream_generate, progress_generate,
    display_metrics, compare_temperatures, visualize_tokens,
    ChatWidget,
)
from llamatelemetry.chat import (
    Message, ChatEngine, ConversationManager,
)
from llamatelemetry.embeddings import (
    EmbeddingEngine, SemanticSearch, TextClustering,
    cosine_similarity, euclidean_distance, dot_product_similarity,
)

Jupyter Helpers

is_jupyter_available()

def is_jupyter_available() -> bool

Returns: True if running in a Jupyter/IPython environment.

check_dependencies()

def check_dependencies(require_widgets: bool = False) -> bool
Parameter Type Default Description
require_widgets bool False Whether to require ipywidgets

Returns: True if Jupyter and optional dependencies are available.


stream_generate()

def stream_generate(
    engine,
    prompt: str,
    max_tokens: int = 256,
    temperature: float = 0.7,
    show_timing: bool = True,
    markdown: bool = True,
    **kwargs,
) -> str
Parameter Type Default Description
engine InferenceEngine -- InferenceEngine instance
prompt str -- Input prompt
max_tokens int 256 Maximum tokens to generate
temperature float 0.7 Sampling temperature
show_timing bool True Display timing information after generation
markdown bool True Render output as markdown (vs. preformatted text)

Streams text generation with real-time IPython display updates. Falls back to non-streaming engine.infer() outside Jupyter.

Returns: Complete generated text.

from llamatelemetry.jupyter import stream_generate
text = stream_generate(engine, "Write a haiku about AI")

progress_generate()

def progress_generate(
    engine,
    prompts: List[str],
    max_tokens: int = 128,
    **kwargs,
) -> List[str]

Batch generation with tqdm progress bar. Falls back to print-based progress if tqdm is not installed.

Returns: List of generated texts (empty string for failed generations).


display_metrics()

def display_metrics(engine, as_dataframe: bool = True)

Displays performance metrics from engine.get_metrics() as a Pandas DataFrame or HTML table.


compare_temperatures()

def compare_temperatures(
    engine,
    prompt: str,
    temperatures: List[float] = [0.3, 0.7, 1.0, 1.5],
    max_tokens: int = 100,
) -> Dict[float, str]

Generates outputs at different temperature settings and displays them side-by-side.

Returns: Dictionary mapping temperature values to generated text.

results = compare_temperatures(engine, "The future of AI is", temperatures=[0.1, 0.7, 1.5])

visualize_tokens()

def visualize_tokens(text: str, engine=None)

Visualizes token boundaries in text. If engine is provided, uses the /tokenize endpoint to get actual token boundaries and displays them as styled HTML spans with token count.


ChatWidget

Interactive chat widget for JupyterLab with text input, send/clear buttons, conversation history, and model parameter sliders for temperature and max tokens.

ChatWidget(engine, system_prompt, max_tokens, temperature)

Parameter Type Default Description
engine InferenceEngine -- InferenceEngine instance
system_prompt Optional[str] None System prompt prepended to conversations
max_tokens int 256 Default max tokens (adjustable via slider)
temperature float 0.7 Default temperature (adjustable via slider)

Raises ImportError if ipywidgets is not installed.

ChatWidget.display()

def display(self) -> None

Renders the chat widget in the notebook output cell.

from llamatelemetry.jupyter import ChatWidget
chat = ChatWidget(engine, system_prompt="You are a helpful assistant.")
chat.display()

Message

Represents a single message in a conversation.

class Message:
    role: str                   # "system", "user", or "assistant"
    content: str                # Message text
    name: Optional[str]         # Optional sender name
    timestamp: float            # Unix timestamp (auto-set)

Message(role, content, name=None)

Parameter Type Default Description
role str -- Message role ("system", "user", "assistant")
content str -- Message content
name Optional[str] None Optional sender name

Message.to_dict()

def to_dict(self) -> Dict[str, str]

Returns: OpenAI-compatible message dict with role, content, and optionally name.


ChatEngine

Manages chat conversations with history, context window handling, and OpenAI-compatible chat completion support.

ChatEngine(engine, system_prompt, max_history, max_tokens, temperature)

Parameter Type Default Description
engine InferenceEngine -- InferenceEngine instance
system_prompt Optional[str] None System prompt (added as first message)
max_history int 20 Maximum messages to keep (trims non-system messages)
max_tokens int 256 Default max tokens
temperature float 0.7 Default temperature

ChatEngine.add_message() / add_system_message() / add_user_message() / add_assistant_message()

def add_message(self, role: str, content: str, name: Optional[str] = None) -> ChatEngine
def add_system_message(self, content: str) -> ChatEngine
def add_user_message(self, content: str) -> ChatEngine
def add_assistant_message(self, content: str) -> ChatEngine

All return self for method chaining. History is automatically trimmed when it exceeds max_history (system messages are always preserved).

ChatEngine.complete()

def complete(
    self,
    max_tokens: Optional[int] = None,
    temperature: Optional[float] = None,
    **kwargs,
) -> str

Generates a chat completion. Tries the OpenAI-compatible /v1/chat/completions endpoint first, then falls back to prompt-based completion via engine.infer(). Automatically adds the assistant's response to history.

Returns: Assistant's response text.

chat = ChatEngine(engine, system_prompt="You are helpful.")
chat.add_user_message("Explain photosynthesis")
response = chat.complete()
print(response)

ChatEngine.complete_stream()

def complete_stream(
    self,
    max_tokens: Optional[int] = None,
    temperature: Optional[float] = None,
    **kwargs,
) -> Iterator[str]

Streaming chat completion via SSE. Falls back to non-streaming if the streaming endpoint is unavailable.

Yields: Text chunks as they are generated.

for chunk in chat.complete_stream():
    print(chunk, end='', flush=True)

ChatEngine.clear_history()

def clear_history(self, keep_system: bool = True) -> ChatEngine

ChatEngine.get_history()

def get_history(self) -> List[Dict[str, str]]

Returns: List of message dicts in OpenAI format.

ChatEngine.save_history() / ChatEngine.load_history()

def save_history(self, filepath: str) -> None
def load_history(self, filepath: str) -> ChatEngine

Persists conversation to JSON including messages and metadata (max_history, max_tokens, temperature).

ChatEngine.count_tokens()

def count_tokens(self) -> int

Estimates token count for the current conversation. Uses the /tokenize endpoint if available, otherwise approximates at 1 token per 4 characters.


ConversationManager

Manages multiple named conversation sessions, allowing switching between contexts.

ConversationManager(engine)

Parameter Type Description
engine InferenceEngine Shared InferenceEngine instance

ConversationManager Methods

Method Signature Returns Description
create_conversation(name, system_prompt, **kwargs) str, Optional[str] ChatEngine Create a new conversation
switch_to(name) str ChatEngine Switch active conversation
get_current() -- ChatEngine Get active conversation
chat(message, **kwargs) str str Send message to active conversation
list_conversations() -- List[str] List all conversation names
delete_conversation(name) str None Delete a conversation
save_all(directory) str None Save all conversations to directory
load_all(directory) str None Load all conversations from directory
manager = ConversationManager(engine)
manager.create_conversation("coding", "You are a coding assistant.")
manager.create_conversation("writing", "You are a writing coach.")
manager.switch_to("coding")
response = manager.chat("How do I write a Python decorator?")
manager.save_all("conversations/")

EmbeddingEngine

Text embedding generation with caching and support for both OpenAI-compatible and native llama-server endpoints.

EmbeddingEngine(engine, pooling, normalize, cache_size)

Parameter Type Default Description
engine InferenceEngine -- InferenceEngine instance
pooling str "mean" Pooling strategy ("mean", "cls", "last")
normalize bool True Normalize embeddings to unit vectors
cache_size int 1000 Maximum embeddings to cache (FIFO eviction)

EmbeddingEngine.embed()

def embed(self, text: str, use_cache: bool = True) -> np.ndarray

Generates embedding for a single text. Tries /v1/embeddings first, then /embedding.

Returns: 1D numpy float32 array.

EmbeddingEngine.embed_batch()

def embed_batch(
    self,
    texts: List[str],
    use_cache: bool = True,
    show_progress: bool = False,
) -> np.ndarray

Returns: 2D numpy array of shape (n_texts, embedding_dim).

EmbeddingEngine Cache Methods

Method Returns Description
clear_cache() None Clear cache and reset stats
get_cache_stats() Dict Returns cache_size, cache_max, hits, misses, hit_rate
save_cache(filepath) None Persist cache to JSON
load_cache(filepath) None Load cache from JSON
embedder = EmbeddingEngine(engine, normalize=True)
vec = embedder.embed("Hello world")
print(f"Dimension: {len(vec)}")
print(embedder.get_cache_stats())

Similarity Functions

cosine_similarity()

def cosine_similarity(vec1: np.ndarray, vec2: np.ndarray) -> float

Returns: Cosine similarity (0 to 1 for normalized vectors).

euclidean_distance()

def euclidean_distance(vec1: np.ndarray, vec2: np.ndarray) -> float

Returns: Euclidean distance between vectors.

dot_product_similarity()

def dot_product_similarity(vec1: np.ndarray, vec2: np.ndarray) -> float

Returns: Dot product (equivalent to cosine similarity for unit vectors).


SemanticSearch

Vector similarity search over a document index.

SemanticSearch(embedder)

Parameter Type Description
embedder EmbeddingEngine Embedding engine for vectorization

SemanticSearch.add_documents()

def add_documents(
    self,
    documents: List[str],
    metadata: Optional[List[Dict[str, Any]]] = None,
    show_progress: bool = False,
)

Embeds documents and adds them to the search index. Can be called multiple times to incrementally build the index.

SemanticSearch.search()

def search(
    self,
    query: str,
    top_k: int = 5,
    similarity_fn: str = "cosine",
) -> List[Tuple[str, float, Dict[str, Any]]]
Parameter Type Default Description
query str -- Search query text
top_k int 5 Number of results
similarity_fn str "cosine" "cosine", "dot", or "euclidean"

Returns: List of (document_text, score, metadata) tuples, sorted by descending similarity.

search = SemanticSearch(embedder)
search.add_documents([
    "Python is a programming language",
    "Machine learning uses neural networks",
    "Natural language processing handles text",
])
results = search.search("What is NLP?", top_k=2)
for doc, score, meta in results:
    print(f"{score:.3f}: {doc}")

SemanticSearch.save_index() / load_index() / clear_index()

def save_index(self, filepath: str) -> None
def load_index(self, filepath: str) -> None
def clear_index(self) -> None

TextClustering

K-means text clustering using embeddings.

TextClustering(embedder, n_clusters)

Parameter Type Default Description
embedder EmbeddingEngine -- Embedding engine
n_clusters int 5 Number of clusters

Requires scikit-learn.

TextClustering.fit()

def fit(self, texts: List[str], show_progress: bool = False) -> np.ndarray

Returns: Cluster label array of shape (n_texts,).

TextClustering.get_clusters()

def get_clusters(self, texts: List[str], labels: np.ndarray) -> Dict[int, List[str]]

Returns: Dict mapping cluster ID to list of texts in that cluster.

TextClustering.predict()

def predict(self, texts: List[str]) -> np.ndarray

Assigns new texts to nearest cluster centers. Raises RuntimeError if fit() has not been called.

clustering = TextClustering(embedder, n_clusters=3)
labels = clustering.fit(texts)
clusters = clustering.get_clusters(texts, labels)
for cluster_id, docs in clusters.items():
    print(f"Cluster {cluster_id}: {len(docs)} documents")