CUDA & Inference API Reference

llamatelemetry.cuda provides low-level CUDA optimizations for Tesla T4 inference, including CUDA Graphs, Triton custom kernels, and Tensor Core utilities. llamatelemetry.inference provides advanced inference capabilities including FlashAttention integration, KV-cache optimization, and batch inference strategies.

from llamatelemetry.cuda import (
    CUDAGraph, GraphPool, capture_graph, replay_graph, enable_cuda_graphs,
    TritonKernel, register_kernel, get_kernel, list_kernels,
    TensorCoreConfig, enable_tensor_cores, matmul_tensor_core, check_tensor_core_support,
)
from llamatelemetry.inference import (
    FlashAttentionConfig, enable_flash_attention, flash_attention_forward,
    KVCache, KVCacheConfig, PagedKVCache, optimize_kv_cache,
    BatchInferenceOptimizer, ContinuousBatching, batch_inference_optimized,
)

CUDAGraph

CUDA Graph wrapper for PyTorch operations. Captures a sequence of CUDA operations and replays them with minimal CPU overhead, providing 20-40% latency reduction for small batch sizes on Tesla T4.

CUDAGraph(config=None)

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| config | Optional[GraphCaptureConfig] | None | Capture configuration |

@dataclass
class GraphCaptureConfig:
    pool: Optional[str] = None
    capture_error_mode: str = "thread_local"
    warmup_iters: int = 3

CUDAGraph.capture()

def capture(
    self,
    func: Optional[Callable] = None,
    inputs: Optional[Dict[str, torch.Tensor]] = None,
    warmup: bool = True,
) -> Any

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| func | Optional[Callable] | None | Function to capture (use context manager if None) |
| inputs | Optional[Dict[str, Tensor]] | None | Static input tensors |
| warmup | bool | True | Run warmup iterations before capture |

Can be used as a context manager or with an explicit function.

# Context manager style
graph = CUDAGraph()
with graph.capture():
    output = model(input_tensor)

# Function style
graph = CUDAGraph()
graph.capture(lambda: model(input_tensor))

CUDAGraph.replay()

def replay(self) -> Any

Replays the captured graph. Raises RuntimeError if the graph has not been captured.

Returns: Outputs from graph replay (tensor, dict, or tuple depending on capture).

for _ in range(1000):
    output = graph.replay()  # Minimal CPU overhead
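
A replay re-runs the recorded kernels against the same memory, so new inputs generally have to be copied into the tensors that were used at capture time. The sketch below emulates that contract in pure Python, with a list standing in for a static tensor. `StaticReplay` is a hypothetical stand-in for illustration, not the library's `CUDAGraph` implementation.

```python
from typing import Callable, List, Optional


class StaticReplay:
    """Hypothetical pure-Python stand-in for a captured graph."""

    def __init__(self) -> None:
        self._func: Optional[Callable[[], List[float]]] = None

    def capture(self, func: Callable[[], List[float]]) -> None:
        # A real capture records GPU kernels; here we only keep the callable,
        # which closes over a fixed ("static") input buffer.
        self._func = func

    def replay(self) -> List[float]:
        if self._func is None:
            raise RuntimeError("graph has not been captured")
        return self._func()


static_input = [1.0, 2.0, 3.0]         # plays the role of a static input tensor
graph = StaticReplay()
graph.capture(lambda: [2.0 * v for v in static_input])

first = graph.replay()                 # reads the original contents
static_input[:] = [10.0, 20.0, 30.0]   # in-place copy, analogous to tensor.copy_()
second = graph.replay()                # same buffer, new contents
```

Rebinding `static_input` to a new list instead of copying in place would leave the replay reading the old buffer, which is the usual pitfall with captured graphs.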

CUDAGraph.is_captured() / CUDAGraph.reset()

def is_captured(self) -> bool
def reset(self) -> None

GraphPool

Pool of CUDA graphs for managing multiple graph instances with different operations or input shapes.

GraphPool Methods

| Method | Argument types | Returns | Description |
| --- | --- | --- | --- |
| capture(name, func, inputs, config) | str, Callable, ... | str | Capture and store a named graph |
| replay(name) | str | Any | Replay graph by name |
| get(name) | str | CUDAGraph | Get graph by name |
| remove(name) | str | None | Remove and reset graph |
| clear() | -- | None | Clear all graphs |
| list_graphs() | -- | List[str] | List all graph names |

pool = GraphPool()
pool.capture("forward", lambda: model(x))
pool.capture("backward", lambda: loss.backward())
output = pool.replay("forward")

capture_graph() / replay_graph() / enable_cuda_graphs()

def capture_graph(
    func: Callable,
    inputs: Optional[Dict[str, torch.Tensor]] = None,
    warmup_iters: int = 3,
) -> CUDAGraph

def replay_graph(graph: CUDAGraph) -> Any

def enable_cuda_graphs(model: torch.nn.Module) -> torch.nn.Module

capture_graph is a convenience function that creates a CUDAGraph, captures the function, and returns it. enable_cuda_graphs wraps a model to use CUDA graphs when beneficial.

graph = capture_graph(lambda: model(x), warmup_iters=3)
for _ in range(1000):
    output = graph.replay()

TritonKernel

Wrapper for Triton JIT-compiled kernels with automatic grid computation and configuration.

TritonKernel(name, kernel_func, config)

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| name | str | -- | Kernel name |
| kernel_func | Optional[Callable] | None | Triton @triton.jit function |
| config | Optional[KernelConfig] | None | Kernel configuration |

@dataclass
class KernelConfig:
    name: str
    block_size: int = 128
    num_warps: int = 4
    num_stages: int = 2

TritonKernel.launch()

def launch(self, *args, grid: Optional[Tuple[int, ...]] = None, **kwargs)

Launches the kernel. If grid is None, the grid is computed automatically from the size of the first tensor argument.
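
The source does not spell out how the automatic grid is derived. A minimal sketch, assuming a one-dimensional grid obtained by ceiling division of the element count by `block_size` (the same arithmetic as `triton.cdiv`); `auto_grid` is a hypothetical helper name:

```python
from typing import Tuple


def auto_grid(n_elements: int, block_size: int = 128) -> Tuple[int]:
    """Number of 1-D program instances needed to cover n_elements (assumed scheme)."""
    if n_elements < 0 or block_size <= 0:
        raise ValueError("n_elements must be >= 0 and block_size must be > 0")
    # Ceiling division: the last block may be only partially filled,
    # so the kernel still needs a bounds mask on its loads and stores.
    return ((n_elements + block_size - 1) // block_size,)


exact = auto_grid(1024)    # 1024 elements fit exactly into 8 blocks of 128
ragged = auto_grid(1025)   # one extra element needs a ninth block
```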

Kernel Registry Functions

def register_kernel(name: str, kernel_func: Callable, config: Optional[KernelConfig] = None) -> TritonKernel
def get_kernel(name: str) -> Optional[TritonKernel]
def list_kernels() -> List[str]

Built-in Kernels (Tesla T4 optimized)

When Triton is available, three kernels are auto-registered:

| Kernel | Name | Description |
| --- | --- | --- |
| add_kernel | "add" | Optimized element-wise addition |
| fused_layernorm_kernel | "layernorm" | Fused LayerNorm with RMS normalization |
| softmax_kernel | "softmax" | Numerically stable softmax |
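
"Numerically stable" softmax usually means subtracting the row maximum before exponentiating. The pure-Python sketch below shows that technique on a single row; it is a reference for the arithmetic, not the Triton kernel itself, and `stable_softmax` is a hypothetical name.

```python
import math
from typing import List


def stable_softmax(row: List[float]) -> List[float]:
    # Subtracting the row maximum leaves the result mathematically unchanged
    # but keeps every exponent <= 0, so exp() cannot overflow.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


# A naive exp(1000.0) would overflow a float; the shifted version does not.
probs = stable_softmax([1000.0, 1001.0, 1002.0])
uniform = stable_softmax([0.0, 0.0])
```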

High-level Kernel Functions

def triton_add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor
def triton_layernorm(x: torch.Tensor, weight: torch.Tensor, bias: torch.Tensor, eps: float = 1e-5) -> torch.Tensor
def triton_softmax(x: torch.Tensor, dim: int = -1) -> torch.Tensor

All functions fall back to PyTorch equivalents when Triton is not available.

a = torch.randn(1024, device='cuda')
b = torch.randn(1024, device='cuda')
c = triton_add(a, b)  # Uses Triton kernel if available

TensorCoreConfig

Configuration for Tensor Core operations on Tesla T4 (SM 7.5, Turing architecture).

@dataclass
class TensorCoreConfig:
    enabled: bool = True
    dtype: torch.dtype = torch.float16
    allow_tf32: bool = True
    allow_fp16: bool = True

check_tensor_core_support()

def check_tensor_core_support(device: int = 0) -> bool

Returns: True if device supports Tensor Cores (SM >= 7.0). Tesla T4 is SM 7.5.
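
The check reduces to a comparison on the compute capability tuple. A minimal sketch of that rule, taking the tuple as an argument so it runs without a GPU; `supports_tensor_cores` is a hypothetical name, and the library's own function takes a device index instead:

```python
from typing import Tuple


def supports_tensor_cores(capability: Tuple[int, int]) -> bool:
    """True for compute capability 7.0 (Volta) and newer."""
    # Tuples compare element by element, so (7, 5) >= (7, 0) and (6, 1) < (7, 0).
    return capability >= (7, 0)


# With PyTorch, the tuple would come from torch.cuda.get_device_capability(device).
t4 = supports_tensor_cores((7, 5))     # Tesla T4, Turing
p100 = supports_tensor_cores((6, 0))   # Tesla P100, Pascal
```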

enable_tensor_cores()

def enable_tensor_cores(
    dtype: torch.dtype = torch.float16,
    allow_tf32: bool = True,
) -> TensorCoreConfig

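Enables Tensor Core optimizations globally by setting torch.backends.cuda.matmul.allow_tf32 and enabling cuDNN benchmark mode. TF32 is only available on Ampere (SM 8.0) and newer GPUs, so allow_tf32 has no effect on a Tesla T4; FP16 is the relevant Tensor Core path there.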

Returns: TensorCoreConfig with applied settings.

matmul_tensor_core()

def matmul_tensor_core(
    A: torch.Tensor,
    B: torch.Tensor,
    out: Optional[torch.Tensor] = None,
    dtype: torch.dtype = torch.float16,
) -> torch.Tensor

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| A | torch.Tensor | -- | First matrix [M, K] |
| B | torch.Tensor | -- | Second matrix [K, N] |
| out | Optional[torch.Tensor] | None | Optional output tensor |
| dtype | torch.dtype | torch.float16 | Computation dtype for Tensor Cores |

Casts both inputs to dtype (FP16 by default), performs the matmul on Tensor Cores, and casts the result back to the inputs' original dtype.

A = torch.randn(1024, 2048, device='cuda')
B = torch.randn(2048, 4096, device='cuda')
C = matmul_tensor_core(A, B)  # Fast FP16 Tensor Core matmul
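
Casting FP32 inputs to FP16 trades precision for throughput. The sketch below uses the standard library's half-precision `struct` format to show the rounding a single value goes through; it illustrates the cast only, not the matmul, and `to_fp16` is a hypothetical helper.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    # The "e" format packs a 16-bit float; unpacking returns it as a Python float.
    return struct.unpack("<e", struct.pack("<e", x))[0]


rounded = to_fp16(0.1)                  # 0.0999755859375
rel_error = abs(rounded - 0.1) / 0.1    # about 2.4e-4
exact = to_fp16(1.0)                    # powers of two are representable exactly
```

FP16 carries a 10-bit mantissa, so each input picks up a relative error of up to 2^-11 before the multiply even starts. Whether that matters depends on the model and on how the products are accumulated.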

get_tensor_core_info()

def get_tensor_core_info(device: int = 0) -> dict

Returns: Dict with device, compute_capability, supported, architecture, fp16_tflops (65 for T4), int8_tops (130 for T4), estimated_speedup.
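
The peak figures give a lower bound on kernel time rather than an expected runtime. As a rough worked example, the sketch below applies the 65 TFLOPS FP16 figure to the [1024, 2048] x [2048, 4096] matmul used earlier; both helper names are hypothetical.

```python
def matmul_flops(m: int, k: int, n: int) -> int:
    # Each of the m * n outputs needs k multiplies and k adds.
    return 2 * m * k * n


def ideal_time_ms(flops: int, peak_tflops: float) -> float:
    # Time at 100% utilisation of the quoted peak; real kernels are slower.
    return flops / (peak_tflops * 1e12) * 1e3


flops = matmul_flops(1024, 2048, 4096)                  # 2**34, about 17.2 GFLOP
best_case_ms = ideal_time_ms(flops, peak_tflops=65.0)   # about 0.26 ms
```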

optimize_for_tensor_cores()

def optimize_for_tensor_cores(model: torch.nn.Module, dtype: torch.dtype = torch.float16) -> torch.nn.Module

Moves the model to CUDA, converts it to dtype (FP16 by default), switches it to eval mode, and enables cuDNN benchmark mode.

TensorCoreMatMul

Drop-in torch.nn.Module replacement for torch.matmul with automatic FP16 conversion.

matmul = TensorCoreMatMul(dtype=torch.float16)
C = matmul(A, B)

FlashAttentionConfig

Configuration for FlashAttention v2/v3 integration.

@dataclass
class FlashAttentionConfig:
    enabled: bool = True
    version: int = 2
    causal: bool = True
    dropout_p: float = 0.0
    softmax_scale: Optional[float] = None
    window_size: Optional[Tuple[int, int]] = None  # Sliding window

enable_flash_attention()

def enable_flash_attention(
    model: torch.nn.Module,
    config: Optional[FlashAttentionConfig] = None,
) -> torch.nn.Module

Enables FlashAttention for a model. Returns the model unchanged if FlashAttention is not installed or already enabled.

flash_attention_forward()

def flash_attention_forward(
    query: torch.Tensor,
    key: torch.Tensor,
    value: torch.Tensor,
    causal: bool = True,
    dropout_p: float = 0.0,
    softmax_scale: Optional[float] = None,
    window_size: Optional[Tuple[int, int]] = None,
) -> torch.Tensor

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| query | torch.Tensor | -- | [batch, seqlen, num_heads, head_dim] |
| key | torch.Tensor | -- | [batch, seqlen, num_heads, head_dim] |
| value | torch.Tensor | -- | [batch, seqlen, num_heads, head_dim] |
| causal | bool | True | Use causal masking |
| dropout_p | float | 0.0 | Dropout probability |
| softmax_scale | Optional[float] | None | Scale factor (default: 1/sqrt(head_dim)) |
| window_size | Optional[Tuple[int, int]] | None | Sliding window size |

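Returns: Attention output [batch, seqlen, num_heads, head_dim]. Falls back to standard attention when flash-attn is not installed. Upstream flash-attn 2.x lists Ampere, Ada, and Hopper GPUs as supported, so check compatibility before relying on it on a Tesla T4 (Turing).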

q = torch.randn(2, 2048, 32, 64, device='cuda', dtype=torch.float16)
k = torch.randn(2, 2048, 32, 64, device='cuda', dtype=torch.float16)
v = torch.randn(2, 2048, 32, 64, device='cuda', dtype=torch.float16)
output = flash_attention_forward(q, k, v, causal=True)
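
For reference, the "standard attention" that the fallback computes can be written out for a single head in pure Python. This is a sketch of scaled dot-product attention with the default 1/sqrt(head_dim) scale and causal masking, assuming query and key have the same sequence length. It ignores dropout and sliding windows, and `reference_attention` is a hypothetical name rather than the library's fallback code.

```python
import math
from typing import List, Optional

Matrix = List[List[float]]


def reference_attention(q: Matrix, k: Matrix, v: Matrix,
                        causal: bool = True,
                        softmax_scale: Optional[float] = None) -> Matrix:
    """Single-head attention over [seqlen][head_dim] nested lists."""
    head_dim = len(q[0])
    scale = softmax_scale if softmax_scale is not None else 1.0 / math.sqrt(head_dim)
    out: Matrix = []
    for i, q_i in enumerate(q):
        # With causal masking, position i only attends to positions 0..i.
        last = i + 1 if causal else len(k)
        scores = [scale * sum(a * b for a, b in zip(q_i, k[j])) for j in range(last)]
        m = max(scores)                              # stable softmax
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * v[j][d] for j, w in enumerate(weights)) / total
            for d in range(len(v[0]))
        ])
    return out


q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
causal_out = reference_attention(q, k, v, causal=True)
full_out = reference_attention(q, k, v, causal=False)
```

With causal masking the first position can only see itself, so its output equals the first value row; without masking it becomes a blend of both rows.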

get_optimal_context_length()

def get_optimal_context_length(
    model_size_b: float,
    available_vram_gb: float,
    use_flash_attention: bool = True,
) -> int

Estimates a context length that fits in the given VRAM. With FlashAttention, attention memory grows linearly with context length; without it, the materialized attention matrix grows quadratically.

ctx = get_optimal_context_length(1.0, 12.0, use_flash_attention=True)
# Returns 8192 for 1B model with 12GB VRAM and FlashAttention
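
The scaling difference is easy to see with a little arithmetic. The sketch below compares the size of a fully materialized FP16 score matrix with the size of one Q, K, or V tensor for the shapes used in the `flash_attention_forward` example. It illustrates the scaling argument only and is not the heuristic `get_optimal_context_length` uses; both helper names are hypothetical.

```python
def score_matrix_bytes(batch: int, heads: int, seqlen: int,
                       bytes_per_el: int = 2) -> int:
    # Standard attention stores one seqlen x seqlen matrix per head: quadratic.
    return batch * heads * seqlen * seqlen * bytes_per_el


def qkv_tensor_bytes(batch: int, heads: int, seqlen: int, head_dim: int,
                     bytes_per_el: int = 2) -> int:
    # Q, K, V and the output each grow linearly with seqlen.
    return batch * seqlen * heads * head_dim * bytes_per_el


MIB = 2 ** 20
scores_2k = score_matrix_bytes(2, 32, 2048) // MIB      # 512 MiB
scores_4k = score_matrix_bytes(2, 32, 4096) // MIB      # 2048 MiB, 4x for 2x length
tensor_2k = qkv_tensor_bytes(2, 32, 2048, 64) // MIB    # 16 MiB
tensor_4k = qkv_tensor_bytes(2, 32, 4096, 64) // MIB    # 32 MiB, 2x for 2x length
```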

KVCache / KVCacheConfig

Key-Value cache management for efficient sequential generation.

KVCacheConfig

@dataclass
class KVCacheConfig:
    max_batch_size: int = 8
    max_seq_length: int = 4096
    num_layers: int = 32
    num_heads: int = 32
    head_dim: int = 128
    dtype: torch.dtype = torch.float16
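
It is worth checking what a configuration costs in memory before using it. The sketch below assumes the cache allocates keys and values for every layer at the full `max_batch_size` and `max_seq_length`, which the source does not state; `CacheShape` is a hypothetical mirror of the fields above.

```python
from dataclasses import dataclass


@dataclass
class CacheShape:  # hypothetical mirror of the KVCacheConfig fields
    max_batch_size: int = 8
    max_seq_length: int = 4096
    num_layers: int = 32
    num_heads: int = 32
    head_dim: int = 128
    bytes_per_element: int = 2  # torch.float16


def kv_cache_bytes(s: CacheShape) -> int:
    per_token = s.num_layers * s.num_heads * s.head_dim * s.bytes_per_element
    return 2 * s.max_batch_size * s.max_seq_length * per_token  # 2 = keys + values


GIB = 2 ** 30
full = kv_cache_bytes(CacheShape()) // GIB                                          # 16
small = kv_cache_bytes(CacheShape(max_batch_size=1, max_seq_length=2048)) // GIB    # 1
```

Under that assumption the defaults come to 16 GiB, about the whole memory of a 16 GB Tesla T4 before any model weights are loaded, so smaller batch or sequence limits are likely needed on that card.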

KVCache(config)

class KVCache:
    def update(self, layer_idx: int, k: torch.Tensor, v: torch.Tensor,
               positions: Optional[torch.Tensor] = None) -> Tuple[torch.Tensor, torch.Tensor]
    def get(self, layer_idx: int) -> Optional[Tuple[torch.Tensor, torch.Tensor]]
    def clear(self) -> None

The update method concatenates new key/value tensors with existing cached values for a layer. Returns the full cached key and value tensors.
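
The concatenating behaviour can be sketched with plain lists, one row per token. `ListKVCache` is a hypothetical stand-in that ignores the `positions` argument and the tensor layout of the real class.

```python
from typing import Dict, List, Optional, Tuple

Rows = List[List[float]]


class ListKVCache:
    """Hypothetical list-based stand-in: one (keys, values) pair per layer."""

    def __init__(self) -> None:
        self._store: Dict[int, Tuple[Rows, Rows]] = {}

    def update(self, layer_idx: int, k: Rows, v: Rows) -> Tuple[Rows, Rows]:
        old_k, old_v = self._store.get(layer_idx, ([], []))
        # Concatenate along the sequence axis; earlier tokens are kept.
        self._store[layer_idx] = (old_k + k, old_v + v)
        return self._store[layer_idx]

    def get(self, layer_idx: int) -> Optional[Tuple[Rows, Rows]]:
        return self._store.get(layer_idx)

    def clear(self) -> None:
        self._store.clear()


cache = ListKVCache()
# Prefill: three prompt tokens arrive at once for layer 0.
cache.update(0, k=[[0.1], [0.2], [0.3]], v=[[1.0], [2.0], [3.0]])
# Decode: each later step appends a single token.
keys, values = cache.update(0, k=[[0.4]], v=[[4.0]])
```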

PagedKVCache

vLLM-style paged KV-cache for reduced memory fragmentation (simplified implementation).

class PagedKVCache:
    def __init__(self, config: KVCacheConfig, page_size: int = 16): ...
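
Paging replaces one contiguous buffer per sequence with fixed-size pages, so a sequence wastes at most `page_size - 1` slots. A minimal sketch of the index arithmetic follows; the helper names are hypothetical and the mapping from logical to physical pages is left out.

```python
from typing import Tuple


def pages_needed(seq_len: int, page_size: int = 16) -> int:
    """Pages required to hold seq_len tokens (ceiling division)."""
    return (seq_len + page_size - 1) // page_size


def locate(position: int, page_size: int = 16) -> Tuple[int, int]:
    """Map a token position to (logical page index, offset within the page)."""
    return divmod(position, page_size)


pages = pages_needed(37)   # 37 tokens span 3 pages of 16
slot = locate(36)          # last of 37 tokens: page 2, offset 4
```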

optimize_kv_cache()

def optimize_kv_cache(model: torch.nn.Module) -> torch.nn.Module

Adds a _kv_cache attribute to the model for cache management.


BatchInferenceOptimizer

Optimized batching strategies for maximizing throughput.

BatchConfig

@dataclass
class BatchConfig:
    max_batch_size: int = 8
    max_tokens: int = 2048
    dynamic_batching: bool = True

BatchInferenceOptimizer(config)

class BatchInferenceOptimizer:
    def batch_infer(
        self,
        prompts: List[str],
        inference_fn: Callable,
        **kwargs,
    ) -> List[Any]

Splits prompts into batches of max_batch_size and processes each batch sequentially.

optimizer = BatchInferenceOptimizer(BatchConfig(max_batch_size=8))
results = optimizer.batch_infer(prompts, engine.infer)
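
The splitting step can be sketched as below. It assumes `inference_fn` accepts a list of prompts and returns one result per prompt, which the source does not confirm; `chunked_infer` and `fake_infer` are hypothetical names.

```python
from typing import Any, Callable, List


def chunked_infer(prompts: List[str],
                  inference_fn: Callable[[List[str]], List[Any]],
                  max_batch_size: int = 8) -> List[Any]:
    results: List[Any] = []
    for start in range(0, len(prompts), max_batch_size):
        batch = prompts[start:start + max_batch_size]
        results.extend(inference_fn(batch))  # one call per batch, order preserved
    return results


seen_sizes: List[int] = []


def fake_infer(batch: List[str]) -> List[str]:
    # Stand-in for a real engine: records the batch size and echoes the prompts.
    seen_sizes.append(len(batch))
    return [p.upper() for p in batch]


outputs = chunked_infer([f"p{i}" for i in range(10)], fake_infer, max_batch_size=4)
```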

ContinuousBatching

Continuous batching for overlapping generation (vLLM-style, simplified).

class ContinuousBatching:
    def __init__(self, max_batch_size: int = 8): ...
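
The benefit of continuous batching is that a finished sequence frees its slot immediately instead of waiting for the longest sequence in its batch. The toy simulation below counts decode steps under both schemes. It models only scheduling, assumes every request needs at least one token, and is not the library's implementation.

```python
from collections import deque
from typing import List


def continuous_steps(lengths: List[int], max_batch_size: int) -> int:
    """Decode steps when finished sequences are replaced immediately."""
    waiting = deque(lengths)
    active: List[int] = []
    steps = 0
    while waiting or active:
        # Fill free slots from the queue before every decode step.
        while waiting and len(active) < max_batch_size:
            active.append(waiting.popleft())
        # One step generates one token for every active sequence;
        # sequences with one token left finish and drop out.
        active = [remaining - 1 for remaining in active if remaining > 1]
        steps += 1
    return steps


def static_steps(lengths: List[int], max_batch_size: int) -> int:
    """Decode steps when each batch runs until its longest sequence ends."""
    return sum(max(lengths[i:i + max_batch_size])
               for i in range(0, len(lengths), max_batch_size))


tokens_to_generate = [3, 1, 2]
cont = continuous_steps(tokens_to_generate, max_batch_size=2)   # 3 steps
stat = static_steps(tokens_to_generate, max_batch_size=2)       # 5 steps
```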

batch_inference_optimized()

def batch_inference_optimized(
    prompts: List[str],
    model: Any,
    max_batch_size: int = 8,
    **kwargs,
) -> List[Any]

Convenience function. Accepts any callable or object with an infer method.

results = batch_inference_optimized(prompts, engine, max_batch_size=4)
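
Accepting "any callable or object with an infer method" amounts to a small dispatch step. A minimal sketch, assuming an `infer` method takes precedence over calling the object directly (the source does not say which wins); `resolve_infer_fn` and `Engine` are hypothetical names.

```python
from typing import Any, Callable, List


def resolve_infer_fn(model: Any) -> Callable[..., Any]:
    infer = getattr(model, "infer", None)
    if callable(infer):
        return infer          # object exposing an infer() method
    if callable(model):
        return model          # plain function, lambda, or callable object
    raise TypeError("model must be callable or expose an infer() method")


class Engine:
    def infer(self, prompts: List[str]) -> List[int]:
        return [len(p) for p in prompts]


from_method = resolve_infer_fn(Engine())(["ab", "cde"])
from_callable = resolve_infer_fn(lambda prompts: [p[::-1] for p in prompts])(["ab"])
```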