ollama/x
Jesse Gross 96e36c0d90 mlxrunner: share KV cache across conversations with common prefixes
Enable multiple conversations to reuse cached computations when they
share token prefixes (e.g. the same system prompt). A prefix trie
tracks shared regions so switching between conversations only
recomputes tokens that diverge. Inactive conversation state is paged
from active GPU memory to other memory and restored on demand, with LRU
eviction to keep memory usage bounded.
2026-03-18 16:06:33 -07:00
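
The prefix sharing described in the commit message can be illustrated with a small token trie. This is a minimal sketch, not the mlxrunner implementation: the names `trieNode`, `insert`, and `sharedPrefixLen` are hypothetical, and paging between memory tiers and LRU eviction are left out. It only shows how a trie can report how many leading tokens of a new prompt are already cached, so that only the diverging tail needs to be recomputed.

```go
package main

import "fmt"

// trieNode is a hypothetical node in a token-prefix trie. In a real cache
// each node would also reference the KV state for its token span.
type trieNode struct {
	children map[int32]*trieNode
}

func newTrieNode() *trieNode {
	return &trieNode{children: map[int32]*trieNode{}}
}

// insert records a conversation's token sequence in the trie.
func (n *trieNode) insert(tokens []int32) {
	cur := n
	for _, t := range tokens {
		next, ok := cur.children[t]
		if !ok {
			next = newTrieNode()
			cur.children[t] = next
		}
		cur = next
	}
}

// sharedPrefixLen returns how many leading tokens of the given sequence
// are already present in the trie, i.e. could be served from cache.
func (n *trieNode) sharedPrefixLen(tokens []int32) int {
	cur := n
	for i, t := range tokens {
		next, ok := cur.children[t]
		if !ok {
			return i
		}
		cur = next
	}
	return len(tokens)
}

func main() {
	root := newTrieNode()

	// First conversation: tokens 1, 2, 3 stand in for a shared system prompt.
	root.insert([]int32{1, 2, 3, 4})

	// Second conversation shares the first three tokens, then diverges.
	prompt := []int32{1, 2, 3, 9, 10}
	reuse := root.sharedPrefixLen(prompt)
	fmt.Printf("reuse %d cached tokens, recompute %d\n", reuse, len(prompt)-reuse)
}
```

In this sketch the second prompt reuses three tokens and recomputes two. A production cache would additionally have to track which trie regions are resident in GPU memory and which have been paged out.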
agent x/cmd: enable web search and web fetch with flag (#13690) 2026-01-12 13:59:40 -08:00
cmd Reapply "don't require pulling stubs for cloud models" again (#14608) 2026-03-06 14:27:47 -08:00
create mlx: add prequantized tensor packing + changes for qwen35 (#14878) 2026-03-17 11:21:18 -07:00
imagegen mlx: add prequantized tensor packing + changes for qwen35 (#14878) 2026-03-17 11:21:18 -07:00
mlxrunner mlxrunner: share KV cache across conversations with common prefixes 2026-03-18 16:06:33 -07:00
models mlx: quantized embeddings, fast SwiGLU, and runtime fixes (#14884) 2026-03-17 11:21:38 -07:00
server bugfix: display the parameter count correctly in mlx for ollama show (#14285) 2026-02-16 13:03:34 -08:00
tokenizer mlx: quantized embeddings, fast SwiGLU, and runtime fixes (#14884) 2026-03-17 11:21:38 -07:00
tools add ability to disable cloud (#14221) 2026-02-12 15:47:00 -08:00