Commit graph

5339 commits

Author SHA1 Message Date
Devon Rifkin c9b2dcfc52
anthropic: fix empty inputs in content blocks (#15105)
* anthropic: fix empty inputs in content blocks

When we switched to `api.ToolCallFunctionArguments`, `omitempty` stopped
doing what we were relying on it for before. This would cause non-tool
content blocks to have an `"input": {}` field, which doesn't match our
old behavior.

* use omitzero instead
2026-03-27 15:41:27 -07:00
Parth Sareen b00bd1dfd4
launch: skip context length warning for MLX models and show model name (#15102) 2026-03-27 15:01:33 -07:00
Jesse Gross ac83ac20c4 anthropic: fix KV cache reuse degraded by tool call argument reordering
Use typed structs for tool call arguments instead of map[string]any to
preserve JSON key order, which Go maps do not guarantee.
2026-03-27 14:30:16 -07:00
Bruce MacDonald e7ccc129ea
app: fix false "out of date" model warnings (#15101)
The staleness check compared the local manifest digest (SHA256 of the
file on disk) against the registry's Ollama-Content-Digest header.
These never matched because PullModel re-serializes the manifest JSON
before writing, producing different bytes than the registry's original.

The fallback comparison (local modified_at vs upstream push time) was
also broken: the generated TypeScript Time class discards the actual
timestamp value, so Date parsing always produced NaN.

Fix by moving the staleness comparison server-side where we have
reliable access to both the local manifest file mtime and the upstream
push time. The /api/v1/model/upstream endpoint now returns a simple
`stale` boolean instead of raw digests for the frontend to compare.

Also adds User-Agent to the CORS allowed headers for dev mode.
2026-03-27 14:15:10 -07:00
Jeffrey Morgan 69ed0c2729
parsers: qwen3.5 streaming tool-call parsing and add regression test (#15098) 2026-03-27 14:04:14 -07:00
Alfredo Matas 1cefa749aa
model/parsers: close think block if tool block starts in Qwen3.5 (#15022) 2026-03-27 11:28:34 -07:00
Daniel Hiltgen aec2fef95d
ci: harden cuda include path handling (#15093)
On windows we can get multiple include dirs, so find where the headers are then
copy from that location.
2026-03-27 07:57:07 -07:00
Eva H 366625a831
launch: warn when server context length is below 64k for local models (#15044)
A stop-gap for now to guide users better. We'll add more in-depth recommendations per integration as well.

---------

Co-authored-by: Parth Sareen <parth.sareen@ollama.com>
2026-03-27 00:15:53 -07:00
Daniel Hiltgen 516ebd8548
ci: include mlx jit headers on linux (#15083)
* ci: include mlx jit headers on linux

* handle CUDA JIT headers
2026-03-26 23:10:07 -07:00
Parth Sareen f567abc63f
tui: update chat title (#15082) 2026-03-26 18:06:53 -07:00
Eva H 1adfc27f04
launch/vscode: prefer known vs code paths over code on PATH (#15073) 2026-03-26 18:06:28 -04:00
Parth Sareen 4a2b9f9dbc
launch: hide cline integration (#15080) 2026-03-26 14:33:43 -07:00
Parth Sareen e46b67a6cc
launch: hide vs code (#15076) 2026-03-26 13:52:50 -07:00
Eva H c000afe76c
doc: update vscode doc (#15064)
---------

Co-authored-by: ParthSareen <parth.sareen@ollama.com>
2026-03-26 13:45:48 -07:00
Jesse Gross 9d7b18f81e mlxrunner: combine setStateRaw and setStateDetached into setState 2026-03-26 13:32:11 -07:00
Jesse Gross 4f5999fd3f mlxrunner: schedule periodic snapshots during prefill
Add periodic snapshots every 8k tokens and near the end of the prompt
so that long prompts can be partially restored and thinking/generation
can be retried without full reprocessing.
2026-03-26 13:32:11 -07:00
Jesse Gross ac5f0dbb6a mlxrunner: improve eviction and LRU tracking
Update LRU last used time just on the nodes that actually used
during processing rather than all snapshots along the path. This
allows eviction to remove nodes more accurately so we can avoid
other heuristics to auto-merge nodes.
2026-03-26 13:32:11 -07:00
Jesse Gross d1151e18a1 mlx: fix KV cache snapshot memory leak
mlx.Copy shares the backing buffer with its source (via
copy_shared_buffer) rather than allocating independent storage.
When used to snapshot a slice of the KV cache, the snapshot array
holds the entire original cache buffer alive through the shared
data pointer — even after eval detaches the computation graph.

Replace Copy with Contiguous in Snapshot and Split. Contiguous
allocates a compact buffer when the source buffer is significantly
larger than the logical slice (Contiguous::eval checks
buffer_size > nbytes + 16384), which is always the case for KV
cache slices.
2026-03-25 17:26:34 -07:00
rick ebbce136c7 ggml: force flash attention off for grok 2026-03-25 16:15:49 -07:00
Devon Rifkin 26b9f53f8e
api/show: overwrite basename for copilot chat (#15062)
Copilot Chat prefers to use `general.basename` in the built-in Ollama
integration, but this name isn't usually shown directly to users (and
there may be many models that share this name). Instead we pass back
`req.Model`, which for this extension is the value that we return from
`/api/tags`
2026-03-25 14:02:22 -07:00
Eva H 7575438366
cmd: ollama launch vscode (#15060)
Co-authored-by: Parth Sareen <parth.sareen@ollama.com>
2026-03-25 16:37:02 -04:00
Eva H 7d7c90d702
tui: add left arrow back navigation in model selector (#14940) 2026-03-25 11:53:48 -07:00
Daniel Hiltgen 4fda69809a
ci: fix windows cgo compiler error (#15046) 2026-03-24 16:45:36 -07:00
Daniel Hiltgen c9b5da6b0c
integration: improve ability to test individual models (#14948)
* integration: improve ability to test individual models

Add OLLAMA_TEST_MODEL env var to run integration tests against a
single model.

Enhance vision tests: multi-turn chat with cached image tokens, object
counting, spatial reasoning, detail recognition, scene understanding, OCR, and
multi-image comparison.

Add tool calling stress tests with complex agent-style prompts, large
system messages, and multi-turn tool response handling.

* review comments
2026-03-24 14:28:23 -07:00
Patrick Devine de5cb7311f
mlx: add mxfp4/mxfp8/nvfp4 importing (#15015)
This change allows importing bf16 and converting to mxfp4/mxfp8/nvfp4
and also importing fp8 and converting directly to mxfp8.
2026-03-24 13:45:44 -07:00
Jesse Gross 95ee7fbd29 mlxrunner: panic on double unpin 2026-03-23 17:44:19 -07:00
Jesse Gross ec55536734 mlxrunner: show time since last used in cache dump tree 2026-03-23 17:44:19 -07:00
Jesse Gross 77491439c2 mlxrunner: support partial match on pure transformer caches
Previously, a partial match within a node's edge would truncate the path
to the parent snapshot - effectively making all cache types behave as
recurrent caches. Caches with only transformer layers can rewind to
arbitrary boundary so this restores this capability to improve cache
hits
2026-03-23 17:44:19 -07:00
Parth Sareen b166b36cd2
docs: update Claude Code with Telegram guide (#15026) 2026-03-23 16:31:21 -07:00
Daniel Hiltgen c2b0bb7a52
mlx: update as of 3/23 (#14789)
* mlx: update to HEAD on 3/23

Also fixes a few misc vendoring bugs uncovered with this first update.
This also renames the version files to make them clearer.

* CUDA Fast Gated Delta kernel

* mlx: detect eval errors and panic

On model errors or missing kernels, don't mask the error, bubble it up.
2026-03-23 11:28:44 -07:00
Bruce MacDonald 22c2bdbd8a
docs: nemoclaw integration (#14962)
---------

Co-authored-by: ParthSareen <parth.sareen@ollama.com>
2026-03-20 15:27:37 -07:00
Bruce MacDonald 6df6d097d9
launch: skip openclaw gateway health check when no daemon install (#14984) 2026-03-20 15:20:14 -07:00
Jesse Gross d7c176ab91 llm, mlxrunner: fix done channel value consumed by first receiver
Receiving from a buffered chan error consumes the value, so only the
first caller (WaitUntilRunning, HasExited, or Close) sees the signal.
Subsequent receivers block or take the wrong branch. Replace with a
closed chan struct{} which can be received from any number of times,
and store the error in a separate field.
2026-03-19 17:44:28 -07:00
Jesse Gross 0ff7d724ff mlx: fix subprocess log deadlock
The stderr reader used bufio.Scanner which has a 64KB max line size.
If the subprocess wrote a line exceeding this limit, the scanner would
stop reading, the OS pipe buffer would fill, and the subprocess would
deadlock.

Replace the scanner with a statusWriter that wraps io.Copy. The writer
forwards all stderr to os.Stderr while capturing the last short line
(≤256 bytes) for error reporting, avoiding both the deadlock and the
need to buffer arbitrarily long lines.
2026-03-19 17:44:28 -07:00
Devon Rifkin 46cb7795e1
add ability to turn on debug request logging (#14106)
If `OLLAMA_DEBUG_LOG_REQUESTS` is set, then on server startup a temp
folder will be created. Upon any inference request, the body will be
logged to a file in this folder, as well as a small shell script to
"replay" the request using cURL.

This is just intended for debugging scenarios, not as something to turn
on normally.
2026-03-19 17:08:17 -07:00
Bruce MacDonald 126d8db7f3
parsers: robust xml tool repair (#14961)
Previous xml repair for glm was a good start, but we need to go further and repair any incorrect open or closing tags

Co-authored-by: Dongluo Chen <dongluo.chen@gmail.com>
2026-03-19 11:24:48 -07:00
Eva H 3f3a24b418
app: fix desktop app stuck loading when OLLAMA_HOST is an unspecified bind address (#14885) 2026-03-19 12:57:57 -04:00
Jesse Gross 96e36c0d90 mlxrunner: share KV cache across conversations with common prefixes
Enable multiple conversations to reuse cached computations when they
share token prefixes (e.g. the same system prompt). A prefix trie
tracks shared regions so switching between conversations only
recomputes tokens that diverge. Inactive conversation state is paged
from active GPU memory to other memory and restored on demand, with LRU
eviction to keep memory usage bounded.
2026-03-18 16:06:33 -07:00
Jesse Gross 6f8ddbb26b mlxrunner: fix Slice(0, 0) returning full dimension instead of empty
Slice used cmp.Or to resolve a zero stop value to the dimension size,
intended to support open-ended slices like a[i:]. This made Slice(0, 0)
indistinguishable from Slice(), so any slice with a zero stop would
silently include the entire dimension instead of being empty.

Replace cmp.Or with an explicit End sentinel and resolve negative
indices against the dimension size, matching Python/PyTorch semantics.
2026-03-18 16:06:33 -07:00
Eva H b5e7888414
cmd/launch: skip redundant config writes when model unchanged (#14941) 2026-03-18 17:36:52 -04:00
Parth Sareen eab4d22269
docs: update claude code and openclaw for web search (#14922) 2026-03-18 14:18:49 -07:00
Bruce MacDonald 5759c2d2d2
launch: fix openclaw not picking up newly selected model (#14943)
Sessions with a stale model field were not updated when the primary
changed, so the old model continued to be used.
2026-03-18 13:20:10 -07:00
Bruce MacDonald 42b1c2642b
docs: update minimax-m2.5 references to m2.7 (#14942) 2026-03-18 12:59:28 -07:00
Bruce MacDonald 727d69ddf3
tui: fix signin on headless Linux systems (#14627)
Defensively handle environments without a display server to ensure signin remains usable on headless VMs and SSH sessions.

- Skip calling xdg-open when neither DISPLAY nor WAYLAND_DISPLAY is set, preventing silent failures or unexpected browser handlers
- Render the signin URL as plain text instead of wrapping it in OSC 8 hyperlink escape sequences, which can be garbled or hidden by terminals that don't support them
2026-03-18 11:11:17 -07:00
Jesse Gross f622b0c5fc launch: disable claude attribution header to preserve KV cache
Claude Code sends an x-anthropic-billing-header that changes on every
request. This is embedded in the system prompt and consequently
breaks the KV cache for every request. Given the size of the prompts
that Claude Code usees, this has significant performance impact.
2026-03-17 20:48:03 -07:00
Bruce MacDonald 5d0000634c
cmd/launch: check for both npm and git before installing OpenClaw (#14888)
The OpenClaw installer requires git in addition to npm. Update the
dependency check to detect both and provide specific install guidance
for whichever dependencies are missing.
2026-03-17 18:20:05 -07:00
Parth Sareen 676d9845ba
launch: register websearch for openclaw (#14914) 2026-03-17 15:03:15 -07:00
Devon Rifkin e37a9b4c01
cloud_proxy: for the web_search legacy path, flush on newlines (#14897)
`WebSearchAnthropicWriter` expects a single object per write. The new
transparent proxy will instead send it whatever bytes it sees. This
cloud-model + local-orchestration + cloud-search is a temporary code
path, so instead of making the web search code more robust to this, I
put an adapter in the middle that will flush line-by-line to preserve
the old behavior.
2026-03-17 13:30:17 -07:00
Patrick Devine d727aacd04
mlx: quantized embeddings, fast SwiGLU, and runtime fixes (#14884)
Add QuantizedEmbedding and EmbeddingLayer interface so models can
use quantized embedding weights and expose tied output projections.
This change updates gemma3, glm4_moe_lite, llama, qwen3, and qwen3_5
to use the new interface.
2026-03-17 11:21:38 -07:00
Patrick Devine fa69b833cd
mlx: add prequantized tensor packing + changes for qwen35 (#14878)
This change adds a tensorImportTransform interface for model-specific
tensor transformations during safetensors import. This allows importing
and modifying the standard HF based weights as well as the mlx-community
derived pre-quantized safetensors repos to be directly
imported into `ollama create`. Right now this only works with Qwen3.5
importing which does tensor renaming, norm weight shifting (it
adds +1 to each value of the norm vectors), conv1d transposition,
and casts to BF16s for F32 based vectors.
2026-03-17 11:21:18 -07:00