Commit graph

448 commits

Author SHA1 Message Date
Jeffrey Morgan 4eab60c1e2
Reapply "don't require pulling stubs for cloud models" again (#14608)
* Revert "Revert "Reapply "don't require pulling stubs for cloud models"" (#14606)"

This reverts commit 39982a954e.

* fix test + do cloud lookup only when seeing cloud models

---------

Co-authored-by: ParthSareen <parth.sareen@ollama.com>
2026-03-06 14:27:47 -08:00
Parth Sareen 122c68c151
server: loosen thinking level constraint (#14625) 2026-03-04 13:42:18 -08:00
Jeffrey Morgan 39982a954e
Revert "Reapply "don't require pulling stubs for cloud models"" (#14606)
This reverts commit 799e51d419.
2026-03-03 20:56:10 -08:00
Patrick Devine 110eff01a9
chore: remove old imagegen LLMs models (#14597)
These models are implemented in the x/mlxrunner instead.
2026-03-03 13:23:40 -08:00
Jeffrey Morgan 799e51d419
Reapply "don't require pulling stubs for cloud models"
This reverts commit 97d2f05a6d.
2026-03-03 13:17:10 -08:00
Jeffrey Morgan 97d2f05a6d
Revert "don't require pulling stubs for cloud models (#14574)" (#14596)
This reverts commit 8207e55ec7.
2026-03-03 12:51:23 -08:00
Devon Rifkin 8207e55ec7
don't require pulling stubs for cloud models (#14574)
* don't require pulling stubs for cloud models

This is a first in a series of PRs that will better integrate Ollama's
cloud into the API and CLI. Previously we used to have a layer of
indirection where you'd first have to pull a "stub" model that contains
a reference to a cloud model. With this change, you don't have to pull
first, you can just use a cloud model in various routes like `/api/chat`
and `/api/show`. This change respects
<https://github.com/ollama/ollama/pull/14221>, so if cloud is disabled,
these models won't be accessible.

There's also a new, simpler pass-through proxy that doesn't convert the
requests ahead of hitting the cloud models, which they themselves
already support various formats (e.g., `v1/chat/completions` or Open
Responses, etc.). This will help prevent issues caused by double
converting (e.g., `v1/chat/completions` converted to `api/chat` on the
client, then calling cloud and converting back to a
`v1/chat/completions` response instead of the cloud model handling the
original `v1/chat/completions` request first).

There's now a notion of "source tags", which can be mixed with existing
tags. So instead of having different formats like`gpt-oss:20b-cloud` vs.
`kimi-k2.5:cloud` (`-cloud` suffix vs. `:cloud`), you can now specify
cloud by simply appending `:cloud`. This PR doesn't change model
resolution yet, but sets us up to allow for things like omitting the
non-source tag, which would make something like `ollama run
gpt-oss:cloud` work the same way that `ollama run gpt-oss` already works
today.

More detailed changes:

- Added a shared model selector parser in `types/modelselector`:
  - supports `:cloud` and `:local`
  - accepts source tags in any position
  - supports legacy `:<tag>-cloud`
  - rejects conflicting source tags
- Integrated selector handling across server inference/show routes:
  - `GenerateHandler`, `ChatHandler`, `EmbedHandler`,
    `EmbeddingsHandler`, `ShowHandler`
- Added explicit-cloud passthrough proxy for ollama.com:
  - same-endpoint forwarding for `/api/*`, `/v1/*`, and `/v1/messages`
  - normalizes `model` (and `name` for `/api/show`) before forwarding
  - forwards request headers except hop-by-hop/proxy-managed headers
  - uses bounded response-header timeout
  - handles auth failures in a friendly way
- Preserved cloud-disable behavior (`OLLAMA_NO_CLOUD`)
- Updated create flow to support `FROM ...:cloud` model sources (though
  this flow uses the legacy proxy still, supporting Modelfile overrides
  is more complicated with the direct proxy approach)
- Updated CLI/TUI/config cloud detection to use shared selector logic
- Updated CLI preflight behavior so explicit cloud requests do not
  auto-pull local stubs

What's next?

- Cloud discovery/listing and cache-backed `ollama ls` / `/api/tags`
- Modelfile overlay support for virtual cloud models on OpenAI/Anthropic
  request families
- Recommender/default-selection behavior for ambiguous model families
- Fully remove the legacy flow

Fixes: https://github.com/ollama/ollama/issues/13801

* consolidate pull logic into confirmAndPull helper

pullIfNeeded and ShowOrPull shared identical confirm-and-pull logic.
Extract confirmAndPull to eliminate the duplication.

* skip local existence checks for cloud models

ModelExists and the TUI's modelExists both check the local model list,
which causes cloud models to appear missing. Return true early for
explicit cloud models so the TUI displays them beside the integration
name and skips re-prompting the model picker on relaunch.

* support optionally pulling stubs for newly-style names

We now normalize names like `<family>:<size>:cloud` into legacy-style
names like `<family>:<size>-cloud` for pulling and deleting (this also
supports stripping `:local`). Support for pulling cloud models is
temporary, once we integrate properly into `/api/tags` we won't need
this anymore.

* Fix server alias syncing

* Update cmd/cmd.go

Co-authored-by: Parth Sareen <parth.sareen@ollama.com>

* address comments

* improve some naming

---------

Co-authored-by: ParthSareen <parth.sareen@ollama.com>
2026-03-03 10:46:33 -08:00
Jesse Gross ad16bffc7d mlx: Remove peak memory from the API
This is still in flux so it is better to just log it for now.
2026-03-02 15:56:18 -08:00
Jesse Gross a16f96658b mlxrunner: Enforce model context limit
Currently, context length is unbounded - the cache will keep
growing forever independent of the model's trained context
length. This caps it and enforces semantics similar to most
cloud services:
 - Long prompts will result in an error, not truncation.
 - Generation that exceeds the context will be stopped
2026-02-27 17:29:47 -08:00
Jesse Gross 638faeac54 mlxrunner: Report actual memory usage from runner
The MLX runner previously reported a static VRAM estimate that was
computed at load time and consisted only of the weights. This is
strictly less than the actual memory usage, as it does not include
the KV cache or compute graph.
2026-02-27 17:29:47 -08:00
Patrick Devine 79917cf80b
show peak memory usage (#14485) 2026-02-26 18:38:27 -08:00
Patrick Devine d18dcd7775
mlxrunner fixes (#14247)
* load glm4_moe_lite from the mlxrunner

* fix loading diffusion models

* remove log lines

* fix --imagegen flag
2026-02-13 22:30:42 -08:00
Devon Rifkin 948de6bbd2
add ability to disable cloud (#14221)
* add ability to disable cloud

Users can now easily opt-out of cloud inference and web search by
setting

```
"disable_ollama_cloud": true
```

in their `~/.ollama/server.json` settings file. After a setting update,
the server must be restarted.

Alternatively, setting the environment variable `OLLAMA_NO_CLOUD=1` will
also disable cloud features. While users previously were able to avoid
cloud models by not pulling or `ollama run`ing them, this gives them an
easy way to enforce that decision. Any attempt to run a cloud model when
cloud is disabled will fail.

The app's old "airplane mode" setting, which did a similar thing for
hiding cloud models within the app is now unified with this new cloud
disabled mode. That setting has been replaced with a "Cloud" toggle,
which behind the scenes edits `server.json` and then restarts the
server.

* gate cloud models across TUI and launch flows when cloud is disabled

Block cloud models from being selected, launched, or written to
integration configs when cloud mode is turned off:

- TUI main menu: open model picker instead of launching with a
  disabled cloud model
- cmd.go: add IsCloudModelDisabled checks for all Selection* paths
- LaunchCmd: filter cloud models from saved Editor configs before
  launch, fall through to picker if none remain
- Editor Run() methods (droid, opencode, openclaw): filter cloud
  models before calling Edit() and persist the cleaned list
- Export SaveIntegration, remove SaveIntegrationModel wrapper that
  was accumulating models instead of replacing them

* rename saveIntegration to SaveIntegration in config.go and tests

* cmd/config: add --model guarding and empty model list fixes

* Update docs/faq.mdx

Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>

* Update internal/cloud/policy.go

Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>

* Update internal/cloud/policy.go

Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>

* Update server/routes.go

Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>

* Revert "Update internal/cloud/policy.go"

This reverts commit 8bff8615f9.

Since this error shows up in other integrations, we want it to be
prefixed with Ollama

* rename cloud status

* more status renaming

* fix tests that weren't updated after rename

---------

Co-authored-by: ParthSareen <parth.sareen@ollama.com>
Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>
2026-02-12 15:47:00 -08:00
Michael Yang 6ddd8862cd
chore: move x/mlxrunner into x/imagegen (#14100) 2026-02-05 18:25:56 -08:00
Parth Sareen 5f53fe7884
cmd: ollama launch improvements (#14099) 2026-02-05 15:08:17 -08:00
Jeffrey Morgan cefabd79a8
Revert "cmd: claude launch improvements (#14064)" (#14071)
This reverts commit ee25219edd.
2026-02-04 09:10:37 -08:00
Parth Sareen ee25219edd
cmd: claude launch improvements (#14064) 2026-02-03 19:33:58 -08:00
Jesse Gross 0334ffa625 server: use tiered VRAM-based default context length
Replace binary low VRAM mode with tiered VRAM thresholds that set
default context lengths for all models:

- < 24 GiB VRAM: 4,096 context
- 24-48 GiB VRAM: 32,768 context
- >= 48 GiB VRAM: 262,144 context
2026-02-02 10:47:09 -08:00
Jesse Gross d11fbd2c60 server: fix ollama ps showing configured instead of actual context length
When context length is clamped to the model's trained context length,
ollama ps now shows the actual clamped value instead of the originally
configured value.
2026-02-02 10:47:09 -08:00
Jeffrey Morgan 862bc0a3bf
x/imagegen: respect stream=false in /api/generate (#13853)
When stream=false is set for image generation requests, return a single
JSON response instead of streaming multiple ndjson progress updates.
2026-01-22 22:16:39 -08:00
Jeffrey Morgan c01608b6a1
x/imagegen: add image edit capabilities (#13846) 2026-01-22 20:35:08 -08:00
Patrick Devine 148a1be0a3
Clean up the manifest and modelpath (#13807) 2026-01-21 11:46:17 -08:00
Devon Rifkin e98b5e8b4e
/api/show: default to empty model_info (#13785)
For `/api/show`, a fully missing `model_info` field trips up various
integrators (including a recent Android Studio integration).

The primary source of missing info tends to come from models with a
remote that are also missing other data. It seems better to me to return
an empty `model_info` than making up some other fields within
`model_info` (like saying the architecture is `remote` or something like
that). So this does slightly change `/api/show`'s behavior that possibly
someone is relying on, but it seems more important to ensure the field
is always there (from a quick sampling integrations seem to be robust to
missing fields _within_ it).

Fixes: https://github.com/ollama/ollama/issues/13783
2026-01-19 15:26:17 -08:00
Jeffrey Morgan 68e00c7c36
fix: prevent image generation models from loading during deletion (#13781)
Move the unload check (empty prompt + KeepAlive=0) before the image
generation model dispatch in GenerateHandler. This prevents models like
flux from being loaded into memory just to be immediately unloaded when
running `ollama rm`.

Also fix a bug in DeleteHandler where `args[0]` was used instead of
`arg` in the delete loop, causing only the first model to be unloaded
when deleting multiple models.
2026-01-19 12:48:34 -08:00
Jeffrey Morgan 634c416645
Add experimental image generation fields to /api/generate (#13753)
Request fields (experimental):
- width: image width (max 4096)
- height: image height (max 4096)
- steps: denoising steps
- seed: random seed

Response fields (experimental):
- images: base64-encoded generated images
- completed: current step progress
- total: total steps

Other changes:
- Fix lifecycle bug where image models wouldn't unload (refCount issue)
- Fix "headers already written" error on Ctrl+C during streaming
- Add gin middleware for OpenAI /v1/images/generations compatibility
- Update CLI to use /api/generate with progress bar
- Add preload support in interactive mode
2026-01-17 18:27:41 -08:00
Patrick Devine a077d996e3
Fix create and show commands for experimental models (#13741)
* x: make `ollama create --experimental` import from safetensors

This change allows pulling in safetensors models into the new experimental model format, and also
fixes the `ollama show` command to be able to correctly display the model information.

* gofumpt the linter

* gofumpt the linter again

* validate the model name
2026-01-16 14:31:55 -08:00
Jeffrey Morgan c23d5095de
x/imagegen: clean up image generation code (#13725) 2026-01-16 12:19:25 -08:00
Parth Sareen 35c3c9e3c2
anthropic: allow non-thinking models when using Anthropic API (#13692) 2026-01-12 15:13:26 -08:00
Jeffrey Morgan 9667c2282f
x/imagegen: add naive TeaCache and FP8 quantization support (#13683)
TeaCache:
- Timestep embedding similarity caching for diffusion models
- Polynomial rescaling with configurable thresholds
- Reduces transformer forward passes by ~30-50%

FP8 quantization:
- Support for FP8 quantized models (8-bit weights with scales)
- QuantizedMatmul on Metal, Dequantize on CUDA
- Client-side quantization via ollama create --quantize fp8

Other bug fixes:
- Fix `/api/show` API for image generation models
- Server properly returns model info (architecture, parameters, quantization)
- Memory allocation optimizations
- CLI improvements for image generation
2026-01-12 13:45:22 -08:00
Jeffrey Morgan 2584940016
Add z-image image generation prototype (#13659) 2026-01-09 21:09:46 -08:00
Parth Sareen 2b2cda7a2b
api: implement anthropic api (#13600)
* api: add Anthropic Messages API compatibility layer

Add middleware to support the Anthropic Messages API format at /v1/messages.
This enables tools like Claude Code to work with Ollama local and cloud models through the
Anthropic API interface.
2026-01-09 11:53:36 -08:00
lif 37f6f3af24
server: return error when embedding contains NaN or Inf values (#13599)
The normalize function now checks for NaN and Inf values in the
embedding vector before processing. This prevents JSON encoding
failures when models produce invalid floating-point values.

Fixes #13572

Signed-off-by: majiayu000 <1835304752@qq.com>
2026-01-03 02:20:12 -05:00
Jeffrey Morgan 8852220f59
add REQUIRES command to Modelfile (#13361) 2025-12-18 13:21:29 -08:00
Bruce MacDonald 45c4739374
types: ConfigV2 and RootFS (#13504)
Refactored the ConfigV2 and RootFS types from server/images.go to a new types/model/config.go file under the model package. Updated all references to use model.ConfigV2 and model.RootFS. This allows for use in other projects without worrying about compiling the c code in the llama package.
2025-12-16 15:18:17 -08:00
Devon Rifkin 1eb5e75972
openai: add v1/responses support (#13351)
Only supporting the stateless part of the API.

Doc updates to come once this is shipped.

Closes: #9659
2025-12-11 15:37:10 -08:00
EasonLin 1c4e85b4df
routes: add logprobs in tool calls (#13238) 2025-12-10 17:28:41 -08:00
nicole pardal e082d60a24
truncation: fixed runner truncation logic + removed server truncation (#12839)
This PR consolidates all embedding prompt-length checking, truncation, and prompt token counting into the runner to ensure a single source of truth.
2025-12-08 11:20:28 -08:00
Sos Pogosyan 31b8c6a214
fix(api): correct Content-Type header for /api/chat and /api/generate when using cloud models (#13279)
---------

Co-authored-by: Pogosyan Sos <sos_pogosyan@MacBook-Pro-Sos.local>
Co-authored-by: Patrick Devine <patrick@infrahq.com>
2025-12-04 21:33:07 -08:00
Grace d70e935526
Parser for Cogito v2 (#13145) 2025-11-19 17:21:07 -08:00
Baptiste Jamin 59241c5bee
server: add logprobs and top_logprobs support to Ollama's API (#12899)
Adds logprobs support to Ollama's API including support for Ollama's
OpenAI-compatible API. By specifying the new 'logprobs' boolean parameter
in the API, Ollama will return the log probabilities for each token generated.
'top_logprobs', an integer value can also be specified up to the value 20.
When specified, the API will also provide the number of most likely tokens to
return at each token position

Co-authored-by: Baptiste Jamin <baptiste@crisp.chat>
2025-11-11 08:49:50 -08:00
Grace 809b9c68fa
Add Tool Call ID (#12956)
* routes/types: add tool call id

---------

Co-authored-by: ParthSareen <parth.sareen@ollama.com>
2025-11-04 16:43:33 -08:00
Michael Yang 7d25b9e194
feat(model): add qwen3vl (#12665) 2025-10-28 17:39:47 -07:00
Patrick Devine 29f63f37c8
Revert "server: Consolidate embedding truncation in runner (#12730)" (#12810)
This reverts commit 5d347f6d6f.
2025-10-28 14:49:14 -07:00
nicole pardal 5d347f6d6f
server: Consolidate embedding truncation in runner (#12730)
Currently, checking the length of prompts for embeddings to ensure
they fit in the context window (and possible truncation) occurs in
two places - the Ollama server and runner. This can lead to
inconsistencies in both the checks and reported number of tokens
processed. Since we have to do this processing in the runner, this
consolidates all of the logic there.
2025-10-27 11:59:12 -07:00
Patrick Devine b97eb2b858
cloud: set the proxy content-type to the same as local models (#12759) 2025-10-25 10:57:10 -07:00
Patrick Devine d515aed6c3
cloud: don't error sending empty messages (#12724) 2025-10-21 18:12:14 -07:00
Jeffrey Morgan 65fb3ff49d
renderers: add global flag for setting [img] tags (#12669)
Adds a temporary global flag to renderers that causes renderers to always
render images as [img]. In a follow up change, we will consider making this
the default, and this flag could eventually be removed
2025-10-16 16:37:32 -07:00
Jeffrey Morgan 6544e14735
Reapply "add truncate and shift parameters" (#12582) 2025-10-11 16:06:14 -07:00
Devon Rifkin 6db8da9958 routes: fix built-in renderers for api/generate
Made it so when api/generate builds up a message array and generates the
prompt it now goes through the same function as `api/chat` for
consistency. This is where we hook the optional built-in renderers to
bypass templates, which was missing for `api/generate` before this
change.

Closes: #12578
2025-10-11 13:57:43 -07:00
Patrick Devine d681cd7c29
thinking: allow "think": false for non-thinking models (#12555) 2025-10-09 18:46:00 -07:00