ollama

mirror of https://github.com/ollama/ollama synced 2026-04-23 08:45:14 +00:00

Author	SHA1	Message	Date
Jesse Gross	24e038d56a	mlxrunner: add logprobs support Match the ollamarunner and OpenAI semantics: raw, full-vocab log-softmax with the top-K ranked by probability. Skipped on the GPU when the request doesn't ask for logprobs so decode doesn't pay for it otherwise.	2026-04-20 17:43:00 -07:00
Daniel Hiltgen	30fdd229a4	create: Clean up experimental paths, fix create from existing safetensor model (#14679 ) * create: Clean up experimental paths This cleans up the experimental features, and adds both unit and integration test coverage to verify no regressions. * create: preserve config and layer names when creating from safetensors models When creating a model FROM an existing safetensors model, ModelFormat, Capabilities, and layer Name fields were lost. ModelFormat stayed empty because it's only set from GGML layers (which safetensors models lack), and layer names weren't copied in parseFromModel. This caused derived models to fail loading ("config.json not found in manifest"). * review comments	2026-04-07 08:12:57 -07:00
Daniel Hiltgen	96b202d34b	Add support for gemma4 (#15214 ) * bench: add prompt calibration, context size flag, and NumCtx reporting Add --num-ctx flag to set context size, and report NumCtx in model info header. Calibrate tokens-per-word ratio during warmup using actual tokenization metrics from the model, replacing the fixed 1.3 heuristic. This produces more accurate prompt token counts for --prompt-tokens. Also add fetchContextLength() to query running model context via /api/ps. * integration: improve vision test robustness and add thinking tests Add skipIfNoVisionOverride() to skip vision tests when OLLAMA_TEST_MODEL is set to a non-vision model. Add Think:false to context exhaustion test to prevent thinking models from using all context before the test can measure it. Add third test image (ollama homepage) and replace OCR test with ImageDescription test using it. Relax match strings for broader model compatibility. Add TestThinkingEnabled and TestThinkingSuppressed to verify thinking output and channel tag handling. * gemma4: add Gemma 4 GGML model support Add full Gemma 4 model family support (E2B, E4B, 26B MoE, 31B Dense) for the GGML backend including text, vision, converter, parser, and renderer. Text model features: - Sliding window + full attention with per-layer patterns - KV sharing across layers with donor map - Per-layer embeddings (PLE) with learned projections - MoE routing with RMSNorm + learned scale - Proportional RoPE with freq_factors for global attention - Final logit softcapping Vision model features: - SigLIP vision encoder with 2D RoPE - ClippableLinear with input/output clamping via packed v.clamp_data - Adaptive average pooling with nMerge kernel - Multi-modal projection with unweighted RMSNorm Converter: - Safetensors to GGUF with vision tensor renaming - Fused MoE gate_up_proj splitting - Vision patch embedding reshape (HF to Conv2D layout) - Packed clamp data tensor for ClippableLinear bounds - Proportional RoPE freq_factors generation Also includes: - BackendGet() on ml.Tensor for reading weight tensor data - Q6_K CUDA get_rows kernel support - MoE-aware ffn_down quantization layer counting - Gemma4 parser with tool calling and thinking support - Gemma4 renderer with structured tool format - Architecture-based auto-detection of renderer/parser/stop tokens - Integration test gemma4 model list additions * gemma4: add audio support with USM conformer encoder Add audio encoding for Gemma 4 using the USM conformer architecture: - Converter: audio tensor mapping, SSCP/conformer/embedder name replacements, softplus repacker for per_dim_scale, F32 enforcement for conv weights - GGML backend: Conv1DDW and PadExt tensor ops - Audio encoder: SSCP Conv2D, 12 conformer blocks (FFW + block-local attention with relative position embeddings + LightConv1d + FFW), output projection, audio-to-text embedding projector - Audio preprocessing: WAV decode, mel spectrogram, FFT (pure Go) - Model wiring: WAV detection, audio token handling, unified PostTokenize Correctly transcribes "why is the sky blue" from test audio. * integration: add gemma4 audio tests including OpenAI API coverage Test audio transcription and response via the Ollama native API, plus two new tests exercising the OpenAI-compatible endpoints: - /v1/audio/transcriptions (multipart form upload) - /v1/chat/completions with input_audio content type All tests use capability checks and skip models without audio support. * gemma4: add OpenAI audio API support and capability detection - Add CapabilityAudio and detect from audio.block_count in GGUF - Add /v1/audio/transcriptions endpoint with TranscriptionMiddleware - Add input_audio content type support in /v1/chat/completions - Add TranscriptionRequest/Response types in openai package * gemma4: add audio input support for run command - /audio toggle in interactive mode for voice chat - Platform-specific microphone recording (AVFoundation on macOS, PulseAudio/ALSA on Linux, WASAPI on Windows) - Space to start/stop recording, automatic chunking for long audio * gemma4: add transcribe command (ollama transcribe MODEL) - Interactive mode with readline prompt and slash commands - Non-interactive mode for piped audio or record-until-Ctrl+C - Chunked streaming transcription for long recordings - Word-wrapped output matching run command style * gemma4: add parser, renderer, and integration test plumbing * gemma4: fix renderer to emit BOS token * gemma4: add OpenAI audio transcription API and input_audio support * gemma4: update converter for new weight drop naming * gemma4: add per_expert_scale to MoE router and fix moe_intermediate_size config * gemma4: rewrite renderer to match HF Jinja2 template exactly Fix 8 bugs found by building 55 reference tests verified against the HF Jinja2 chat template (VERIFY_JINJA2=1 shells out to Python): - Tool responses use separate <\|turn>tool turns (not inline tags) - Tool calls emitted before content in assistant messages - Thinking content stripped from assistant history (strip_thinking) - User, tool, and system content trimmed (template does \| trim) - Empty system message still emits system turn (check role, not content) - Nested object properties rendered recursively with required field - Array items specification rendered for array-type properties - OBJECT/ARRAY type-specific rendering comma logic matches template Also adds Required field to api.ToolProperty for nested object schemas, replaces old gemma4_test.go with comprehensive gemma4_reference_test.go, and commits the Jinja2 template as testdata for verification. * gemma4: fix MoE fused gate_up split and multiline tool-call arg parsing - Text MoE: split `ffn_gate_up_exps` into contiguous `[gate\|up]` halves instead of stride-2 slices. - Parser: escape control characters in `<\|"\|>...<\|"\|>` string literals when converting tool-call args to JSON. - Fixes warnings like `invalid character '\n' in string literal` for multiline tool arguments. - Add Gemma4 parser regressions for multiline tool-call args and `gemma4ArgsToJSON`. * cmd: simplify audio input to dropped file attachments * gemma4: use full SWA memory for better cache reuse * gemma4: initialize clamps after backend load * convert: align gemma4 audio tensor renames with llama.cpp * Remove redundant comments in gemma4 vision model * Format Gemma4 MoE block field alignment * use 4096 kvcache.NewSWAMemCache * convert: support new Gemma4 audio_tower tensor naming (#15221) Co-authored-by: jmorganca <jmorganca@gmail.com> * fix integration test defaults for audio * review comments and lint fixes * remove unused audio/video files --------- Co-authored-by: jmorganca <jmorganca@gmail.com>	2026-04-02 11:33:33 -07:00
Daniel Hiltgen	c9b5da6b0c	integration: improve ability to test individual models (#14948 ) * integration: improve ability to test individual models Add OLLAMA_TEST_MODEL env var to run integration tests against a single model. Enhance vision tests: multi-turn chat with cached image tokens, object counting, spatial reasoning, detail recognition, scene understanding, OCR, and multi-image comparison. Add tool calling stress tests with complex agent-style prompts, large system messages, and multi-turn tool response handling. * review comments	2026-03-24 14:28:23 -07:00
Jesse Gross	c61023f554	ollamarunner: Fix off by one error with numPredict When numPredict is set, the user will receive one less token than the requested limit. In addition, the stats will incorrectly show the number of tokens returned as the limit. In cases where numPredict is not set, the number of tokens is reported correctly. This occurs because numPredict is checked when setting up the next batch but hitting the limit will terminate the current batch as well. Instead, is is better to check the limit as we actually predict them.	2026-02-04 17:14:24 -08:00
Jeffrey Morgan	b1fccabb34	Revert "Update vendored llama.cpp to b7847" (#14061 )	2026-02-03 18:39:36 -08:00
Jeffrey Morgan	ef00199fb4	Update vendor ggml code to a5bb8ba4 (#13832 ) Co-authored-by: Daniel Hiltgen <daniel@ollama.com> Co-authored-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>	2026-02-02 17:31:59 -08:00
Daniel Hiltgen	ae78112c50	test: add lfm2.5-thinking coverage (#13802 )	2026-01-20 12:57:02 -08:00
Jeffrey Morgan	31085d5e53	fix: use api.GenerateRequest for image generation test (#13793 ) Remove non-existent x/imagegen/api import and use the standard api.GenerateRequest/GenerateResponse with the Image field instead.	2026-01-20 03:23:31 -08:00
Daniel Hiltgen	c42e9d244f	test: add image gen test case (#13698 ) * test: fix type regression in tools test. * test: add image gen integration test	2026-01-19 16:01:31 -08:00
Gyungrai Wang	55d0b6e8b9	integration: fix tools_test.go for ToolCallFunctionArguments API change (#13731 )	2026-01-15 16:08:09 -08:00
Devon Rifkin	e51dead636	preserve tool definition and call JSON ordering (#13525 ) * preserve tool definition and call JSON ordering This is another iteration of <https://github.com/ollama/ollama/pull/12518>, but this time we've simplified things by relaxing the competing requirements of being compatible AND order-preserving with templates (vs. renderers). We maintain backwards compatibility at the cost of not guaranteeing order for templates. We plan on moving more and more models to renderers, which have been updated to use these new data types, and additionally we could add an opt-in way of templates getting an order-preserved list (e.g., via sibling template vars) * orderedmap_test: remove testify	2026-01-05 18:03:36 -08:00
nicole pardal	3475d915cb	embeddings: modified batch size (#13429 ) This PR detects embedding models and sets batch_size = context_size so the full input fits in a single batch. Previously, if batch size was smaller than the input, tokens could be split across batches and cause a SIGTRAP crash. This change ensures all tokens stay in one batch and prevents crashes. Fixes: #12938 #13054 Co-authored-by: Jesse Gross <jesse@ollama.com>	2025-12-11 15:36:31 -08:00
nicole pardal	e082d60a24	truncation: fixed runner truncation logic + removed server truncation (#12839 ) This PR consolidates all embedding prompt-length checking, truncation, and prompt token counting into the runner to ensure a single source of truth.	2025-12-08 11:20:28 -08:00
Daniel Hiltgen	18b5958d46	test: avoid ministral tools test on low vram (#13302 ) Avoid hitting test timeouts	2025-12-02 13:18:55 -08:00
Daniel Hiltgen	d771043e88	test: add ministral-3 (#13300 )	2025-12-02 09:52:16 -08:00
Parth Sareen	c114987523	logprob: add bytes to logprobs (#13068 )	2025-11-13 13:49:25 -08:00
Baptiste Jamin	59241c5bee	server: add logprobs and top_logprobs support to Ollama's API (#12899 ) Adds logprobs support to Ollama's API including support for Ollama's OpenAI-compatible API. By specifying the new 'logprobs' boolean parameter in the API, Ollama will return the log probabilities for each token generated. 'top_logprobs', an integer value can also be specified up to the value 20. When specified, the API will also provide the number of most likely tokens to return at each token position Co-authored-by: Baptiste Jamin <baptiste@crisp.chat>	2025-11-11 08:49:50 -08:00
nicole pardal	7dd4862a89	embeddings: removed redundant TestAPIEmbeddings test (#12863 ) This PR removes a redundant test from TestAPIEmbeddings Contents of this test already exists in embed_test.go and model_arch_test.go	2025-10-30 17:12:33 -07:00
Patrick Devine	76eb7d0fff	testing: test more models with tool calling (#12867 )	2025-10-30 13:19:21 -07:00
Daniel Hiltgen	c88647104d	int: harden server lifecycle (#12835 ) this should reduce zombies during integration runs	2025-10-29 11:50:56 -07:00
Patrick Devine	05aff4a4f1	tests: fix embeddinggemma integration test (#12830 )	2025-10-29 11:07:28 -07:00
Michael Yang	7d25b9e194	feat(model): add qwen3vl (#12665 )	2025-10-28 17:39:47 -07:00
Patrick Devine	36d64fb531	embed: add distance correlation test for library embed models (#12796 )	2025-10-28 16:57:27 -07:00
Patrick Devine	29f63f37c8	Revert "server: Consolidate embedding truncation in runner (#12730 )" (#12810 ) This reverts commit `5d347f6d6f`.	2025-10-28 14:49:14 -07:00
nicole pardal	5d347f6d6f	server: Consolidate embedding truncation in runner (#12730 ) Currently, checking the length of prompts for embeddings to ensure they fit in the context window (and possible truncation) occurs in two places - the Ollama server and runner. This can lead to inconsistencies in both the checks and reported number of tokens processed. Since we have to do this processing in the runner, this consolidates all of the logic there.	2025-10-27 11:59:12 -07:00
Jeffrey Morgan	5fe7ba1b9b	runner: always truncate embeddings requests (#12714 )	2025-10-20 16:47:05 -07:00
Daniel Hiltgen	68e04c7ff8	test: harden scheduler tests (#12662 ) * test: harden scheduler tests This removes reschedDelay which was stale code, and adds a new configurable timeout for the waitForVRAMRecovery so tests can now set the timeout to be very short to avoid the scheduler getting stuck and hitting a test timeout. * test: tune tests for partial loads Give stress tests more time when the model is split between CPU/GPU	2025-10-17 08:56:44 -07:00
Daniel Hiltgen	b531777a66	test: add a few missing embedding models (#12661 )	2025-10-16 09:36:25 -07:00
Daniel Hiltgen	4e5d862ec4	Integration test tuning (#12492 ) Remove some flaky scenarios, and switch to chat for better reliability	2025-10-08 09:51:25 -07:00
Daniel Hiltgen	c68f367ef6	Update GGML to b6646 (#12245 ) Notable EOLs with this change: - MacOS v12 and v13 are no longer supported (v14+ required) - AMD gfx900 and gfx906 are no longer supported	2025-10-02 14:47:10 -07:00
Daniel Hiltgen	bc8909fb38	Use runners for GPU discovery (#12090 ) This revamps how we discover GPUs in the system by leveraging the Ollama runner. This should eliminate inconsistency between our GPU discovery and the runners capabilities at runtime, particularly for cases where we try to filter out unsupported GPUs. Now the runner does that implicitly based on the actual device list. In some cases free VRAM reporting can be unreliable which can leaad to scheduling mistakes, so this also includes a patch to leverage more reliable VRAM reporting libraries if available. Automatic workarounds have been removed as only one GPU leveraged this, which is now documented. This GPU will soon fall off the support matrix with the next ROCm bump. Additional cleanup of the scheduler and discovery packages can be done in the future once we have switched on the new memory management code, and removed support for the llama runner.	2025-10-01 15:12:32 -07:00
Daniel Hiltgen	c23e6f4cae	tests: add single threaded history test (#12295 ) * tests: add single threaded history test Also tidies up some existing tests to handle more model output variation * test: add support for testing specific architectures	2025-09-22 11:23:14 -07:00
Michael Yang	ceac416ec2	fix(integration): check truncated length (#12337 )	2025-09-18 14:00:21 -07:00
Daniel Hiltgen	44a6792873	tests: tighten up a few flaky tests (#12271 ) Sometimes the context test results are pure emoji's Thanksgiving has too much variability, so swap for a more straight forward prompt.	2025-09-12 13:59:34 -07:00
Parth Sareen	20b53eaa72	tests: add tool calling integration test (#12232 )	2025-09-09 14:01:11 -07:00
Daniel Hiltgen	6745182885	tests: reduce stress on CPU to 2 models (#12161 ) * tests: reduce stress on CPU to 2 models This should avoid flakes due to systems getting overloaded with 3 (or more) models running concurrently * tests: allow slow systems to pass on timeout If a slow system is still streaming a response, and the response will pass validation, don't fail just because the system is slow. * test: unload embedding models more quickly	2025-09-09 09:32:15 -07:00
Jesse Gross	e119783e66	llm: Clamp batch size to context size The context must always be able to store the current batch, so if the user requests a small context then we should also shrink the batch to match. This also fixes the TestLongInputContext test on the new engine. (The old engine already has this behavior.)	2025-09-08 20:40:11 -07:00
Daniel Hiltgen	517807cdf2	perf: build graph for next batch async to keep GPU busy (#11863 ) * perf: build graph for next batch in parallel to keep GPU busy This refactors the main run loop of the ollama runner to perform the main GPU intensive tasks (Compute+Floats) in a go routine so we can prepare the next batch in parallel to reduce the amount of time the GPU stalls waiting for the next batch of work. * tests: tune integration tests for ollama engine This tunes the integration tests to focus more on models supported by the new engine.	2025-08-29 14:20:28 -07:00
Daniel Hiltgen	d6f7233a1c	test: improve scheduler/concurrency stress tests (#11906 ) * test: improve scheduler/concurrency stress tests The scheduler test used to use approximate memory figures and would often over or under shoot a systems capcity leading to flaky test results. This should improve the reliability of this scenario by leveraging ps output to determinie exactly how many models it takes to trigger thrashing. The concurrency test is also refined to target num_parallel + 1 and handle timeouts better. With these refinements, TestMultiModelConcurrency was redundant * test: add parallel generate with history TestGenerateWithHistory will help verify caching and context are properly handled while making requests * test: focus embed tests on embedding models remove non-embedding models from the embedding tests	2025-08-15 14:37:54 -07:00
Daniel Hiltgen	c385ca8672	test: add valid responses (#11902 ) some of the new models need a few more valid responses to pass	2025-08-14 11:07:13 -07:00
Daniel Hiltgen	a24f90604f	int: adjust a few models for integration tests (#11872 )	2025-08-13 15:42:36 -07:00
Daniel Hiltgen	114c3f2265	tests: add integration coverage for oss-gpt (#11696 ) Also wires up support to override the default "smol" model	2025-08-07 15:06:57 -07:00
Daniel Hiltgen	f8a6e88819	Only load supported models on new engine (#11362 ) * Only load supported models on new engine Verify the model is supported before trying to load * int: testcase for all library models	2025-07-11 12:21:54 -07:00
Daniel Hiltgen	4f473e224c	int: add performance integration tests (#11173 ) usage example: go test --tags=integration,perf -count 1 ./integration -v -timeout 1h -run TestModelsPerf 2>&1 \| tee int.log cat int.log \| grep MODEL_PERF_HEADER \| cut -f2- -d: > perf.csv cat int.log \| grep MODEL_PERF_DATA \| cut -f2- -d: >> perf.csv	2025-07-05 16:07:09 -07:00
Daniel Hiltgen	f2527b08fb	int: add coverage for older models (#11137 ) Verified these fail on 0.9.1 and pass on HEAD.	2025-06-19 12:10:19 -07:00
Daniel Hiltgen	2307fc2bcd	tests: drop llama3.2-vision embedding tests (#10837 )	2025-05-24 13:17:53 -07:00
Daniel Hiltgen	fdd4d479a3	integration: add qwen2.5-vl (#10815 ) Replace the older llava model with qwen2.5 for vision tests Skip split-batch test on small VRAM systems to avoid excessive test time	2025-05-22 09:12:32 -07:00
Daniel Hiltgen	424810450f	Move quantization to new backend (#10363 ) * Move quantization logic to GGML via new backend This moves the model aware logic to Go code and calls GGMLs quantization code for model creation. * Remove "add model quantizations" This is no longer needed now that quantization is implemented in Go+GGML code directly.	2025-05-06 11:20:48 -07:00
湛露先生	7e5c8eee5c	file close check and close. (#10554 ) Signed-off-by: zhanluxianshen <zhanluxianshen@163.com>	2025-05-04 15:37:59 -07:00

1 2

92 commits