mirror of https://github.com/ollama/ollama synced 2026-04-23 08:45:14 +00:00


easonysliu 810d4f9c22 runner: fix swallowed error in allocModel graph reservation In allocModel(), the first call to reserveWorstCaseGraph(true) had its error silently discarded — `return nil` was used instead of `return err`. This meant that if the prompt-sized graph reservation failed (e.g. due to insufficient memory), the error was swallowed, allocModel reported success, and the model appeared to load correctly. Subsequent inference would then fail in unexpected ways because the worst-case graph was never properly reserved. Fix: return the actual error so the caller can handle the failure (retry with reduced parallelism, report OOM, etc.). Co-Authored-By: Claude (claude-opus-4-6) <noreply@anthropic.com>		2026-03-16 15:48:45 -07:00
common	server: add logprobs and top_logprobs support to Ollama's API (#12899)	2025-11-11 08:49:50 -08:00
llamarunner	flash attn: add auto mode for llama engine (#13052)	2025-12-12 13:27:19 -08:00
ollamarunner	runner: fix swallowed error in allocModel graph reservation	2026-03-16 15:48:45 -07:00
README.md	Runner for Ollama engine	2025-02-13 17:09:26 -08:00
runner.go	Add MLX runner with GLM4-MoE-Lite model support (#14185)	2026-02-10 14:57:57 -08:00

`runner`

Note: this is a work in progress

A minimal runner for loading a model and running inference via an HTTP web server.

./runner -model <model binary>

curl -X POST -H "Content-Type: application/json" -d '{"prompt": "hi"}' http://localhost:8080/completion

curl -X POST -H "Content-Type: application/json" -d '{"prompt": "turn me into an embedding"}' http://localhost:8080/embedding
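
The two curl commands above differ only in the path and the prompt text, so the same request can be built from Go. The following is a minimal sketch, not part of the runner itself: `completionRequest` and `newRunnerRequest` are hypothetical names introduced for illustration, and only the `prompt` field, the `/completion` and `/embedding` paths, and the `localhost:8080` address are taken from the commands above. The response format is not described here, so the sketch stops at building the request.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// completionRequest mirrors the JSON body used in the curl examples.
// Only "prompt" appears in the README; any other fields are unknown here.
type completionRequest struct {
	Prompt string `json:"prompt"`
}

// newRunnerRequest is a hypothetical helper that builds the same POST
// request as the curl examples: a JSON body sent to baseURL+path.
func newRunnerRequest(baseURL, path, prompt string) (*http.Request, error) {
	body, err := json.Marshal(completionRequest{Prompt: prompt})
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, baseURL+path, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	req, err := newRunnerRequest("http://localhost:8080", "/completion", "hi")
	if err != nil {
		fmt.Println("build request:", err)
		return
	}
	// Print what would be sent. Sending it requires a running runner:
	//   resp, err := http.DefaultClient.Do(req)
	fmt.Println(req.Method, req.URL.String())
}
```

Passing `"/embedding"` as the path with a different prompt gives the equivalent of the second curl command.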