
runner

Note: this is a work in progress

A minimal runner for loading a model and running inference via an HTTP web server.

./runner -model <model binary>

Completion

curl -X POST -H "Content-Type: application/json" -d '{"prompt": "hi"}' http://localhost:8080/completion

Embeddings

curl -X POST -H "Content-Type: application/json" -d '{"prompt": "turn me into an embedding"}' http://localhost:8080/embedding