Mirror of https://github.com/ollama/ollama (synced 2026-04-26 10:14:33 +00:00)
Use the original key dimension (qkNopeHeadDim + qkRopeHeadDim = 256) for the attention scale instead of the MLA absorbed dimension (kvLoraRank + qkRopeHeadDim = 576). MLA absorption is a mathematically equivalent reorganization of the attention computation, so it should not change the effective attention scale. The scale should match training, which uses 1/sqrt(256). This improves tool calling and reduces model looping issues.
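Below is a minimal Go sketch of the scale change, assuming per-head dimensions of qkNopeHeadDim = 192, qkRopeHeadDim = 64, and kvLoraRank = 512 (values chosen to reproduce the 256 and 576 sums quoted above); the struct and function names are illustrative, not the actual deepseek2 implementation.

```go
package main

import (
	"fmt"
	"math"
)

// mlaOptions holds the per-head dimensions referenced in the commit message.
// Field names follow the message; concrete values below are assumptions.
type mlaOptions struct {
	qkNopeHeadDim int // non-RoPE portion of the query/key head
	qkRopeHeadDim int // RoPE portion of the query/key head
	kvLoraRank    int // low-rank latent dimension used by MLA absorption
}

// attnScale returns 1/sqrt(d), where d is the original per-head key
// dimension. Using the absorbed dimension (kvLoraRank + qkRopeHeadDim)
// instead would not match the scale the model was trained with.
func attnScale(o mlaOptions) float64 {
	d := o.qkNopeHeadDim + o.qkRopeHeadDim
	return 1.0 / math.Sqrt(float64(d))
}

func main() {
	o := mlaOptions{qkNopeHeadDim: 192, qkRopeHeadDim: 64, kvLoraRank: 512}

	fmt.Printf("training scale: 1/sqrt(%d) = %.6f\n",
		o.qkNopeHeadDim+o.qkRopeHeadDim, attnScale(o))

	absorbed := o.kvLoraRank + o.qkRopeHeadDim
	fmt.Printf("absorbed scale: 1/sqrt(%d) = %.6f (does not match training)\n",
		absorbed, 1.0/math.Sqrt(float64(absorbed)))
}
```

Running this prints 1/sqrt(256) = 0.0625 versus 1/sqrt(576) ≈ 0.041667, illustrating how large the mismatch in attention scaling would be if the absorbed dimension were used.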
| Name |
|---|
| bert |
| deepseek2 |
| deepseekocr |
| gemma2 |
| gemma3 |
| gemma3n |
| glm4moelite |
| gptoss |
| lfm2 |
| llama |
| llama4 |
| mistral3 |
| mllama |
| nomicbert |
| olmo3 |
| qwen2 |
| qwen3 |
| qwen3vl |
| qwen25vl |
| models.go |