ollama/model
Jeffrey Morgan 1044b0419a
model: add MLA absorption for glm4moelite (#13810)
* model: add MLA absorption for glm4moelite

Split the combined KV_B tensor into separate K_B and V_B tensors
during conversion, enabling MLA (Multi-head Latent Attention)
absorption, which compresses the KV cache for improved efficiency.
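The split described above can be sketched as follows. This is an illustrative sketch only, not the converter's actual code: the dimension names (`qkNopeDim`, `vHeadDim`, `loraRank`) and values are assumptions standing in for the model's real per-head key dim, value dim, and latent KV rank, and the row-major layout is a simplifying assumption.

```go
package main

import "fmt"

// Hypothetical dimensions for illustration only (not GLM-4.7-flash's real config).
const (
	nHead     = 4 // number of attention heads
	qkNopeDim = 3 // per-head non-RoPE key dim (assumed name)
	vHeadDim  = 2 // per-head value dim (assumed name)
	loraRank  = 5 // latent (compressed) KV rank (assumed name)
)

// splitKVB splits a combined KV_B tensor, stored row-major as
// [nHead * (qkNopeDim + vHeadDim), loraRank], into separate
// K_B ([nHead * qkNopeDim, loraRank]) and V_B ([nHead * vHeadDim, loraRank])
// by peeling apart each head's block of rows.
func splitKVB(kvB []float32) (kB, vB []float32) {
	perHead := qkNopeDim + vHeadDim
	kB = make([]float32, 0, nHead*qkNopeDim*loraRank)
	vB = make([]float32, 0, nHead*vHeadDim*loraRank)
	for h := 0; h < nHead; h++ {
		base := h * perHead * loraRank
		// The first qkNopeDim rows of each head's block belong to K_B...
		kB = append(kB, kvB[base:base+qkNopeDim*loraRank]...)
		// ...and the remaining vHeadDim rows belong to V_B.
		vB = append(vB, kvB[base+qkNopeDim*loraRank:base+perHead*loraRank]...)
	}
	return kB, vB
}

func main() {
	kvB := make([]float32, nHead*(qkNopeDim+vHeadDim)*loraRank)
	for i := range kvB {
		kvB[i] = float32(i)
	}
	kB, vB := splitKVB(kvB)
	fmt.Println(len(kB), len(vB)) // 60 40
}
```

Splitting at conversion time (rather than at load time) means the GGUF file already carries K_B and V_B as distinct tensors, so the runtime can apply the absorbed-attention formulation directly.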

* ggml: enable MLA flash attention for GLM-4.7-flash

Add support for gqa_ratio 4 in MLA flash attention kernels. GLM-4.7-flash
uses head size 576 with gqa_ratio 4; previously, head size 576 was only
supported with gqa_ratio 16 (DeepSeek).
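To make the constraint concrete: gqa_ratio is the number of query heads sharing one KV head, and kernel dispatch is keyed on the (K head size, V head size, gqa_ratio) combination. A minimal sketch, assuming an illustrative lookup table (the entries mirror the 576/512 head sizes and ratios named in this change, but the table itself is not ggml's actual dispatch code):

```go
package main

import "fmt"

// gqaRatio is the number of query heads sharing one KV head.
func gqaRatio(nHead, nHeadKV int) int { return nHead / nHeadKV }

// supported is an illustrative stand-in for kernel dispatch keyed on
// (K head size, V head size, gqa_ratio); it is not ggml's real tile table.
var supported = map[[3]int]bool{
	{576, 512, 16}: true, // DeepSeek: previously supported
	{576, 512, 4}:  true, // GLM-4.7-flash: added by this change
}

func main() {
	// A hypothetical config with 32 query heads and 8 KV heads gives ratio 4.
	fmt.Println(supported[[3]int{576, 512, gqaRatio(32, 8)}]) // true
}
```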

Metal changes:
- Enable head size 576 for flash attention
- Increase simdgroups to 8 for large heads (>=512)
- Add case 8 kernel dispatch for 8 simdgroups

CUDA changes:
- Add gqa_ratio 4 support for head 576/512
- Add tile configs for (576, 512, 4) and (576, 512, 8)
- Add MMA config cases for ncols 4
- Add template instances for ncols2=4

* model: add compatibility validation for glm4moelite architecture
2026-01-23 14:47:42 -08:00
| Name | Last commit | Date |
| --- | --- | --- |
| imageproc | deepseekocr | 2025-11-18 16:11:37 -08:00 |
| input | batch: use tensors for outputs (#12185) | 2025-09-15 14:33:06 -07:00 |
| models | model: add MLA absorption for glm4moelite (#13810) | 2026-01-23 14:47:42 -08:00 |
| parsers | model: add lfm2 architecture and LFM2.5-1.2B-Thinking support (#13792) | 2026-01-20 12:20:53 -08:00 |
| renderers | model: add lfm2 architecture and LFM2.5-1.2B-Thinking support (#13792) | 2026-01-20 12:20:53 -08:00 |
| testdata | gemma2 impl | 2025-03-11 14:35:08 -07:00 |
| bytepairencoding.go | remove unnecessary code (#13502) | 2025-12-16 15:11:26 -08:00 |
| bytepairencoding_test.go | refactor: using testing.B.Loop | 2025-10-10 13:25:29 -07:00 |
| model.go | model: add MLA absorption for glm4moelite (#13810) | 2026-01-23 14:47:42 -08:00 |
| model_test.go | fix: leaf alt name (#12390) | 2025-09-23 17:50:53 -07:00 |
| sentencepiece.go | fix(tokenizer): add special tokens to empty inputs (#13091) | 2025-11-18 11:16:56 -08:00 |
| sentencepiece_test.go | model: implement bert in ollama engine (#9080) | 2025-09-15 15:35:59 -07:00 |
| textprocessor.go | model: handle multiple eos tokens (#10577) | 2025-05-16 13:40:23 -07:00 |
| vocabulary.go | fix(tokenizer): add special tokens to empty inputs (#13091) | 2025-11-18 11:16:56 -08:00 |
| vocabulary_test.go | fix(tokenizer): add special tokens to empty inputs (#13091) | 2025-11-18 11:16:56 -08:00 |
| wordpiece.go | nomic-embed-text model implementation (#13071) | 2025-11-18 18:28:10 -08:00 |
| wordpiece_test.go | nomic-embed-text model implementation (#13071) | 2025-11-18 18:28:10 -08:00 |