mirror of
https://github.com/ollama/ollama
synced 2026-04-23 08:45:14 +00:00
* tokenizer: add byte fallback for SentencePiece BPE encoding When BPE merging produces tokens not in the vocabulary, fall back to encoding each UTF-8 byte as <0xHH> byte tokens instead of silently dropping the character. Also teach Decode to convert <0xHH> tokens back to raw bytes. Fixes #15229, fixes #15231 * tokenizer fixes |
||
|---|---|---|
| .. | ||
| testdata | ||
| bytepairencoding.go | ||
| bytepairencoding_test.go | ||
| sentencepiece.go | ||
| sentencepiece_test.go | ||
| tokenizer.go | ||
| vocabulary.go | ||
| vocabulary_test.go | ||
| wordpiece.go | ||
| wordpiece_test.go | ||