ollama/tokenizer
Daniel Hiltgen cb0033598e
tokenizer: add SentencePiece-style BPE support (#15162)
* tokenizer: add SentencePiece-style BPE support

Add a WithSentencePieceNormalizer option to BytePairEncoding for models
that use BPE with SentencePiece-style space markers (spaces are mapped
to U+2581 and back).

NewBytePairEncoding is unchanged; the new NewBytePairEncodingWithOptions
constructor accepts BPEOption functions. Decoding handles the reverse
mapping of U+2581 back to spaces.

* review comments
2026-03-31 17:00:36 -07:00
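
The commit message above describes an option-based constructor and a space/U+2581 mapping. A minimal sketch of that idea follows, assuming a functional-options design. The names `sketchBPE`, `sketchOption`, `withSentencePieceNormalizer`, and `newSketchBPE` are hypothetical stand-ins; the real signatures of `NewBytePairEncodingWithOptions`, `BPEOption`, and `WithSentencePieceNormalizer` are not shown in this listing and may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// spaceMarker is U+2581 (LOWER ONE EIGHTH BLOCK), the character
// SentencePiece-style vocabularies use in place of a space.
const spaceMarker = "\u2581"

// sketchBPE is a hypothetical stand-in for a BPE tokenizer type.
// It only tracks whether the space-marker normalizer is enabled.
type sketchBPE struct {
	spNormalize bool
}

// sketchOption is a hypothetical functional option, in the spirit of
// the BPEOption functions mentioned in the commit message.
type sketchOption func(*sketchBPE)

// withSentencePieceNormalizer enables the space <-> U+2581 mapping.
func withSentencePieceNormalizer() sketchOption {
	return func(b *sketchBPE) { b.spNormalize = true }
}

// newSketchBPE applies each option in order; with no options the
// tokenizer behaves as before (no normalization).
func newSketchBPE(opts ...sketchOption) *sketchBPE {
	b := &sketchBPE{}
	for _, opt := range opts {
		opt(b)
	}
	return b
}

// normalize replaces spaces with U+2581 before encoding.
func (b *sketchBPE) normalize(s string) string {
	if !b.spNormalize {
		return s
	}
	return strings.ReplaceAll(s, " ", spaceMarker)
}

// denormalize maps U+2581 back to spaces after decoding.
func (b *sketchBPE) denormalize(s string) string {
	if !b.spNormalize {
		return s
	}
	return strings.ReplaceAll(s, spaceMarker, " ")
}

func main() {
	bpe := newSketchBPE(withSentencePieceNormalizer())
	marked := bpe.normalize("hello world")
	fmt.Println(marked)                  // hello▁world
	fmt.Println(bpe.denormalize(marked)) // hello world
}
```

Keeping the original constructor untouched and adding a variadic-options constructor beside it lets existing callers compile unchanged, which matches the commit's note that NewBytePairEncoding is unchanged.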
testdata move tokenizers to separate package (#13825) 2026-02-05 17:44:11 -08:00
bytepairencoding.go tokenizer: add SentencePiece-style BPE support (#15162) 2026-03-31 17:00:36 -07:00
bytepairencoding_test.go tokenizer: add SentencePiece-style BPE support (#15162) 2026-03-31 17:00:36 -07:00
sentencepiece.go move tokenizers to separate package (#13825) 2026-02-05 17:44:11 -08:00
sentencepiece_test.go move tokenizers to separate package (#13825) 2026-02-05 17:44:11 -08:00
tokenizer.go move tokenizers to separate package (#13825) 2026-02-05 17:44:11 -08:00
vocabulary.go move tokenizers to separate package (#13825) 2026-02-05 17:44:11 -08:00
vocabulary_test.go move tokenizers to separate package (#13825) 2026-02-05 17:44:11 -08:00
wordpiece.go move tokenizers to separate package (#13825) 2026-02-05 17:44:11 -08:00
wordpiece_test.go move tokenizers to separate package (#13825) 2026-02-05 17:44:11 -08:00