Get the beginning-of-sequence token ID.
Get the end-of-sequence token ID.
Get the unknown token ID.
Get vocabulary size.
Decode token IDs back to text.
Encode text into token IDs using the Viterbi algorithm.
Finds the most likely segmentation by computing the best-scoring path through the lattice of all possible segmentations, where scores are log probabilities.
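A minimal sketch of that dynamic program, for illustration only: `pieces`, `unkScore`, the piece-length cap, and the string-based return type are all assumptions here, not the actual API, which works with integer token IDs.

```swift
/// Illustrative Viterbi segmentation sketch. `pieces` maps each
/// vocabulary piece to its log probability; `unkScore` is an assumed
/// fallback for single characters not in the vocabulary.
func viterbiEncode(_ text: String,
                   pieces: [String: Double],
                   unkScore: Double = -20.0,
                   maxPieceLen: Int = 16) -> [String] {
    let chars = Array(text)
    let n = chars.count
    if n == 0 { return [] }

    // bestScore[i]: best log probability over segmentations of chars[0..<i].
    var bestScore = [Double](repeating: -.infinity, count: n + 1)
    // backPointer[i]: start index of the piece ending at i on the best path.
    var backPointer = [Int](repeating: 0, count: n + 1)
    bestScore[0] = 0

    for end in 1...n {
        for len in 1...min(end, maxPieceLen) {
            let start = end - len
            let candidate = String(chars[start..<end])
            let score: Double
            if let s = pieces[candidate] {
                score = s
            } else if len == 1 {
                score = unkScore   // fall back for unknown single characters
            } else {
                continue
            }
            if bestScore[start] + score > bestScore[end] {
                bestScore[end] = bestScore[start] + score
                backPointer[end] = start
            }
        }
    }

    // Follow back pointers from the end to recover the best segmentation.
    var result: [String] = []
    var end = n
    while end > 0 {
        let start = backPointer[end]
        result.append(String(chars[start..<end]))
        end = start
    }
    return Array(result.reversed())
}
```

A real implementation replaces the dictionary probe with trie prefix matches (see the trie sketch below), so each position is scanned once rather than once per candidate length.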
Static method `from`.
SentencePiece Unigram tokenizer.
This implements the Viterbi-based unigram language model tokenization algorithm used by SentencePiece. It finds the most likely segmentation of input text based on learned piece scores (log probabilities).
Uses a trie for efficient vocabulary lookup, so encoding runs in O(n * maxPieceLen), where n is the input length and maxPieceLen is the length of the longest piece in the vocabulary.
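A sketch of such a trie, with illustrative names (`TrieNode`, `commonPrefixMatches`): at each input position, one downward walk enumerates every vocabulary piece that matches a prefix of the remaining input in at most maxPieceLen steps, which is what bounds the Viterbi loop above.

```swift
/// Minimal character trie for common-prefix lookup (names are assumed).
final class TrieNode {
    var children: [Character: TrieNode] = [:]
    var pieceId: Int?   // set when the path from the root spells a full piece
}

final class Trie {
    private let root = TrieNode()

    func insert(_ piece: String, id: Int) {
        var node = root
        for ch in piece {
            if node.children[ch] == nil { node.children[ch] = TrieNode() }
            node = node.children[ch]!
        }
        node.pieceId = id
    }

    /// Returns (length, pieceId) for every vocabulary piece that matches a
    /// prefix of chars[start...]; the Viterbi loop consumes these directly.
    func commonPrefixMatches(_ chars: [Character],
                             from start: Int) -> [(length: Int, id: Int)] {
        var matches: [(length: Int, id: Int)] = []
        var node = root
        var i = start
        while i < chars.count, let next = node.children[chars[i]] {
            node = next
            i += 1
            if let id = node.pieceId { matches.append((i - start, id)) }
        }
        return matches
    }
}
```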