jax-js

    SentencePiece Unigram tokenizer.

    This implements the Viterbi-based unigram language model tokenization algorithm used by SentencePiece. It finds the most likely segmentation of input text based on learned piece scores (log probabilities).

    Uses a trie for efficient O(n * maxPieceLen) vocabulary lookup.
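The trie lookup mentioned above could be sketched as follows. This is illustrative only, not the actual jax-js internals: `TrieNode`, `insert`, and `matchesAt` are assumed names, but the structure shows why enumerating all vocabulary pieces that start at a position costs at most O(maxPieceLen) per position.

```typescript
// Character trie for enumerating every vocabulary piece that matches a
// prefix of the input at a given position (illustrative sketch).
type TrieNode = {
  children: Map<string, TrieNode>;
  pieceId: number | null; // vocabulary id if a piece ends at this node
};

function makeNode(): TrieNode {
  return { children: new Map(), pieceId: null };
}

function insert(root: TrieNode, piece: string, id: number): void {
  let node = root;
  for (const ch of piece) {
    let next = node.children.get(ch);
    if (!next) {
      next = makeNode();
      node.children.set(ch, next);
    }
    node = next;
  }
  node.pieceId = id;
}

// Walk the trie from `start`, returning [pieceId, length] for every piece
// that matches a prefix of text.slice(start). The walk stops as soon as no
// trie edge matches, so it does at most maxPieceLen steps.
function matchesAt(
  root: TrieNode,
  text: string,
  start: number,
): Array<[number, number]> {
  const out: Array<[number, number]> = [];
  let node = root;
  for (let i = start; i < text.length; i++) {
    const next = node.children.get(text[i]);
    if (!next) break;
    node = next;
    if (node.pieceId !== null) out.push([node.pieceId, i - start + 1]);
  }
  return out;
}
```

For example, with pieces `"a"`, `"ab"`, and `"b"` inserted, `matchesAt(root, "abc", 0)` yields the matches for `"a"` and `"ab"` and then stops at `"c"`.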

    Constructors

    Accessors

    • get bosToken(): number

      Get the beginning-of-sequence token ID.

      Returns number

    • get eosToken(): number

      Get the end-of-sequence token ID.

      Returns number

    • get unkToken(): number

      Get the unknown token ID.

      Returns number

    • get vocabSize(): number

      Get vocabulary size.

      Returns number

    Methods

    • Decode token IDs back to text.

      Parameters

      • tokens: number[]

      Returns string
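As a sketch of what this decoding step typically involves for a SentencePiece model (the `decodeTokens` function and the `pieces` array are illustrative assumptions, not the actual jax-js API): each id is mapped back to its piece string, and SentencePiece's "▁" (U+2581) word-boundary marker is turned back into a space.

```typescript
// Illustrative decode: join piece strings for each id and undo the
// SentencePiece "▁" word-boundary marker (assumed helper, not jax-js API).
function decodeTokens(tokens: number[], pieces: string[]): string {
  return tokens
    .map((id) => pieces[id] ?? "")
    .join("")
    .replace(/\u2581/g, " ")
    .trimStart(); // drop the leading space from the first "▁"
}
```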

• Encode text into token IDs using the Viterbi algorithm.

      Finds the most likely segmentation by computing the best path through all possible segmentations, where scores are log probabilities.

      Parameters

      • text: string

      Returns number[]
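The Viterbi pass described above can be sketched as a dynamic program over split points. Everything here is an illustrative assumption rather than the actual jax-js implementation: `viterbiEncode`, the `Map`-based vocab of `piece -> [tokenId, logProb]`, and the flat `unkPenalty` for single-character unknown-token fallback.

```typescript
// Sketch of Viterbi unigram segmentation (assumed names, not jax-js API).
// Scores are log probabilities, so the best path maximizes their sum.
function viterbiEncode(
  text: string,
  vocab: Map<string, [number, number]>,
  unkId: number,
  maxPieceLen: number,
  unkPenalty = -20, // assumed flat <unk> score
): number[] {
  const n = text.length;
  // best[i] = highest total log-probability over segmentations of text[0..i)
  const best = new Array<number>(n + 1).fill(-Infinity);
  // back[i] = [start of last piece, its token id] on the best path to i
  const back: Array<[number, number] | null> = new Array(n + 1).fill(null);
  best[0] = 0;
  for (let i = 0; i < n; i++) {
    // Relax every vocabulary piece that matches at position i.
    for (let len = 1; len <= maxPieceLen && i + len <= n; len++) {
      const entry = vocab.get(text.slice(i, i + len));
      if (!entry) continue;
      const [id, score] = entry;
      if (best[i] + score > best[i + len]) {
        best[i + len] = best[i] + score;
        back[i + len] = [i, id];
      }
    }
    // Single-character unknown-token fallback so every input is segmentable.
    if (best[i] + unkPenalty > best[i + 1]) {
      best[i + 1] = best[i] + unkPenalty;
      back[i + 1] = [i, unkId];
    }
  }
  // Backtrack from the end to recover the best token sequence.
  const out: number[] = [];
  for (let pos = n; pos > 0; pos = back[pos]![0]) out.push(back[pos]![1]);
  return out.reverse();
}
```

With a vocab where `"ab"` scores better than `"a"` plus `"b"`, the single piece wins; an input with no matching pieces falls back to the unknown-token id.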