Construct a new BPE encoding.
Can be overridden to change behavior of encode().
Can be overridden to change behavior of decode().
Can be overridden to change behavior of encode().
Decode tokens into a string.
May be lossy if the tokens decode to bytes that do not form a valid UTF-8 string.
Decode tokens into a byte array (may not be UTF-8).
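The difference between the string and byte-array decoders can be illustrated with the standard `TextDecoder` API. This is a minimal sketch of why string decoding may be lossy; it assumes (the source does not confirm this) that the string path behaves like a non-fatal UTF-8 decode. The byte values are made up for illustration.

```typescript
// "hi" followed by 0xE2 0x82, a truncated prefix of the 3-byte UTF-8 encoding of "€".
// A token boundary can fall in the middle of a multi-byte character like this.
const rawBytes = new Uint8Array([0x68, 0x69, 0xe2, 0x82]);

// A non-fatal decoder substitutes U+FFFD for malformed input,
// so the original bytes cannot be recovered from the resulting string.
const lossyText = new TextDecoder("utf-8").decode(rawBytes);

// A fatal decoder rejects the same input with a TypeError instead.
let fatalThrew = false;
try {
  new TextDecoder("utf-8", { fatal: true }).decode(rawBytes);
} catch {
  fatalThrew = true;
}
```

When exact bytes matter, for example when concatenating the output of several decode calls, working with the byte array and converting to a string once at the end avoids this loss.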
Encode a text string into tokens, optionally supporting special tokens.
Optional allowedSpecial: Set<string>
Encode text with all special tokens allowed.
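One common way to honor a set of allowed special tokens is to split the input on them before running BPE on the ordinary chunks. The sketch below shows that idea only; `splitOnSpecial` is a hypothetical helper, not part of this class, and treating non-allowed special tokens as ordinary text is an assumption.

```typescript
// Hypothetical helper: split text into ordinary chunks and allowed special tokens.
function splitOnSpecial(text: string, allowedSpecial: Set<string>): string[] {
  if (allowedSpecial.size === 0) return text === "" ? [] : [text];
  const escaped = Array.from(allowedSpecial)
    // Try longer tokens first so a token is not shadowed by one of its prefixes.
    .sort((a, b) => b.length - a.length)
    // Escape regex metacharacters so tokens such as "<|endoftext|>" match literally.
    .map((t) => t.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"));
  const pattern = new RegExp(`(${escaped.join("|")})`, "g");
  // split() with a capture group keeps the matched special tokens in the output.
  return text.split(pattern).filter((s) => s !== "");
}

const specialParts = splitOnSpecial(
  "Hello<|endoftext|>World",
  new Set(["<|endoftext|>"]),
);
// specialParts is ["Hello", "<|endoftext|>", "World"]
```

Each special chunk would then map directly to its token ID, while the remaining chunks go through the normal merge procedure.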
Retrieve a list of special tokens in this encoding.
Byte-pair encoding tokenizer, based on the Tiktoken library.
It handles special tokens and merges adjacent pairs in priority order (lowest rank first). This is enough to support LLMs, although some models such as CLIP need additional behavior (BOS/EOS tokens, padding, case-insensitivity, whitespace handling), which is implemented in subclasses.
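The merge rule described above (repeatedly merge the adjacent pair with the lowest rank) can be sketched as follows. This is an illustrative, unoptimized version, not the class's actual implementation; the rank table is made up, and single characters stand in for bytes.

```typescript
// Hypothetical rank table: a lower rank means a higher merge priority.
const demoRanks = new Map<string, number>([
  ["a", 0], ["b", 1], ["c", 2],
  ["ab", 3], ["bc", 4], ["abc", 5],
]);

function bpeMerge(piece: string, ranks: Map<string, number>): string[] {
  const parts = Array.from(piece); // start from individual symbols
  while (parts.length > 1) {
    // Find the adjacent pair whose concatenation has the lowest rank.
    let best = -1;
    let bestRank = Infinity;
    for (let i = 0; i < parts.length - 1; i++) {
      const rank = ranks.get(parts[i] + parts[i + 1]);
      if (rank !== undefined && rank < bestRank) {
        bestRank = rank;
        best = i; // ties resolve to the leftmost pair
      }
    }
    if (best === -1) break; // no mergeable pair remains
    parts.splice(best, 2, parts[best] + parts[best + 1]);
  }
  return parts;
}

const merged = bpeMerge("abcab", demoRanks);
// merged is ["abc", "ab"]
```

Each surviving part is then looked up in the rank table to obtain its token ID. The rescan on every iteration makes this quadratic in the piece length, which is usually acceptable because pieces are short after pre-tokenization.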
The internals of this class work with hex strings instead of Uint8Array, because string operations are better optimized in JavaScript engines.
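To make the hex-string representation concrete, here is a minimal sketch of converting between bytes and hex. The helper names `bytesToHex` and `hexToBytes` are hypothetical and are not taken from this class.

```typescript
// Hypothetical helper: encode each byte as two lowercase hex digits.
function bytesToHex(bytes: Uint8Array): string {
  let hex = "";
  for (let i = 0; i < bytes.length; i++) {
    hex += bytes[i].toString(16).padStart(2, "0");
  }
  return hex;
}

// Hypothetical helper: inverse of bytesToHex (expects an even-length hex string).
function hexToBytes(hex: string): Uint8Array {
  const out = new Uint8Array(hex.length / 2);
  for (let i = 0; i < out.length; i++) {
    out[i] = parseInt(hex.slice(2 * i, 2 * i + 2), 16);
  }
  return out;
}

// "h" is 0x68 and "é" (U+00E9) is 0xC3 0xA9 in UTF-8.
const hexPiece = bytesToHex(new TextEncoder().encode("h\u00e9"));
// hexPiece is "68c3a9"
const roundTrip = new TextDecoder().decode(hexToBytes(hexPiece));
```

With this representation, merging two adjacent byte sequences is plain string concatenation (`"68" + "c3a9"`), and a merged sequence can be used directly as a `Map` key, which a `Uint8Array` cannot.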