jax-js
    Preparing search index...

    Byte-pair encoding tokenizer, based on the Tiktoken library.

    This handles special tokens and correctly merges adjacent pairs in order of priority (lowest ranks first). This is enough to support LLMs, although some models like CLIP have particular behavior (BOS/EOS, padding, case-insensitivity, whitespace) implemented in subclasses.

    The internals of this class work in hex strings instead of Uint8Array because strings are more optimized in JavaScript.

    Constructors

    • Construct a new BPE encoding.

      Parameters

      • encoder: Map<string, number>
      • specialTokens: Record<string, number>
      • regex: RegExp

      Returns BpeEncoding

    Properties

    decoder: Map<number, string>
    encoder: Map<string, number>
    regex: RegExp
    specialRegex: RegExp
    specialTokensDecoder: Map<number, string>
    specialTokensEncoder: Map<string, number>

    Methods

    • Can be overridden to change behavior of encode().

      Parameters

      • tokens: number[]

      Returns number[]

    • Can be overridden to change behavior of decode().

      Parameters

      • tokens: number[]

      Returns number[]

    • Can be overridden to change behavior of encode().

      Parameters

      • text: string

      Returns string

    • Decode tokens into a string.

      May be lossy if the tokens output bytes that don't correspond to a valid UTF-8 string.

      Parameters

      • tokens: number[]

      Returns string

    • Decode tokens into a byte array (may not be UTF-8).

      Parameters

      • tokens: number[]

      Returns Uint8Array

    • Encode a text string into tokens, optionally supporting special tokens.

      Parameters

      • text: string
      • OptionalallowedSpecial: Set<string>

      Returns number[]

    • Encode text with all special tokens allowed.

      Parameters

      • text: string

      Returns number[]

    • Retrieve a list of special tokens in this encoding.

      Returns Set<string>