mindformers.models.LlamaTokenizerFast

class mindformers.models.LlamaTokenizerFast(vocab_file=None, tokenizer_file=None, clean_up_tokenization_spaces=False, unk_token='<unk>', bos_token='<s>', eos_token='</s>', add_bos_token=True, add_eos_token=False, use_default_system_prompt=False, **kwargs)[source]

Construct a Llama tokenizer. Based on byte-level Byte-Pair-Encoding.

Notably, it uses ByteFallback and performs no normalization.

Note

Currently, LlamaTokenizerFast supports only the 'right' padding mode (padding_side = "right").

Note

If you want to change the bos_token or the eos_token, make sure to specify them when initializing the model, or call tokenizer.update_post_processor() to make sure that the post-processing is correctly done (otherwise the values of the first token and final token of an encoded sequence will not be correct).

Parameters
  • vocab_file (str, optional) – SentencePiece file (generally has a .model extension) that contains the vocabulary necessary to instantiate a tokenizer. Default: None .

  • tokenizer_file (str, optional) – Tokenizers file (generally has a .json extension) that contains everything needed to load the tokenizer. Default: None .

  • clean_up_tokenization_spaces (bool, optional) – Whether to clean up spaces after decoding; cleanup consists of removing potential artifacts such as extra spaces. Default: False .

  • unk_token (Union[str, tokenizers.AddedToken], optional) – The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead. Default: "<unk>" .

  • bos_token (Union[str, tokenizers.AddedToken], optional) – The beginning of sequence token that was used during pretraining. Can be used as a sequence classifier token. Default: "<s>" .

  • eos_token (Union[str, tokenizers.AddedToken], optional) – The end of sequence token. Default: "</s>" .

  • add_bos_token (bool, optional) – Whether to add a bos_token at the start of sequences. Default: True .

  • add_eos_token (bool, optional) – Whether to add an eos_token at the end of sequences. Default: False .

  • use_default_system_prompt (bool, optional) – Whether the default system prompt for Llama should be used. Default: False .

Returns

LlamaTokenizerFast, a LlamaTokenizerFast instance.

Examples

>>> from mindformers.models import LlamaTokenizerFast
>>>
>>> tokenizer = LlamaTokenizerFast(vocab_file="./llama2/tokenizer.model")
>>> tokenizer.encode("Hello this is a test")
[1, 15043, 445, 338, 263, 1243]
build_inputs_with_special_tokens(token_ids_0, token_ids_1=None)[source]

Insert the special tokens into the input_ids.

Parameters
  • token_ids_0 (List[int]) – List of IDs.

  • token_ids_1 (List[int], optional) – Second list of IDs for sequence pairs. Default: None , meaning only a single sequence is used.

Returns

list of the tokens after inserting special tokens.
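For illustration, a minimal pure-Python sketch of the insertion logic described above, assuming the conventional Llama behavior of prepending bos_token (and optionally appending eos_token) to each sequence; the IDs 1 and 2 for bos/eos and the standalone function form are assumptions, not the mindformers implementation:

```python
def build_inputs_with_special_tokens(token_ids_0, token_ids_1=None,
                                     bos_id=1, eos_id=2,
                                     add_bos_token=True, add_eos_token=False):
    """Sketch: wrap each sequence with bos/eos IDs per the tokenizer flags."""
    bos = [bos_id] if add_bos_token else []
    eos = [eos_id] if add_eos_token else []
    output = bos + token_ids_0 + eos
    if token_ids_1 is not None:
        # For sequence pairs, the second sequence gets the same wrapping.
        output = output + bos + token_ids_1 + eos
    return output

# With the defaults (add_bos_token=True, add_eos_token=False), a single
# sequence simply gains a leading bos ID:
print(build_inputs_with_special_tokens([15043, 445]))  # [1, 15043, 445]
```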

save_vocabulary(save_directory: str, filename_prefix: Optional[str] = None)[source]

Saves the vocabulary to the specified directory. This method is used to export the vocabulary file from the slow tokenizer.

Parameters
  • save_directory (str) – The directory where the vocabulary will be saved.

  • filename_prefix (str, optional) – The prefix for the saved files. Default: None .

Returns

A tuple containing the paths of the saved vocabulary files.
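A hedged sketch of the behavior described above, assuming the common pattern of copying the SentencePiece vocabulary file into the target directory; the output file name "tokenizer.model" and the standalone function form are assumptions for illustration:

```python
import os
import shutil

def save_vocabulary(vocab_file, save_directory, filename_prefix=None):
    """Sketch: copy the vocabulary file into save_directory, with an
    optional file-name prefix, and return the saved path(s) as a tuple."""
    if not os.path.isdir(save_directory):
        # Mirrors the documented ValueError for a missing save directory.
        raise ValueError(f"{save_directory} is not a directory")
    prefix = "" if filename_prefix is None else filename_prefix + "-"
    out_path = os.path.join(save_directory, prefix + "tokenizer.model")
    shutil.copyfile(vocab_file, out_path)
    return (out_path,)
```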

Raises
  • ValueError – Raised if the vocabulary cannot be saved from a fast tokenizer, or if the specified save directory does not exist.

slow_tokenizer_class

alias of mindformers.models.llama.llama_tokenizer.LlamaTokenizer

update_post_processor()[source]

Updates the underlying post processor with the current bos_token and eos_token.

Raises
  • ValueError – Raised if add_bos_token or add_eos_token is set but the corresponding token is None.
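A minimal sketch of the validation and template rebuild this method performs, assuming the TemplateProcessing-style "single" template convention used by fast tokenizers (e.g. "<s>:0 $A:0"); the function name and return value are illustrative, not mindformers' actual implementation:

```python
def rebuild_single_template(bos_token, eos_token,
                            add_bos_token=True, add_eos_token=False):
    """Sketch: validate token settings, then derive the post processor's
    single-sequence template from the current bos/eos tokens."""
    if add_bos_token and bos_token is None:
        # Mirrors the documented ValueError condition.
        raise ValueError("add_bos_token = True but bos_token = None")
    if add_eos_token and eos_token is None:
        raise ValueError("add_eos_token = True but eos_token = None")
    parts = []
    if add_bos_token:
        parts.append(f"{bos_token}:0")
    parts.append("$A:0")  # placeholder for the encoded sequence
    if add_eos_token:
        parts.append(f"{eos_token}:0")
    return " ".join(parts)
```

This is why the note above advises calling update_post_processor() after changing bos_token or eos_token: the template is built from a snapshot of those tokens and does not track later changes on its own.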