mindformers.models.LlamaTokenizerFast
- class mindformers.models.LlamaTokenizerFast(vocab_file=None, tokenizer_file=None, clean_up_tokenization_spaces=False, unk_token='<unk>', bos_token='<s>', eos_token='</s>', add_bos_token=True, add_eos_token=False, use_default_system_prompt=False, **kwargs)[source]
Construct a Llama tokenizer. Based on byte-level Byte-Pair-Encoding.
Notably, it uses ByteFallback and no normalization.
Note
Currently, LlamaTokenizerFast supports only the 'right' padding mode (padding_side = "right").
Note
If you want to change the bos_token or the eos_token, make sure to specify them when initializing the model, or call tokenizer.update_post_processor() to make sure that the post-processing is correctly done (otherwise the values of the first token and final token of an encoded sequence will not be correct).
- Parameters
  - vocab_file (str, optional) – SentencePiece file (generally with a .model extension) that contains the vocabulary necessary to instantiate a tokenizer. Default: None.
  - tokenizer_file (str, optional) – Tokenizers file (generally with a .json extension) that contains everything needed to load the tokenizer. Default: None.
  - clean_up_tokenization_spaces (bool, optional) – Whether to clean up spaces after decoding; cleanup consists of removing potential artifacts such as extra spaces. Default: False.
  - unk_token (Union[str, tokenizers.AddedToken], optional) – The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to this token instead. Default: "<unk>".
  - bos_token (Union[str, tokenizers.AddedToken], optional) – The beginning-of-sequence token that was used during pretraining. Can be used as a sequence classifier token. Default: "<s>".
  - eos_token (Union[str, tokenizers.AddedToken], optional) – The end-of-sequence token. Default: "</s>".
  - add_bos_token (bool, optional) – Whether to add a bos_token at the start of sequences. Default: True.
  - add_eos_token (bool, optional) – Whether to add an eos_token at the end of sequences. Default: False.
  - use_default_system_prompt (bool, optional) – Whether the default system prompt for Llama should be used. Default: False.
- Returns
  LlamaTokenizerFast, a LlamaTokenizerFast instance.
Examples
>>> from mindformers import LlamaTokenizerFast
>>>
>>> tokenizer = LlamaTokenizerFast(vocab_file="./llama2/tokenizer.model")
>>> tokenizer.encode("Hello this is a test")
[1, 15043, 445, 338, 263, 1243]
- build_inputs_with_special_tokens(token_ids_0, token_ids_1=None)[source]
Insert the special tokens into the input_ids.
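The behavior can be sketched in pure Python (a minimal illustration, assuming bos/eos token IDs 1 and 2 as in the standard Llama vocabulary, and the defaults add_bos_token=True, add_eos_token=False):

```python
def build_inputs_with_special_tokens(token_ids_0, token_ids_1=None,
                                     add_bos_token=True, add_eos_token=False,
                                     bos_token_id=1, eos_token_id=2):
    """Sketch of how Llama-style tokenizers wrap sequences with special tokens.

    With the defaults, a single sequence becomes [bos] + ids_0, and a pair
    becomes [bos] + ids_0 + [bos] + ids_1.
    """
    bos = [bos_token_id] if add_bos_token else []
    eos = [eos_token_id] if add_eos_token else []
    output = bos + token_ids_0 + eos
    if token_ids_1 is not None:
        output = output + bos + token_ids_1 + eos
    return output

print(build_inputs_with_special_tokens([15043, 445]))  # [1, 15043, 445]
```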
- save_vocabulary(save_directory: str, filename_prefix: Optional[str] = None)[source]
Saves the vocabulary to the specified directory. This method is used to export the vocabulary file from the slow tokenizer.
- Parameters
  - save_directory (str) – The directory in which to save the vocabulary file.
  - filename_prefix (str, optional) – An optional prefix to add to the name of the saved file. Default: None.
- Returns
A tuple containing the paths of the saved vocabulary files.
- Raises
ValueError – Raised if the vocabulary cannot be saved from a fast tokenizer, or if the specified save directory does not exist.
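A typical implementation of this export can be sketched as follows (an illustration only, not the mindformers source; the filename constant and error messages are assumptions):

```python
import os
import shutil

# Conventional SentencePiece vocabulary filename (assumption for illustration).
VOCAB_FILE_NAME = "tokenizer.model"

def save_vocabulary(vocab_file, save_directory, filename_prefix=None):
    """Sketch: validate the target directory, then copy the slow tokenizer's
    SentencePiece vocabulary file into it, returning the saved path(s)."""
    if vocab_file is None or not os.path.isfile(vocab_file):
        # A fast tokenizer loaded only from tokenizer.json has no slow
        # vocabulary file to export.
        raise ValueError("Cannot save vocabulary from a fast tokenizer: "
                         "no slow-tokenizer vocab file available.")
    if not os.path.isdir(save_directory):
        raise ValueError(f"Save directory ({save_directory}) does not exist.")
    prefix = f"{filename_prefix}-" if filename_prefix else ""
    out_path = os.path.join(save_directory, prefix + VOCAB_FILE_NAME)
    shutil.copyfile(vocab_file, out_path)
    return (out_path,)
```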
- slow_tokenizer_class
alias of mindformers.models.llama.llama_tokenizer.LlamaTokenizer
- update_post_processor()[source]
Updates the underlying post processor with the current bos_token and eos_token.
- Raises
ValueError – Raised if add_bos_token or add_eos_token is set but the corresponding token is None.