mindformers.models.LlamaTokenizer
- class mindformers.models.LlamaTokenizer(vocab_file, unk_token='<unk>', bos_token='<s>', eos_token='</s>', pad_token='<unk>', sp_model_kwargs: Optional[Dict[str, Any]] = None, add_bos_token=True, add_eos_token=False, clean_up_tokenization_spaces=False, legacy=True, **kwargs)
Construct a Llama tokenizer based on byte-level Byte-Pair-Encoding.
The original model has no dedicated padding token; in this class, pad_token defaults to "<unk>".
- Parameters
vocab_file (str) – Path to the vocabulary file.
unk_token (Union[str, AddedToken], optional) – The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to this token instead. Default: "<unk>".
bos_token (Union[str, AddedToken], optional) – The beginning-of-sequence token that was used during pretraining. Can be used as a sequence classifier token. Default: "<s>".
eos_token (Union[str, AddedToken], optional) – The end-of-sequence token. Default: "</s>".
pad_token (Union[str, AddedToken], optional) – A special token used to make arrays of tokens the same size for batching purposes. It is then ignored by attention mechanisms and loss computation. Default: "<unk>".
sp_model_kwargs (Dict[str, Any], optional) – Keyword arguments passed to the SentencePieceProcessor.__init__() method. The Python wrapper for SentencePiece exposes several options that can be set this way. Default: None, in which case an empty dict is passed.
add_bos_token (bool, optional) – Whether to add a bos_token at the start of sequences. Default: True.
add_eos_token (bool, optional) – Whether to add an eos_token at the end of sequences. Default: False.
clean_up_tokenization_spaces (bool, optional) – Whether to clean up spaces after decoding. Cleanup includes removing potential artifacts such as extra spaces. Default: False.
legacy (bool, optional) – Whether the legacy behavior of the tokenizer should be used. Default: True.
- Returns
A LlamaTokenizer instance.
Examples
>>> from mindformers import LlamaTokenizer
>>> tokenizer = LlamaTokenizer.from_pretrained("llama2_7b")
>>> res = tokenizer("hello world")
>>> print(res)
{'input_ids': [1, 27701, 924], 'attention_mask': [1, 1, 1]}
>>> res = tokenizer("hello world", padding='max_length', max_length=10)
>>> print(res)
{'input_ids': [1, 27701, 924, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]}
>>> res = tokenizer("hello world", return_tensors='ms')
>>> print(res)
{'input_ids': Tensor(shape=[3], dtype=Int32, value= [ 1, 27701, 924]), 'attention_mask': Tensor(shape=[3], dtype=Int32, value= [1, 1, 1])}
- build_inputs_with_special_tokens(token_ids_0: List[int], token_ids_1: Optional[List[int]] = None)
Insert the special tokens into the input_ids. Currently, this method adds bos_token and eos_token to the head and end of the sequence, respectively.
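The following is a minimal pure-Python sketch of this behavior, not the library's implementation. The function name build_inputs_sketch and the add_bos / add_eos arguments are hypothetical stand-ins for the add_bos_token and add_eos_token settings. The ID 1 for bos_token matches the example output above; the ID 2 for eos_token and the handling of a second sequence are assumptions.

```python
from typing import List, Optional

BOS_ID = 1  # matches the leading 1 in the example output above
EOS_ID = 2  # assumed ID of '</s>'; check the actual vocabulary


def build_inputs_sketch(
    token_ids_0: List[int],
    token_ids_1: Optional[List[int]] = None,
    add_bos: bool = True,
    add_eos: bool = False,
) -> List[int]:
    """Illustrative only: wrap each sequence with optional BOS/EOS IDs."""
    bos = [BOS_ID] if add_bos else []
    eos = [EOS_ID] if add_eos else []
    output = bos + token_ids_0 + eos
    if token_ids_1 is not None:
        # Assumption: the second sequence is wrapped the same way as the first.
        output = output + bos + token_ids_1 + eos
    return output


print(build_inputs_sketch([27701, 924]))
print(build_inputs_sketch([27701, 924], add_eos=True))
```

With the defaults (add_bos=True, add_eos=False) only the BOS ID is prepended, which is consistent with the three-element input_ids shown in the example above.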
- create_token_type_ids_from_sequences(token_ids_0: List[int], token_ids_1: Optional[List[int]] = None)
Creates a mask from the two sequences passed to be used in a sequence-pair classification task. An ALBERT sequence pair mask has the following format:
0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
| first sequence    | second sequence |
If token_ids_1 is None, only the first portion of the mask (0s) is returned.
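- Parameters
token_ids_0 (List[int]) – List of token IDs.
token_ids_1 (List[int], optional) – Second list of token IDs for sequence pairs. Default: None.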
- Returns
A list of integers 0 and 1 built from the given sequence(s), where 0 marks tokens in token_ids_0 and 1 marks tokens in token_ids_1.
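As a rough illustration of the mask layout described above, the sketch below builds the 0/1 list in plain Python. It is not the library's code: the name token_type_ids_sketch and the add_bos / add_eos arguments are hypothetical, and counting the special-token positions inside each segment is an assumption.

```python
from typing import List, Optional


def token_type_ids_sketch(
    token_ids_0: List[int],
    token_ids_1: Optional[List[int]] = None,
    add_bos: bool = True,
    add_eos: bool = False,
) -> List[int]:
    """Illustrative only: 0 for the first segment, 1 for the second."""
    # Assumption: each segment's length includes its own BOS/EOS positions.
    extra = int(add_bos) + int(add_eos)
    mask = [0] * (len(token_ids_0) + extra)
    if token_ids_1 is not None:
        mask += [1] * (len(token_ids_1) + extra)
    return mask


print(token_type_ids_sketch([27701, 924]))
print(token_type_ids_sketch([27701, 924], [27701]))
```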
- get_special_tokens_mask(token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False)
Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding special tokens using the tokenizer prepare_for_model method.
- Parameters
token_ids_0 (List[int]) – List of token IDs.
token_ids_1 (List[int], optional) – Second list of token IDs for sequence pairs. Default: None, only use one sequence.
already_has_special_tokens (bool, optional) – Whether the token list is already formatted with special tokens for the model. Default: False.
- Returns
A list of integers 0 and 1, where 1 marks a special token and 0 marks a sequence token.
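The returned mask can be pictured with the short sketch below. It is a plain-Python approximation under stated assumptions, not the library's implementation: the name special_tokens_mask_sketch, the add_bos / add_eos arguments, and the special_ids tuple (1 for bos_token as in the example above, 2 assumed for eos_token) are all hypothetical.

```python
from typing import List, Optional, Sequence


def special_tokens_mask_sketch(
    token_ids_0: List[int],
    token_ids_1: Optional[List[int]] = None,
    already_has_special_tokens: bool = False,
    add_bos: bool = True,
    add_eos: bool = False,
    special_ids: Sequence[int] = (1, 2),  # assumed BOS/EOS IDs
) -> List[int]:
    """Illustrative only: 1 marks a special token, 0 a sequence token."""
    if already_has_special_tokens:
        # The input already contains special tokens, so mark them by ID.
        return [1 if t in special_ids else 0 for t in token_ids_0]
    bos = [1] if add_bos else []
    eos = [1] if add_eos else []
    mask = bos + [0] * len(token_ids_0) + eos
    if token_ids_1 is not None:
        # Assumption: the second sequence gets its own BOS/EOS positions.
        mask += bos + [0] * len(token_ids_1) + eos
    return mask


print(special_tokens_mask_sketch([27701, 924]))
print(special_tokens_mask_sketch([1, 27701, 924], already_has_special_tokens=True))
```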