mindformers.models.LlamaTokenizer

class mindformers.models.LlamaTokenizer(vocab_file, unk_token='<unk>', bos_token='<s>', eos_token='</s>', pad_token='<unk>', sp_model_kwargs: Optional[Dict[str, Any]] = None, add_bos_token=True, add_eos_token=False, clean_up_tokenization_spaces=False, legacy=True, **kwargs)[source]

Construct a Llama tokenizer based on byte-level Byte-Pair-Encoding.

The default padding token is unset, as there is no padding token in the original model.

Parameters
  • vocab_file (str) – Path to the vocabulary file.

  • unk_token (Union[str, AddedToken], optional) – The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead. Default: "<unk>" .

  • bos_token (Union[str, AddedToken], optional) – The beginning-of-sequence token that was used during pretraining. It can be used as a sequence classifier token. Default: "<s>" .

  • eos_token (Union[str, AddedToken], optional) – The end of sequence token. Default: "</s>" .

  • pad_token (Union[str, AddedToken], optional) – A special token used to make arrays of tokens the same size for batching purposes. It will then be ignored by attention mechanisms and loss computation. Default: "<unk>" .

  • sp_model_kwargs (Dict[str, Any], optional) – Keyword arguments passed to the SentencePieceProcessor.__init__() method. The Python wrapper for SentencePiece can be used, among other things, to configure subword-regularization options. Default: None , in which case an empty dict is passed.

  • add_bos_token (bool, optional) – Whether to add a bos_token at the start of sequences. Default: True .

  • add_eos_token (bool, optional) – Whether to add an eos_token at the end of sequences. Default: False .

  • clean_up_tokenization_spaces (bool, optional) – Whether to clean up spaces after decoding. Cleanup includes removing potential artifacts like extra spaces. Default: False .

  • use_default_system_prompt (bool, optional) – Whether the default system prompt for Llama should be used. Default: False .

  • spaces_between_special_tokens (bool, optional) – Whether to add spaces between special tokens. Default: False .

  • legacy (bool, optional) – Whether the legacy behavior of the tokenizer should be used. Default: True .

Returns

A LlamaTokenizer instance.

Examples

>>> from mindformers import LlamaTokenizer
>>> tokenizer = LlamaTokenizer.from_pretrained("llama_7b")
>>> res = tokenizer("hello world")
>>> print(res)
{'input_ids': [1, 27701, 924], 'attention_mask': [1, 1, 1]}
>>> res = tokenizer("hello world", padding='max_length', max_length=10)
>>> print(res)
{'input_ids': [1, 27701, 924, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]}
>>> res = tokenizer("hello world", return_tensors='ms')
>>> print(res)
{'input_ids': Tensor(shape=[3], dtype=Int32, value= [    1, 27701,  924]), 'attention_mask': Tensor(shape=[3], dtype=Int32, value= [1, 1, 1])}
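
The add_bos_token and add_eos_token flags control which special tokens are inserted around the sequence. The following sketch is illustrative only; it assumes from_pretrained forwards keyword arguments to the constructor and that the checkpoint maps eos_token to ID 2, as in the original Llama vocabulary:

>>> tokenizer = LlamaTokenizer.from_pretrained("llama_7b", add_eos_token=True)
>>> print(tokenizer("hello world")['input_ids'])
[1, 27701, 924, 2]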
build_inputs_with_special_tokens(token_ids_0: List[int], token_ids_1: Optional[List[int]] = None)[source]

Insert special tokens into the input IDs. Currently, this method adds bos_token and eos_token to the beginning and end of the sequence, respectively.

Parameters
  • token_ids_0 (List[int]) – List of token IDs.

  • token_ids_1 (List[int], optional) – Second list of token IDs for sequence pairs. Default: None , only use one sequence.

Returns

A list of token IDs with the special tokens inserted.
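
A minimal usage sketch, reusing the token IDs from the class-level example above; with the default add_bos_token=True and add_eos_token=False, only the BOS ID (1 in that example) is prepended:

>>> from mindformers import LlamaTokenizer
>>> tokenizer = LlamaTokenizer.from_pretrained("llama_7b")
>>> print(tokenizer.build_inputs_with_special_tokens([27701, 924]))
[1, 27701, 924]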

create_token_type_ids_from_sequences(token_ids_0: List[int], token_ids_1: Optional[List[int]] = None)[source]

Create a mask from the two sequences passed, to be used in a sequence-pair classification task. The sequence-pair mask has the following format:

0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
|   first sequence   | second sequence  |

If token_ids_1 is None, only the first portion of the mask (0s) is returned.

Parameters
  • token_ids_0 (List[int]) – List of token IDs.

  • token_ids_1 (List[int], optional) – Second list of token IDs for sequence pairs. Default: None , only use one sequence.

Returns

A list of 0s and 1s corresponding to the given sequence(s): 0 for tokens in token_ids_0 and 1 for tokens in token_ids_1.
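
A hedged sketch of the expected output, assuming the default add_bos_token=True so that each segment's prepended BOS position is counted in its portion of the mask:

>>> print(tokenizer.create_token_type_ids_from_sequences([27701, 924], [27701, 924]))
[0, 0, 0, 1, 1, 1]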

get_special_tokens_mask(token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False)[source]

Retrieve a mask identifying special tokens in a token list. This method is called when adding special tokens using the tokenizer's prepare_for_model method.

Parameters
  • token_ids_0 (List[int]) – List of token IDs.

  • token_ids_1 (List[int], optional) – Second list of token IDs for sequence pairs. Default: None , only use one sequence.

  • already_has_special_tokens (bool, optional) – Whether the token list is already formatted with special tokens for the model. Default: False .

Returns

A list of 0s and 1s, where 1 marks a special token and 0 marks a sequence token.
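
A hedged sketch under the default settings: with add_bos_token=True, the position of the BOS token that would be prepended is marked 1, and the two sequence tokens are marked 0:

>>> print(tokenizer.get_special_tokens_mask([27701, 924]))
[1, 0, 0]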