mindformers.models.PreTrainedTokenizer

class mindformers.models.PreTrainedTokenizer(**kwargs)[source]

Base class for all slow tokenizers.

Handles all the shared methods for tokenization and special tokens, as well as the methods for downloading/caching/loading pretrained tokenizers and for adding tokens to the vocabulary.

This class also contains the added tokens in a unified way on top of all tokenizers, so we don't have to handle the specific vocabulary augmentation methods of the various underlying dictionary structures (BPE, sentencepiece…).

Note

Initialize the basic configuration of the tokenizer.

Steps:

  1. Initialize the parent class.

  2. Initialize _added_tokens_decoder if the subclass has not already done so.

  3. Use the passed added_tokens_decoder to update the _added_tokens_decoder.

  4. Add special tokens that are not yet in the vocabulary, in the same order as SPECIAL_TOKENS_ATTRIBUTES in tokenizers.

Characteristics:

  1. Ensure that all special tokens are added to the vocabulary, even if they were not originally in it (see the sketch after this list).

  2. Use a Trie structure to store tokens.
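
As a hedged illustration of step 4 and characteristic 1 (it reuses the "llama_7b" checkpoint from the Examples below; the '<tool>' token is an assumption, not part of any shipped vocabulary):

>>> from mindformers import LlamaTokenizer
>>> tokenizer = LlamaTokenizer.from_pretrained("llama_7b")
>>> # add_special_tokens registers tokens that are absent from the vocabulary
>>> num_added = tokenizer.add_special_tokens({'additional_special_tokens': ['<tool>']})
>>> '<tool>' in tokenizer.get_added_vocab()
True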

Parameters
  • model_max_length (int, optional) – The maximum length (in number of tokens) for the inputs to the transformer model. Set when the tokenizer is loaded with from_pretrained() based on the model's max_model_input_sizes attribute. Default: 1e30 .

  • padding_side (str, optional) – Specifies the side on which the model should have padding applied. Options are ['right', 'left']. The default value is picked from the class attribute of the same name.

  • truncation_side (str, optional) – Specifies the side on which the model should have truncation applied. Options are ['right', 'left']. The default value is picked from the class attribute of the same name.

  • chat_template (str, optional) – A Jinja template string used to format lists of chat messages. Default: "{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}" .

  • model_input_names (List[str], optional) – Lists the names of inputs accepted by the forward pass of the model, such as "token_type_ids" or "attention_mask". Defaults to values picked from the class attribute of the same name. Default: None .

  • bos_token (Union[str, tokenizers.AddedToken], optional) – Represents the beginning of a sentence and is associated with self.bos_token and self.bos_token_id. Default: None .

  • eos_token (Union[str, tokenizers.AddedToken], optional) – Represents the end of a sentence and is associated with self.eos_token and self.eos_token_id. Default: None .

  • unk_token (Union[str, tokenizers.AddedToken], optional) – Represents an out-of-vocabulary token and is associated with self.unk_token and self.unk_token_id. Default: None .

  • sep_token (Union[str, tokenizers.AddedToken], optional) – A special token separating two different sentences in the same input (used by BERT, for example) and is associated with self.sep_token and self.sep_token_id. Default: None .

  • pad_token (Union[str, tokenizers.AddedToken], optional) – Used to make arrays of tokens the same size for batching purposes and will be ignored by attention mechanisms or loss computation. It is associated with self.pad_token and self.pad_token_id. Default: None .

  • cls_token (Union[str, tokenizers.AddedToken], optional) – Represents the class of the input (used by BERT, for example) and is associated with self.cls_token and self.cls_token_id. Default: None .

  • mask_token (Union[str, tokenizers.AddedToken], optional) – Represents a masked token (used by masked-language modeling pretraining objectives like BERT) and is associated with self.mask_token and self.mask_token_id. Default: None .

  • additional_special_tokens (Union[tuple, list, tokenizers.AddedToken], optional) – Lists additional special tokens that are ensured to be skipped when decoding with skip_special_tokens set to True. They will be added at the end of the vocabulary if not already part of it. Default: None .

  • clean_up_tokenization_spaces (bool, optional) – Determines whether to clean-up spaces that were added when splitting the input text during the tokenization process. Default: True .

  • split_special_tokens (bool, optional) – Specifies whether special tokens should be split during the tokenization process. This affects the internal state of the tokenizer. By default, special tokens are not split. For example, if '<s>' is the bos_token, then tokenizer.tokenize("<s>") results in ['<s>']. If split_special_tokens is True, then tokenizer.tokenize("<s>") results in ['<', 's', '>'] (see the last example in the Examples section below). Default: False .

Returns

PreTrainedTokenizer instance.

Examples

>>> from mindformers import LlamaTokenizer
>>> tokenizer = LlamaTokenizer.from_pretrained("llama_7b")
>>> res = tokenizer("hello world")
>>> print(res)
{'input_ids': [1, 27701, 924], 'attention_mask': [1, 1, 1]}
>>> res = tokenizer("hello world", padding='max_length', max_length=10)
>>> print(res)
{'input_ids': [1, 27701, 924, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]}
>>> res = tokenizer("hello world", return_tensors='ms')
>>> print(res)
{'input_ids': Tensor(shape=[3], dtype=Int32, value= [    1, 27701,  924]), 'attention_mask': Tensor(shape=[3],
dtype=Int32, value= [1, 1, 1])}
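
The effect of split_special_tokens described in the parameter list can be sketched as follows (a hedged example: it assumes '<s>' is the bos_token of the checkpoint and that the flag can be passed through from_pretrained; the exact sub-token split depends on the vocabulary):

>>> tokenizer = LlamaTokenizer.from_pretrained("llama_7b")
>>> tokenizer.tokenize("<s>")
['<s>']
>>> tokenizer = LlamaTokenizer.from_pretrained("llama_7b", split_special_tokens=True)
>>> tokenizer.tokenize("<s>")
['<', 's', '>']
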
property added_tokens_decoder: Dict[int, tokenizers.AddedToken]

Returns the added tokens in the vocabulary as a dictionary of index to AddedToken.

Returns

A dict, the added tokens.
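
A hedged sketch of inspecting this mapping (reusing the tokenizer from the Examples above; concrete ids depend on what has been added):

>>> decoder = tokenizer.added_tokens_decoder
>>> # keys are vocabulary ids, values are tokenizers.AddedToken objects
>>> all(isinstance(k, int) for k in decoder)
True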

property added_tokens_encoder: Dict[str, int]

Returns the sorted mapping from string to index. The added tokens encoder is cached for performance optimization in self._added_tokens_encoder for the slow tokenizers.

Returns

A dict, the added tokens.
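
The encoder is the inverse view of added_tokens_decoder (a hedged sketch; it assumes AddedToken.content holds the token string):

>>> encoder = tokenizer.added_tokens_encoder
>>> all(encoder[tok.content] == idx for idx, tok in tokenizer.added_tokens_decoder.items())
True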

convert_ids_to_tokens(ids: Union[int, List[int]], skip_special_tokens: bool = False)[source]

Converts a single index or a sequence of indices to a token or a sequence of tokens, using the vocabulary and added tokens.

Parameters
  • ids (Union[int, List[int]]) – The token id (or token ids) to convert to tokens.

  • skip_special_tokens (bool, optional) – Whether to remove special tokens in the decoding. Default: False .

Returns

Str or List[str], the decoded token(s).
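
A hedged sketch reusing the ids from the class example above (the exact token strings depend on the loaded vocabulary):

>>> tokens = tokenizer.convert_ids_to_tokens([1, 27701, 924])
>>> # a single id returns a single token string rather than a list
>>> bos_token = tokenizer.convert_ids_to_tokens(1)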

convert_tokens_to_ids(tokens: Union[str, List[str]])[source]

Converts a token string (or a sequence of tokens) to a single integer id (or a sequence of ids), using the vocabulary.

Parameters

tokens (Union[str, List[str]]) – One or several token(s) to convert to token id(s).

Returns

ids, the token id or list of token ids, type is int or List[int].
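
Converting the tokens back round-trips the ids (a hedged sketch continuing the convert_ids_to_tokens snippet above):

>>> tokenizer.convert_tokens_to_ids(tokens) == [1, 27701, 924]
True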

get_added_vocab()[source]

Returns the added tokens in the vocabulary as a dictionary of token to index.

Returns

A dict, the added tokens.
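
A hedged sketch (continuing the add_special_tokens example in the Note above, where the hypothetical '<tool>' token was added):

>>> added = tokenizer.get_added_vocab()
>>> '<tool>' in added
True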

num_special_tokens_to_add(pair: bool = False)[source]

Returns the number of added tokens when encoding a sequence with special tokens.

Note

This encodes a dummy input and checks the number of added tokens, and is therefore not efficient. Do not put this inside your training loop.

Parameters

pair (bool, optional) – Whether the number of added tokens should be computed in the case of a sequence pair or a single sequence. Default: False .

Returns

Number of special tokens added to sequences.
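
A hedged sketch (for a LLaMA-style tokenizer that only prepends a bos_token, both counts would typically be 1; actual values depend on the model):

>>> n_single = tokenizer.num_special_tokens_to_add()
>>> n_pair = tokenizer.num_special_tokens_to_add(pair=True)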

prepare_for_tokenization(text: str, **kwargs)[source]

Performs any necessary transformations before tokenization.

This method should pop its arguments from kwargs and return the remaining kwargs as well; the kwargs are checked at the end of the encoding process to be sure all the arguments have been used.

Parameters
  • text (str) – The text to prepare.

  • kwargs (Any, optional) – Keyword arguments to use for the tokenization.

Returns

Tuple[str, Dict[str, Any]], the prepared text and the unused kwargs.
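
A minimal sketch of how a subclass might override this hook (the lowercase keyword is hypothetical, not part of the API):

>>> from mindformers.models import PreTrainedTokenizer
>>> class MyTokenizer(PreTrainedTokenizer):
...     def prepare_for_tokenization(self, text, **kwargs):
...         # pop the arguments this subclass understands...
...         if kwargs.pop("lowercase", False):
...             text = text.lower()
...         # ...and return the transformed text plus the kwargs left over
...         return (text, kwargs)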

tokenize(text: TextInput, pair: Optional[str] = None, add_special_tokens: bool = False, **kwargs)[source]

Converts a string to a sequence of tokens, using the tokenizer.

Splits into words for word-based vocabularies or sub-words for sub-word-based vocabularies (BPE/SentencePiece/WordPiece). Takes care of added tokens.

Parameters
  • text (TextInput) – The sequence to be encoded.

  • pair (str, optional) – A second sequence to be encoded with the first. Default: None .

  • add_special_tokens (bool, optional) – Whether to add the special tokens associated with the corresponding model. Default: False .

  • kwargs (Any, optional) – Will be passed to the underlying model-specific encode method. See details in PreTrainedTokenizerBase.__call__.

Returns

tokenized_text, the list of tokens, type is List[str].
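
A hedged sketch (sub-word boundaries depend on the vocabulary; with add_special_tokens=True the model's special tokens, e.g. a leading bos_token, are inserted as well):

>>> tokens = tokenizer.tokenize("hello world")
>>> with_special = tokenizer.tokenize("hello world", add_special_tokens=True)
>>> len(with_special) >= len(tokens)
True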