mindformers.models.PreTrainedTokenizer
- class mindformers.models.PreTrainedTokenizer(**kwargs)
Base class for all slow tokenizers.
Handles the shared methods for tokenization and special tokens, the methods for downloading, caching, and loading pretrained tokenizers, and the methods for adding tokens to the vocabulary.
This class also contains the added tokens in a unified way on top of all tokenizers, so we don't have to handle the specific vocabulary augmentation methods of the various underlying dictionary structures (BPE, sentencepiece…).
Note
Initialize the basic configuration of the tokenizer.
Steps:
Initialize the parent class.
If the subclass has not already initialized _added_tokens_decoder, initialize it.
Use the passed added_tokens_decoder to update the _added_tokens_decoder.
Add special tokens that are not in the vocabulary to the vocabulary in the same order as the SPECIAL_TOKENS_ATTRIBUTES in tokenizers.
Characteristics:
Ensure that all special tokens are added to the vocabulary, even if they were not originally in the vocabulary.
Use a Trie structure to store tokens.
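The Trie mentioned above can be pictured with a small, self-contained sketch. The class below is illustrative only (TinyTrie and its methods are hypothetical names, not the mindformers implementation): it stores added tokens character by character and splits text so that the longest added token matching at each position is kept whole.

```python
class TinyTrie:
    """Toy character trie; an illustration, not the mindformers class."""

    def __init__(self):
        self.root = {}

    def add(self, token):
        node = self.root
        for ch in token:
            node = node.setdefault(ch, {})
        node[""] = True  # the empty key marks the end of a stored token

    def split(self, text):
        """Split text so that stored tokens come out as separate pieces."""
        pieces, start, i = [], 0, 0
        while i < len(text):
            node, j, end = self.root, i, None
            # Walk the trie from position i, remembering the longest match.
            while j < len(text) and text[j] in node:
                node = node[text[j]]
                j += 1
                if "" in node:
                    end = j
            if end is None:
                i += 1
                continue
            if start < i:
                pieces.append(text[start:i])  # plain text before the match
            pieces.append(text[i:end])        # the added token itself
            start = i = end
        if start < len(text):
            pieces.append(text[start:])
        return pieces


trie = TinyTrie()
for tok in ["<s>", "</s>", "<pad>"]:
    trie.add(tok)

pieces = trie.split("<s>hello world</s>")
print(pieces)  # ['<s>', 'hello world', '</s>']
```

A trie makes this split a single left-to-right pass over the text, regardless of how many tokens have been added.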
- Parameters
**kwargs (Any) – Keyword arguments.
model_max_length (int, optional): The maximum length (in number of tokens) for the inputs to the transformer model. Set when the tokenizer is loaded with from_pretrained() based on the model's max_model_input_sizes attribute. Default: 1e30.
padding_side (str, optional): Specifies the side on which the model should have padding applied. Options are ['right', 'left']. The default value is picked from the class attribute of the same name.
truncation_side (str, optional): Specifies the side on which the model should have truncation applied. Options are ['right', 'left']. The default value is picked from the class attribute of the same name.
chat_template (str, optional): A Jinja template string used to format lists of chat messages. Default: "{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}".
model_input_names (List[str], optional): Lists the names of inputs accepted by the forward pass of the model, such as "token_type_ids" or "attention_mask". Defaults to values picked from the class attribute of the same name. Default: None.
bos_token (Union[str, tokenizers.AddedToken], optional): Represents the beginning of a sentence and is associated with self.bos_token and self.bos_token_id. Default: None.
eos_token (Union[str, tokenizers.AddedToken], optional): Represents the end of a sentence and is associated with self.eos_token and self.eos_token_id. Default: None.
unk_token (Union[str, tokenizers.AddedToken], optional): Represents an out-of-vocabulary token and is associated with self.unk_token and self.unk_token_id. Default: None.
sep_token (Union[str, tokenizers.AddedToken], optional): A special token separating two different sentences in the same input (used by BERT, for example) and is associated with self.sep_token and self.sep_token_id. Default: None.
pad_token (Union[str, tokenizers.AddedToken], optional): Used to make arrays of tokens the same size for batching purposes and will be ignored by attention mechanisms or loss computation. It is associated with self.pad_token and self.pad_token_id. Default: None.
cls_token (Union[str, tokenizers.AddedToken], optional): Represents the class of the input (used by BERT, for example) and is associated with self.cls_token and self.cls_token_id. Default: None.
mask_token (Union[str, tokenizers.AddedToken], optional): Represents a masked token (used by masked-language modeling pretraining objectives like BERT) and is associated with self.mask_token and self.mask_token_id. Default: None.
additional_special_tokens (Union[tuple, list, tokenizers.AddedToken], optional): Lists additional special tokens that are ensured to be skipped when decoding with skip_special_tokens set to True. They will be added at the end of the vocabulary if not already part of it. Default: None.
clean_up_tokenization_spaces (bool, optional): Determines whether to clean up spaces that were added when splitting the input text during the tokenization process. Default: True.
split_special_tokens (bool, optional): Specifies whether special tokens should be split during the tokenization process. This affects the internal state of the tokenizer. By default, special tokens are not split. For example, if '<s>' is the bos_token, then tokenizer.tokenize("<s>") results in ['<s>']. If split_special_tokens is True, then tokenizer.tokenize("<s>") would result in ['<', 's', '>']. Default: False.
- Returns
PreTrainedTokenizer instance.
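To make the default chat_template concrete, the sketch below reproduces its output in plain Python rather than through Jinja. render_default_chat is a hypothetical helper written for illustration and is not part of mindformers; it assumes each message is a dict with "role" and "content" keys, as the template does.

```python
def render_default_chat(messages, add_generation_prompt=False):
    """Mimic the default chat_template: one <|im_start|>...<|im_end|> block per message."""
    out = []
    for message in messages:
        out.append("<|im_start|>" + message["role"] + "\n"
                   + message["content"] + "<|im_end|>" + "\n")
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        out.append("<|im_start|>assistant\n")
    return "".join(out)


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "hello world"},
]
prompt = render_default_chat(messages, add_generation_prompt=True)
print(prompt)
```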
Examples
>>> from mindformers import LlamaTokenizer
>>> tokenizer = LlamaTokenizer.from_pretrained("llama2_7b")
>>> res = tokenizer("hello world")
>>> print(res)
{'input_ids': [1, 27701, 924], 'attention_mask': [1, 1, 1]}
>>> res = tokenizer("hello world", padding='max_length', max_length=10)
>>> print(res)
{'input_ids': [1, 27701, 924, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]}
>>> res = tokenizer("hello world", return_tensors='ms')
>>> print(res)
{'input_ids': Tensor(shape=[3], dtype=Int32, value= [ 1, 27701, 924]), 'attention_mask': Tensor(shape=[3], dtype=Int32, value= [1, 1, 1])}
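The second call above pads to a fixed length. A minimal sketch of what right-side 'max_length' padding does to the two output fields, assuming a pad id of 0 as in the printed result (pad_right is a hypothetical helper, not a mindformers function):

```python
def pad_right(input_ids, max_length, pad_token_id=0):
    """Pad ids on the right and build the matching attention mask."""
    n_pad = max(0, max_length - len(input_ids))
    return {
        "input_ids": list(input_ids) + [pad_token_id] * n_pad,
        # 1 marks real tokens, 0 marks padding the model should ignore.
        "attention_mask": [1] * len(input_ids) + [0] * n_pad,
    }


padded = pad_right([1, 27701, 924], max_length=10)
print(padded)
```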
- property added_tokens_decoder: Dict[int, tokenizers.AddedToken]
Returns the added tokens in the vocabulary as a dictionary of index to AddedToken.
- Returns
A dict, the added tokens.
- property added_tokens_encoder: Dict[str, int]
Returns the sorted mapping from string to index. The added tokens encoder is cached for performance optimisation in self._added_tokens_encoder for the slow tokenizers.
- Returns
A dict, the added tokens.
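The two properties describe the same added tokens from opposite directions. A minimal sketch of that relationship, using plain strings in place of tokenizers.AddedToken objects and made-up indices:

```python
# Stand-in for added_tokens_decoder: index -> token content.
# Real entries are AddedToken objects; the ids here are invented.
added_tokens_decoder = {32001: "<|im_end|>", 32000: "<|im_start|>"}

# Derive the encoder view: token content -> index, sorted by index.
added_tokens_encoder = {
    content: index
    for index, content in sorted(added_tokens_decoder.items())
}

print(added_tokens_encoder)  # {'<|im_start|>': 32000, '<|im_end|>': 32001}
```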
- convert_ids_to_tokens(ids: Union[int, List[int]], skip_special_tokens: bool = False)
Converts a single index or a sequence of indices into a token or a sequence of tokens, using the vocabulary and added tokens.
- convert_tokens_to_ids(tokens: Union[str, List[str]])
Converts a token string (or a sequence of tokens) into a single integer id (or a sequence of ids), using the vocabulary.
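The two conversions are inverses over the vocabulary. The sketch below illustrates the single-value versus sequence behavior and the effect of skip_special_tokens with a toy vocabulary. The functions are simplified stand-ins, not the library's code, and the fallback to an unknown token is an assumption of this sketch.

```python
vocab = {"<unk>": 0, "<s>": 1, "hello": 2, "world": 3}  # toy vocabulary
ids_to_tokens = {i: t for t, i in vocab.items()}
special_ids = {0, 1}


def convert_tokens_to_ids(tokens):
    """Accept one token or a list of tokens; unknown tokens map to <unk>."""
    if isinstance(tokens, str):
        return vocab.get(tokens, vocab["<unk>"])
    return [vocab.get(t, vocab["<unk>"]) for t in tokens]


def convert_ids_to_tokens(ids, skip_special_tokens=False):
    """Accept one id or a list of ids; optionally drop special tokens."""
    if isinstance(ids, int):
        return ids_to_tokens[ids]
    return [ids_to_tokens[i] for i in ids
            if not (skip_special_tokens and i in special_ids)]


ids = convert_tokens_to_ids(["<s>", "hello", "world"])
print(ids)                                                   # [1, 2, 3]
print(convert_ids_to_tokens(ids, skip_special_tokens=True))  # ['hello', 'world']
```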
- get_added_vocab()
Returns the added tokens in the vocabulary as a dictionary of token to index.
- Returns
A dict, the added tokens.
- num_special_tokens_to_add(pair: bool = False)
Returns the number of added tokens when encoding a sequence with special tokens.
Note
This encodes a dummy input and checks the number of added tokens, and is therefore not efficient. Do not put this inside your training loop.
- Parameters
pair (bool, optional) – Whether the number of added tokens should be computed in the case of a sequence pair or a single sequence. Default: False.
- Returns
Number of special tokens added to sequences.
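A sketch of the dummy-encoding idea described in the note, assuming a BERT-style layout ([CLS] A [SEP] for one sequence, [CLS] A [SEP] B [SEP] for a pair). Both functions are toy stand-ins, and the ids 101 and 102 are placeholders.

```python
CLS_ID, SEP_ID = 101, 102  # placeholder ids for this sketch


def build_inputs_with_special_tokens(ids_a, ids_b=None):
    """Wrap one or two id sequences in BERT-style special tokens."""
    out = [CLS_ID] + ids_a + [SEP_ID]
    if ids_b is not None:
        out += ids_b + [SEP_ID]
    return out


def num_special_tokens_to_add(pair=False):
    """Encode empty dummy inputs; whatever remains is special tokens."""
    return len(build_inputs_with_special_tokens([], [] if pair else None))


# Compute once, outside any training loop, and reuse the values.
n_single = num_special_tokens_to_add()
n_pair = num_special_tokens_to_add(pair=True)
print(n_single, n_pair)  # 2 3
```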
- prepare_for_tokenization(text: str, **kwargs)
Performs any necessary transformations before tokenization.
This method should pop the arguments from kwargs and return the remaining kwargs as well. We test the kwargs at the end of the encoding process to be sure all the arguments have been used.
- Parameters
text (str) – The text to prepare.
kwargs (Any, optional) – Keyword arguments to use for the tokenization.
- Returns
Tuple[str, Dict[str, Any]], the prepared text and the unused kwargs.
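A sketch of the pop-and-return contract described above. The function below is hypothetical: add_prefix_space is an illustrative option chosen for this example, not a documented argument of this class.

```python
from typing import Any, Dict, Tuple


def prepare_for_tokenization(text: str, **kwargs) -> Tuple[str, Dict[str, Any]]:
    """Consume the options this step understands and hand back the rest."""
    # Pop the option so it is not reported as unused later on.
    add_prefix_space = kwargs.pop("add_prefix_space", False)
    if add_prefix_space and not text.startswith(" "):
        text = " " + text
    return text, kwargs


text, unused = prepare_for_tokenization(
    "hello world", add_prefix_space=True, some_other_option=1
)
print(repr(text), unused)  # ' hello world' {'some_other_option': 1}
```

Anything left in the returned dictionary was not consumed by this step, which is what allows the caller to detect unused arguments at the end of encoding.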
- tokenize(text: TextInput, pair: Optional[str] = None, add_special_tokens: bool = False, **kwargs)
Converts a string into a sequence of tokens, using the tokenizer.
Splits into words for word-based vocabularies or into sub-words for sub-word-based vocabularies (BPE/SentencePiece/WordPiece). Takes care of added tokens.
- Parameters
text (TextInput) – The sequence to be encoded.
pair (str, optional) – A second sequence to be encoded with the first. Default: None.
add_special_tokens (bool, optional) – Whether to add the special tokens associated with the corresponding model. Default: False.
kwargs (Any, optional) – Will be passed to the underlying model-specific encode method. See details in PreTrainedTokenizerBase.__call__.
- Returns
tokenized_text (List[str]), the list of tokens.
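The sketch below illustrates how added tokens interact with tokenization, and what the split_special_tokens option of this class changes. It uses a toy character-level tokenizer and a regular expression instead of the library's Trie; toy_tokenize is a hypothetical function, not the mindformers implementation.

```python
import re

SPECIAL_TOKENS = ["<s>", "</s>"]
# The capturing group makes re.split keep the special tokens in the result.
_SPECIAL_RE = re.compile("(" + "|".join(re.escape(t) for t in SPECIAL_TOKENS) + ")")


def toy_tokenize(text, split_special_tokens=False):
    """Character-level tokenizer that can keep special tokens whole."""
    if split_special_tokens:
        return [ch for ch in text if not ch.isspace()]
    tokens = []
    for piece in _SPECIAL_RE.split(text):
        if piece in SPECIAL_TOKENS:
            tokens.append(piece)  # keep the special token as one unit
        else:
            tokens.extend(ch for ch in piece if not ch.isspace())
    return tokens


print(toy_tokenize("<s>"))                             # ['<s>']
print(toy_tokenize("<s>", split_special_tokens=True))  # ['<', 's', '>']
```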