mindformers.models.PreTrainedTokenizerFast
- class mindformers.models.PreTrainedTokenizerFast(*args, **kwargs)[source]
Base class for all fast tokenizers (wrapping HuggingFace tokenizers library).
Handles all the shared methods for tokenization and special tokens, as well as methods for downloading/caching/loading pretrained tokenizers and adding tokens to the vocabulary.
This class also contains the added tokens in a unified way on top of all tokenizers so we don't have to handle the specific vocabulary augmentation methods of the various underlying dictionary structures (BPE, sentencepiece…).
- Parameters
model_max_length (int, optional) – The maximum length (in number of tokens) for the inputs to the transformer model. Set when the tokenizer is loaded with from_pretrained() based on the model's max_model_input_sizes attribute. Default: 1e30.
padding_side (str, optional) – Specifies the side on which the model should have padding applied. Options are ['right', 'left']. The default value is picked from the class attribute of the same name.
truncation_side (str, optional) – Specifies the side on which the model should have truncation applied. Options are ['right', 'left']. The default value is picked from the class attribute of the same name.
chat_template (str, optional) – A Jinja template string used to format lists of chat messages. Default: "{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}".
model_input_names (List[str], optional) – Lists the names of inputs accepted by the forward pass of the model, such as "token_type_ids" or "attention_mask". Defaults to values picked from the class attribute of the same name. Default: None.
bos_token (Union[str, tokenizers.AddedToken], optional) – Represents the beginning of a sentence and is associated with self.bos_token and self.bos_token_id. Default: None.
eos_token (Union[str, tokenizers.AddedToken], optional) – Represents the end of a sentence and is associated with self.eos_token and self.eos_token_id. Default: None.
unk_token (Union[str, tokenizers.AddedToken], optional) – Represents an out-of-vocabulary token and is associated with self.unk_token and self.unk_token_id. Default: None.
sep_token (Union[str, tokenizers.AddedToken], optional) – A special token separating two different sentences in the same input (used by BERT, for example) and is associated with self.sep_token and self.sep_token_id. Default: None.
pad_token (Union[str, tokenizers.AddedToken], optional) – Used to make arrays of tokens the same size for batching purposes; it will be ignored by attention mechanisms and loss computation. It is associated with self.pad_token and self.pad_token_id. Default: None.
cls_token (Union[str, tokenizers.AddedToken], optional) – Represents the class of the input (used by BERT, for example) and is associated with self.cls_token and self.cls_token_id. Default: None.
mask_token (Union[str, tokenizers.AddedToken], optional) – Represents a masked token (used by masked-language-modeling pretraining objectives like BERT) and is associated with self.mask_token and self.mask_token_id. Default: None.
additional_special_tokens (Union[tuple, list, tokenizers.AddedToken], optional) – Lists additional special tokens that are guaranteed to be skipped when decoding with skip_special_tokens set to True. They will be added at the end of the vocabulary if not already part of it. Default: None.
clean_up_tokenization_spaces (bool, optional) – Determines whether to clean up spaces that were added when splitting the input text during tokenization. Default: True.
split_special_tokens (bool, optional) – Specifies whether special tokens should be split during tokenization. This affects the internal state of the tokenizer. By default, special tokens are not split. For example, if <s> is the bos_token, then tokenizer.tokenize("<s>") = ['<s>']. If split_special_tokens = True, then tokenizer.tokenize("<s>") would result in ['<', 's', '>']. Default: False.
tokenizer_object (tokenizers.Tokenizer) – A tokenizers.Tokenizer object from tokenizers to instantiate from.
tokenizer_file (str) – A path to a local JSON file representing a previously serialized tokenizers.Tokenizer object from tokenizers.
- Returns
PreTrainedTokenizerFast instance.
Examples
>>> from transformers import LlamaTokenizerFast
>>>
>>> tokenizer = LlamaTokenizerFast(vocab_file="./llama2/tokenizer.model")
>>> tokenizer.encode("Hello this is a test")
[1, 15043, 445, 338, 263, 1243]
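The default chat_template listed above is a Jinja template. As an illustration only, the following plain-Python sketch (using a hypothetical render_chat helper that is not part of the mindformers API) shows the string that template produces for a list of chat messages:

```python
def render_chat(messages, add_generation_prompt=False):
    # Mimics the default chat_template: each message is wrapped in
    # <|im_start|>role ... <|im_end|> markers, one per line.
    out = ""
    for message in messages:
        out += "<|im_start|>" + message["role"] + "\n" + message["content"] + "<|im_end|>" + "\n"
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        out += "<|im_start|>assistant\n"
    return out

messages = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi there"},
]
print(render_chat(messages, add_generation_prompt=True))
```

Passing add_generation_prompt=True appends an opened assistant turn, which is how chat templates typically prompt the model to generate its reply.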
- property added_tokens_decoder: Dict[int, tokenizers.AddedToken]
Returns the added tokens in the vocabulary as a dictionary of index to AddedToken.
- Returns
A dict, the added tokens.
- property added_tokens_encoder: Dict[str, int]
Returns the sorted mapping from string to index. The added tokens encoder is cached for performance optimisation in self._added_tokens_encoder for the slow tokenizers.
- Returns
A dict, the added tokens.
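A minimal sketch of how the two properties relate: the encoder is the string-to-index inversion of the index-to-token decoder, sorted by index. Plain strings stand in for tokenizers.AddedToken objects here; the index values are hypothetical.

```python
# Toy decoder mapping: index -> token string
# (stands in for the real Dict[int, AddedToken]).
added_tokens_decoder = {32001: "<|im_end|>", 32000: "<|im_start|>"}

# added_tokens_encoder is the inverse mapping, ordered by index.
added_tokens_encoder = {
    tok: idx for idx, tok in sorted(added_tokens_decoder.items())
}

print(added_tokens_encoder)  # {'<|im_start|>': 32000, '<|im_end|>': 32001}
```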
- convert_ids_to_tokens(ids: Union[int, List[int]], skip_special_tokens: bool = False)[source]
Converts a single index or a sequence of indices to a token or a sequence of tokens, using the vocabulary and added tokens.
- convert_tokens_to_ids(tokens: Union[str, List[str]])[source]
Converts a token string (or a sequence of tokens) to a single integer id (or a sequence of ids), using the vocabulary.
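A plain-Python sketch of the lookup behavior these two methods implement, using a toy three-entry vocabulary instead of a real tokenizer (in the real class, the mapping lives inside the backing Rust tokenizer, and unknown tokens resolve to the unk token's id):

```python
# Toy vocabulary: token string -> integer id.
vocab = {"<unk>": 0, "hello": 1, "world": 2}
inv_vocab = {i: t for t, i in vocab.items()}

def convert_tokens_to_ids(tokens):
    # A single string returns a single id; a list returns a list of ids.
    if isinstance(tokens, str):
        return vocab.get(tokens, vocab["<unk>"])
    return [vocab.get(t, vocab["<unk>"]) for t in tokens]

def convert_ids_to_tokens(ids):
    # A single int returns a single token; a list returns a list of tokens.
    if isinstance(ids, int):
        return inv_vocab[ids]
    return [inv_vocab[i] for i in ids]

print(convert_tokens_to_ids(["hello", "world", "xyz"]))  # [1, 2, 0]
print(convert_ids_to_tokens([1, 2]))                     # ['hello', 'world']
```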
- get_added_vocab()[source]
Returns the added tokens in the vocabulary as a dictionary of token to index.
- Returns
A dict, the added tokens.
- num_special_tokens_to_add(pair: bool = False)[source]
Returns the number of added tokens when encoding a sequence with special tokens.
Note
This encodes a dummy input and checks the number of added tokens, and is therefore not efficient. Do not put this inside your training loop.
- Parameters
pair (bool, optional) – Whether the number of added tokens should be computed in the case of a sequence pair or a single sequence. Default: False.
- Returns
int, the number of special tokens added to sequences.
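The note above describes encoding a dummy input and counting what was added. A toy BERT-style sketch of that computation (the bracketed tokens and the build_inputs helper are illustrative, not part of the mindformers API):

```python
def build_inputs(seq_a, seq_b=None):
    # BERT-style formatting: [CLS] a [SEP] for one sequence,
    # [CLS] a [SEP] b [SEP] for a pair.
    out = ["[CLS]"] + seq_a + ["[SEP]"]
    if seq_b is not None:
        out += seq_b + ["[SEP]"]
    return out

def num_special_tokens_to_add(pair=False):
    # Encode a dummy input and count the extra tokens,
    # mirroring how the real method works.
    dummy = ["x"]
    if pair:
        return len(build_inputs(dummy, dummy)) - 2 * len(dummy)
    return len(build_inputs(dummy)) - len(dummy)

print(num_special_tokens_to_add())           # 2: [CLS] and [SEP]
print(num_special_tokens_to_add(pair=True))  # 3: [CLS] and two [SEP]
```

Because a dummy input is actually encoded each call, the real method should not be invoked inside a training loop.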
- set_truncation_and_padding(padding_strategy: PaddingStrategy, truncation_strategy: TruncationStrategy, max_length: int, stride: int, pad_to_multiple_of: Optional[int])[source]
Define the truncation and the padding strategies for fast tokenizers (provided by HuggingFace tokenizers library) and restore the tokenizer settings after.
The provided tokenizer has no padding / truncation strategy before entering the managed section. If your tokenizer had a padding / truncation strategy set before, it will be reset to no padding / truncation when exiting the managed section.
- Parameters
padding_strategy (PaddingStrategy) – The kind of padding that will be applied to the input.
truncation_strategy (TruncationStrategy) – The kind of truncation that will be applied to the input.
max_length (int) – The maximum size of a sequence.
stride (int) – The stride to use when handling overflow.
pad_to_multiple_of (int, optional) – If set, will pad the sequence to a multiple of the provided value. This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.5 (Volta). Default: None.
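The apply-then-restore behavior described above can be sketched as a context manager over a toy tokenizer object. This is not the real implementation (which configures the backing Rust tokenizer); it only illustrates the reset-on-exit contract:

```python
from contextlib import contextmanager

class ToyTokenizer:
    def __init__(self):
        self.truncation = None  # no truncation by default
        self.padding = None     # no padding by default

@contextmanager
def managed_truncation_and_padding(tok, max_length, pad_to_multiple_of=None):
    # Apply the strategies for the duration of the managed section,
    # then reset to no padding / truncation on exit.
    tok.truncation = {"max_length": max_length}
    tok.padding = {"pad_to_multiple_of": pad_to_multiple_of}
    try:
        yield tok
    finally:
        tok.truncation = None
        tok.padding = None

tok = ToyTokenizer()
with managed_truncation_and_padding(tok, max_length=8):
    assert tok.truncation == {"max_length": 8}
print(tok.truncation, tok.padding)  # None None: settings restored on exit
```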
- train_new_from_iterator(text_iterator, vocab_size, length=None, new_special_tokens=None, special_tokens_map=None, **kwargs)[source]
Trains a tokenizer on a new corpus with the same defaults (in terms of special tokens or tokenization pipeline) as the current one.
- Parameters
text_iterator (list) – The training corpus. Should be a generator of batches of texts, for instance a list of lists of texts if you have everything in memory.
vocab_size (int) – The size of the vocabulary you want for your tokenizer.
length (int, optional) – The total number of sequences in the iterator. This is used to provide meaningful progress tracking. Default: None.
new_special_tokens (Union[list, AddedToken], optional) – A list of new special tokens to add to the tokenizer you are training. Default: None.
special_tokens_map (dict, optional) – If you want to rename some of the special tokens this tokenizer uses, pass a mapping from old special token name to new special token name in this argument. Default: None.
kwargs (Any, optional) – Additional keyword arguments.
- Returns
PreTrainedTokenizerFast, a new tokenizer of the same type as the original one, trained on text_iterator.
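To make the iterator and vocab_size contract concrete, here is a deliberately simplified word-level sketch: it consumes batches of texts from the iterator and keeps the most frequent tokens up to vocab_size. The real method retrains the full subword pipeline (BPE, unigram, etc.); this toy train_vocab_from_iterator helper is purely illustrative.

```python
from collections import Counter

def train_vocab_from_iterator(text_iterator, vocab_size, new_special_tokens=None):
    # The iterator yields batches of texts, e.g. a list of lists of strings.
    counts = Counter()
    for batch in text_iterator:
        for text in batch:
            counts.update(text.split())
    # Special tokens come first, then the most frequent corpus tokens,
    # so that the final vocabulary has at most vocab_size entries.
    specials = list(new_special_tokens or [])
    most_common = [w for w, _ in counts.most_common(vocab_size - len(specials))]
    return {tok: i for i, tok in enumerate(specials + most_common)}

corpus = [["the cat sat", "the cat ran"], ["a dog sat"]]
vocab = train_vocab_from_iterator(corpus, vocab_size=4, new_special_tokens=["<unk>"])
print(vocab)  # {'<unk>': 0, 'the': 1, 'cat': 2, 'sat': 3}
```

Streaming the corpus as batches, rather than one flat list of texts, is what lets the real method train on corpora too large to hold in memory.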