mindformers.AutoTokenizer

class mindformers.AutoTokenizer[source]

This is a generic tokenizer class that will be instantiated as one of the tokenizer classes of the library when created with the from_pretrained class method. This class cannot be instantiated directly using __init__() (throws an error).

Examples

>>> from mindformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("gpt2")
classmethod from_pretrained(yaml_name_or_path, *args, **kwargs)[source]

Instantiates a tokenizer from a local directory or from a model_id on modelers.cn.

Warning

The API is experimental and may have slight breaking changes in future releases.

Parameters
  • yaml_name_or_path (str) – a directory containing a YAML file, a directory containing a JSON file, or a model_id from modelers.cn. The latter two are experimental features.

  • args (Any, optional) – Additional positional arguments passed along to the underlying tokenizer's __init__() method. Only effective in experimental mode.

  • kwargs (Dict[str, Any], optional) – Values in kwargs for keys that are configuration attributes will override the loaded values.

Returns

A tokenizer.
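For example, a tokenizer can be loaded from a local directory containing the model's YAML file, and configuration attributes can be overridden through kwargs. In the sketch below, the directory path ./my_gpt2_dir and the padding_side override are illustrative assumptions, not values shipped with the library.

>>> from mindformers import AutoTokenizer
>>> # Load from a local directory containing a YAML config file (placeholder path)
>>> tokenizer = AutoTokenizer.from_pretrained("./my_gpt2_dir")
>>> # Override a configuration attribute of the loaded tokenizer via kwargs
>>> tokenizer = AutoTokenizer.from_pretrained("gpt2", padding_side="left")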

static register(config_class, slow_tokenizer_class=None, fast_tokenizer_class=None, exist_ok=False)[source]

Register new tokenizer classes for this auto class.

Warning

The API is experimental and may have slight breaking changes in future releases.

Parameters
  • config_class (PretrainedConfig) – The model config class.

  • slow_tokenizer_class (PreTrainedTokenizer, optional) – The slow tokenizer class. Default: None.

  • fast_tokenizer_class (PreTrainedTokenizerFast, optional) – The fast tokenizer class. Default: None.

  • exist_ok (bool, optional) – If set to True, no error will be raised even if config_class already exists. Default: False.
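A minimal usage sketch follows. MyConfig and MyTokenizer are hypothetical user-defined classes, assumed to subclass PretrainedConfig and PreTrainedTokenizer respectively; they are not part of the library.

>>> from mindformers import AutoTokenizer
>>> # Register the hypothetical tokenizer so AutoTokenizer can resolve it from MyConfig
>>> AutoTokenizer.register(MyConfig, slow_tokenizer_class=MyTokenizer)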