mindformers.models.ChatGLM3Tokenizer

class mindformers.models.ChatGLM3Tokenizer(vocab_file, bos_token='<sop>', eos_token='<eop>', end_token='</s>', mask_token='[MASK]', gmask_token='[gMASK]', pad_token='<pad>', unk_token='<unk>', **kwargs)

Construct a ChatGLM3 tokenizer based on byte-level Byte-Pair-Encoding (BPE).

Parameters
  • vocab_file (str) – The vocabulary file path.

  • bos_token (str, tokenizers.AddedToken) – The beginning-of-sequence token that was used during pretraining. Can be used as a sequence classifier token. Default: "<sop>" .

  • eos_token (str, tokenizers.AddedToken) – The end of sequence token. Default: "<eop>" .

  • end_token (str, tokenizers.AddedToken) – The end of sequence token. Default: "</s>" .

  • mask_token (str, tokenizers.AddedToken) – The masked token. Default: "[MASK]" .

  • gmask_token (str, tokenizers.AddedToken) – The special masked token. Default: "[gMASK]".

  • pad_token (str, tokenizers.AddedToken) – A special token used to pad token sequences to the same length for batching. Padded positions are then ignored by attention mechanisms and loss computation. Default: "<pad>" .

  • unk_token (str, tokenizers.AddedToken) – The unknown token. Default: "<unk>" .

  • **kwargs – Additional keyword arguments passed to the tokenizer base class.

Returns

A ChatGLM3Tokenizer instance.

Examples

>>> from mindformers import AutoTokenizer
>>> tokenize = AutoTokenizer.from_pretrained('glm2_6b')
>>> tokenize("你好")  # "你好" means "Hello"
{'input_ids': [64790, 64792, 36474, 54591], 'attention_mask': [1, 1, 1, 1]}
>>> from mindformers import ChatGLM3Tokenizer
>>> tokenizer = ChatGLM3Tokenizer('tokenizer.model')
>>> prompts = ["晚上睡不着应该怎么办"]  # "What should I do if I can't sleep at night?"
>>> token_id = tokenizer(prompts)
>>> input_ids = token_id['input_ids']
>>> print(input_ids)
[[64790, 64792, 30910, 32820, 54266, 31876, 35153]]
>>> response = tokenizer.decode(input_ids)
>>> print(response)
['晚上睡不着应该怎么办']
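
The role of pad_token described under Parameters can be illustrated without loading the tokenizer. The following is a minimal pure-Python sketch, not the library's implementation: the helper pad_batch and the value PAD_ID = 0 are hypothetical and chosen only for illustration, and right-side padding is an assumption. The token IDs are copied from the two example outputs above.

```python
# Hypothetical pad ID, used for illustration only. Look up the ID that the
# real tokenizer assigns to pad_token instead of relying on this value.
PAD_ID = 0


def pad_batch(batch, pad_id=PAD_ID):
    """Right-pad each sequence to the longest length and build attention masks."""
    max_len = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in batch]
    # 1 marks a real token, 0 marks padding that attention and loss should skip.
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Token IDs copied from the two examples above.
batch = [
    [64790, 64792, 36474, 54591],
    [64790, 64792, 30910, 32820, 54266, 31876, 35153],
]
padded = pad_batch(batch)
print(padded["input_ids"][0])       # [64790, 64792, 36474, 54591, 0, 0, 0]
print(padded["attention_mask"][0])  # [1, 1, 1, 1, 0, 0, 0]
```

Both sequences end up with length 7, and the zeros in attention_mask mark the positions a model would ignore. The padding side and pad ID actually used by ChatGLM3Tokenizer may differ from this sketch.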