mindformers.models.glm2.ChatGLM4Tokenizer

class mindformers.models.glm2.ChatGLM4Tokenizer(vocab_file, clean_up_tokenization_spaces=False, encode_special_tokens=False, eos_token='<|endoftext|>', pad_token='<|endoftext|>', **kwargs)

Constructs a ChatGLM4 tokenizer based on byte-level Byte-Pair Encoding (BPE).
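The core idea behind BPE can be illustrated independently of this class: starting from individual characters, the most frequent adjacent pair is repeatedly merged into a single vocabulary token. The sketch below shows one merge step; the helper names are illustrative, not part of the ChatGLM4Tokenizer API.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent token pairs and return the most frequent one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get)

def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with a single merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = list("low lower lowest")
pair = most_frequent_pair(tokens)   # most common adjacent character pair
tokens = merge_pair(tokens, pair)   # one BPE merge step applied
```

A trained tokenizer simply records the sequence of merges learned from a corpus and replays them at encoding time.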

Parameters
  • vocab_file (str) – The vocabulary file path.

  • clean_up_tokenization_spaces (bool, optional) – Whether to delete redundant spaces. Default: False.

  • encode_special_tokens (bool, optional) – Whether to encode the special tokens. Default: False.

  • eos_token (str, tokenizers.AddedToken, optional) – The end-of-sequence token. Default: "<|endoftext|>".

  • pad_token (str, tokenizers.AddedToken, optional) – A special token used to make arrays of tokens the same size for batching purposes. It will then be ignored by attention mechanisms and loss computation. Default: "<|endoftext|>".

  • **kwargs – Other kwargs that will be passed into the base class of the Tokenizer.
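The role of pad_token described above can be sketched in plain Python: shorter sequences in a batch are right-padded with the pad token's id, and an accompanying mask marks which positions hold real tokens so attention and loss can skip the padding. The pad id and helper below are hypothetical stand-ins, not values or functions from MindFormers.

```python
PAD_ID = 0  # hypothetical pad token id standing in for the real "<|endoftext|>" id

def pad_batch(batch, pad_id=PAD_ID):
    """Right-pad every id sequence to the longest one in the batch.

    Returns (input_ids, attention_mask), where the mask is 1 for real
    tokens and 0 for padding positions.
    """
    width = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (width - len(seq)) for seq in batch]
    attention_mask = [[1] * len(seq) + [0] * (width - len(seq)) for seq in batch]
    return input_ids, attention_mask

ids, mask = pad_batch([[101, 102, 103], [104]])
# ids  -> [[101, 102, 103], [104, 0, 0]]
# mask -> [[1, 1, 1], [1, 0, 0]]
```

In practice the tokenizer's `__call__` handles this internally when padding is requested; the sketch only makes the mechanism explicit.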

Returns

A ChatGLM4Tokenizer instance.

Examples

>>> from mindformers import ChatGLM4Tokenizer
>>> tokenizer = ChatGLM4Tokenizer('tokenizer.model')
>>> prompts = ["晚上睡不着应该怎么办"]
>>> token_id = tokenizer(prompts)
>>> input_ids = token_id['input_ids']
>>> print(input_ids)
[[151331, 151333, 101160, 120410, 99379, 103298]]
>>> response = tokenizer.decode(input_ids)
>>> print(response)
['晚上睡不着应该怎么办']