mindformers.models.ChatGLM3Tokenizer
- class mindformers.models.ChatGLM3Tokenizer(vocab_file, bos_token='<sop>', eos_token='<eop>', end_token='</s>', mask_token='[MASK]', gmask_token='[gMASK]', pad_token='<pad>', unk_token='<unk>', **kwargs)
Constructs a ChatGLM3 tokenizer, based on byte-level Byte-Pair Encoding.
- Parameters
vocab_file (str) – The vocabulary file path.
bos_token (str, tokenizers.AddedToken) – The beginning of sequence token that was used during pretraining. Can be used as a sequence classifier token. Default: "<sop>" .
eos_token (str, tokenizers.AddedToken) – The end of sequence token. Default: "<eop>" .
end_token (str, tokenizers.AddedToken) – The end of sequence token. Default: "</s>" .
mask_token (str, tokenizers.AddedToken) – The masked token. Default: "[MASK]" .
gmask_token (str, tokenizers.AddedToken) – The special masked token. Default: "[gMASK]" .
pad_token (str, tokenizers.AddedToken) – A special token used to make arrays of tokens the same size for batching purposes. It is then ignored by attention mechanisms and loss computation. Default: "<pad>" .
unk_token (str, tokenizers.AddedToken) – The unknown token. Default: "<unk>" .
**kwargs – Other kwargs that will be passed into the base class of the Tokenizer.
- Returns
A ChatGLM3Tokenizer instance.
Examples
>>> from mindformers import AutoTokenizer
>>> tokenize = AutoTokenizer.from_pretrained('glm2_6b')
>>> tokenize("你好")
{'input_ids': [64790, 64792, 36474, 54591], 'attention_mask': [1, 1, 1, 1]}
>>> from mindformers import ChatGLM3Tokenizer
>>> tokenizer = ChatGLM3Tokenizer('tokenizer.model')
>>> prompts = ["晚上睡不着应该怎么办"]
>>> token_id = tokenizer(prompts)
>>> input_ids = token_id['input_ids']
>>> print(input_ids)
[[64790, 64792, 30910, 32820, 54266, 31876, 35153]]
>>> response = tokenizer.decode(input_ids)
>>> print(response)
['晚上睡不着应该怎么办']
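The pad_token description says that padding makes token arrays the same size for batching and that padded positions are ignored by attention. The following is a minimal, library-independent sketch of that idea. The helper name pad_batch, the pad id 0, and the padding_side argument are hypothetical choices for illustration and are not part of the mindformers API; in practice the tokenizer's own padding support would normally be used.

```python
from typing import Dict, List


def pad_batch(
    sequences: List[List[int]],
    pad_id: int = 0,  # assumed pad id, for illustration only
    padding_side: str = "left",
) -> Dict[str, List[List[int]]]:
    """Pad token-id lists to a common length and build matching attention masks."""
    if padding_side not in ("left", "right"):
        raise ValueError("padding_side must be 'left' or 'right'")

    # Every sequence is padded up to the length of the longest one.
    max_len = max((len(seq) for seq in sequences), default=0)

    input_ids: List[List[int]] = []
    attention_mask: List[List[int]] = []
    for seq in sequences:
        n_pad = max_len - len(seq)
        pads = [pad_id] * n_pad
        real = [1] * len(seq)   # 1 marks a real token
        masked = [0] * n_pad    # 0 marks a padded position to be ignored
        if padding_side == "left":
            input_ids.append(pads + list(seq))
            attention_mask.append(masked + real)
        else:
            input_ids.append(list(seq) + pads)
            attention_mask.append(real + masked)

    return {"input_ids": input_ids, "attention_mask": attention_mask}


if __name__ == "__main__":
    # Token ids copied from the examples above; the pad id is an assumption.
    batch = pad_batch(
        [
            [64790, 64792, 36474, 54591],
            [64790, 64792, 30910, 32820, 54266, 31876, 35153],
        ],
        pad_id=0,
    )
    for ids, mask in zip(batch["input_ids"], batch["attention_mask"]):
        print(ids, mask)
```

The attention mask has the same shape as the padded ids, so a model can skip the padded positions without needing to know which id was used for padding.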