Differences from torchtext.data.functional.sentencepiece_tokenizer
torchtext.data.functional.sentencepiece_tokenizer
torchtext.data.functional.sentencepiece_tokenizer(
sp_model
)
mindspore.dataset.text.SentencePieceTokenizer
class mindspore.dataset.text.SentencePieceTokenizer(
mode,
out_type
)
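Usage
PyTorch: Given a loaded tokenization model, returns a generator function that tokenizes each input sentence into string pieces.
MindSpore: Given a tokenization model, tokenizes the input text; the output tokens are of type string or int, as selected by out_type.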
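Category | Subcategory | PyTorch | MindSpore | Difference |
---|---|---|---|---|
Parameters | Parameter 1 | sp_model | mode | MindSpore accepts either a SentencePiece vocabulary or the path to a tokenization model |
Parameters | Parameter 2 | - | out_type | Output type of the tokenizer |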
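The difference in calling shape can be sketched without either library installed. The snippet below is only an illustration under stated assumptions: `toy_pieces`, `torchtext_style_tokenizer`, and `MindSporeStyleTokenizer` are hypothetical stand-ins written for this page, not part of torchtext or MindSpore, and the toy encoder merely splits on spaces instead of running a real SentencePiece model. The real APIs also return library-specific containers rather than the plain lists used here.

```python
from typing import Callable, Iterable, Iterator, List


def toy_pieces(text: str) -> List[str]:
    """Hypothetical stand-in for a SentencePiece model: split on spaces and
    prefix each word with the word-boundary marker."""
    return ["\u2581" + word for word in text.split()]


def torchtext_style_tokenizer(
    encode: Callable[[str], List[str]],
) -> Callable[[Iterable[str]], Iterator[List[str]]]:
    """Mimics the torchtext shape: returns a function that maps an iterable
    of sentences to a generator of piece lists."""

    def _apply(sentences: Iterable[str]) -> Iterator[List[str]]:
        for sentence in sentences:
            yield encode(sentence)

    return _apply


class MindSporeStyleTokenizer:
    """Mimics the MindSpore shape: a callable object applied to one sample."""

    def __init__(self, encode: Callable[[str], List[str]]) -> None:
        self._encode = encode

    def __call__(self, sentence: str) -> List[str]:
        return self._encode(sentence)


sentence = "sentencepiece encode as pieces"

# torchtext-like: pass an iterable of sentences, consume a generator.
batch_result = list(torchtext_style_tokenizer(toy_pieces)([sentence]))

# MindSpore-like: call the object on a single sentence.
single_result = MindSporeStyleTokenizer(toy_pieces)(sentence)

print(batch_result)
print(single_result)
```

With one input sentence, the generator-based form yields a list containing one piece list, while the per-sample form yields the piece list directly. This matches the nesting difference visible in the outputs of the code example on this page.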
Code example
# Download the SentencePiece BPE model used by both examples below
from download import download
url = "https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/sentencepiece.bpe.model"
download(url, './sentencepiece.bpe.model', replace=True)
# PyTorch
from torchtext.data.functional import load_sp_model, sentencepiece_tokenizer
list_a = "sentencepiece encode as pieces"
model = load_sp_model("./sentencepiece.bpe.model")
sp_id_generator = sentencepiece_tokenizer(model)
print(list(sp_id_generator([list_a])))
# Out: [['▁sentence', 'piece', '▁en', 'code', '▁as', '▁pieces']]
# MindSpore
import mindspore.dataset.text as text
sp_id_generator = text.SentencePieceTokenizer("./sentencepiece.bpe.model", out_type=text.SPieceTokenizerOutType.STRING)
print(list(sp_id_generator(list_a)))
# Out: ['▁sentence', 'piece', '▁en', 'code', '▁as', '▁pieces']