Functional differences compared with torchtext.data.functional.sentencepiece_numericalizer
torchtext.data.functional.sentencepiece_numericalizer
torchtext.data.functional.sentencepiece_numericalizer(
    sp_model
)
For more details, see torchtext.data.functional.sentencepiece_numericalizer.
mindspore.dataset.text.SentencePieceTokenizer
class mindspore.dataset.text.SentencePieceTokenizer(
    mode,
    out_type
)
Usage
PyTorch: takes a sentencepiece model and returns a generator that converts text into ids according to that model.
MindSpore: tokenizes the input text with the given sentencepiece model; the output type is string or int.
Code Example
import mindspore.dataset as ds
from mindspore.dataset import text
from mindspore.dataset.text import SentencePieceModel, SPieceTokenizerOutType
from torchtext.data.functional import sentencepiece_numericalizer
from torchtext.data.functional import load_sp_model
# In MindSpore, build a vocab from file, then construct the tokenizer from the vocab object.
sentence_piece_vocab_file = "/path/to/datasets/1.txt"
vocab = text.SentencePieceVocab.from_file(
    [sentence_piece_vocab_file],
    27,
    0.9995,
    SentencePieceModel.UNIGRAM,
    {})
tokenizer = text.SentencePieceTokenizer(vocab, out_type=SPieceTokenizerOutType.INT)
text_file_dataset_dir = "/path/to/datasets/2.txt"
text_file_dataset1 = ds.TextFileDataset(dataset_files=text_file_dataset_dir)
text_file_dataset = text_file_dataset1.map(operations=tokenizer)
for item in text_file_dataset:
    print(item[0])
    break
# Out:
# [ 165 28 8 11 4746 1430 4]
# In torch, load the sentencepiece model from the given model file path.
root = "/path/to/m_user.model"
sp_model = load_sp_model(root)
# sentencepiece_numericalizer returns a generator that converts text into ids.
sp_id_generator = sentencepiece_numericalizer(sp_model)
list_a = ["sentencepiece encode as pieces", "examples to try!"]
print(list(sp_id_generator(list_a)))