Function differences with torchtext.data.functional.load_sp_model
torchtext.data.functional.load_sp_model
torchtext.data.functional.load_sp_model(
    spm
)
For more information, see torchtext.data.functional.load_sp_model.
mindspore.dataset.text.SentencePieceVocab
class mindspore.dataset.text.SentencePieceVocab
For more information, see mindspore.dataset.text.SentencePieceVocab.
Differences
PyTorch: Loads a SentencePiece model from a file. The input is the path to a SentencePiece model (or a file object opened on it), and the output is the loaded SentencePiece model.
MindSpore: SentencePieceVocab is a vocabulary object used to perform word segmentation. It can be built from a dataset object or from a vocabulary file.
Code Example
import mindspore.dataset as ds
from mindspore.dataset import text
from mindspore.dataset.text import SentencePieceModel, SPieceTokenizerOutType
from torchtext.data.functional import load_sp_model
# In MindSpore, build a vocab object from a file, then create a tokenizer from that vocab.
sentence_piece_vocab_file = "/path/to/test_sentencepiece/botchan.txt"
vocab = text.SentencePieceVocab.from_file([sentence_piece_vocab_file], 500, 0.9995,
                                          SentencePieceModel.WORD, {})
tokenizer = text.SentencePieceTokenizer(vocab, out_type=SPieceTokenizerOutType.STRING)
text_file_dataset_dir = "/path/to/testTokenizerData/sentencepiece_tokenizer.txt"
text_file_dataset = ds.TextFileDataset(dataset_files=text_file_dataset_dir)
text_file_dataset = text_file_dataset.map(operations=tokenizer)
for i in text_file_dataset.create_dict_iterator(num_epochs=1, output_numpy=True):
    ret = i["text"]
    for value in ret:
        print(value)
# Out:
# ▁I
# ▁saw
# ▁a
# ▁girl
# ▁with
# ▁a
# ▁telescope.
# In torchtext, load the SentencePiece model from the given model path...
sp_model = load_sp_model("m_user.model")
# ...or from a file object opened in binary mode.
with open("m_user.model", "rb") as model_file:
    sp_model = load_sp_model(model_file)
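The "▁" prefix in the MindSpore output above is the meta symbol (U+2581) that SentencePiece uses in place of a space to mark where a word begins. The following is a minimal pure-Python sketch of that convention, assuming pieces that coincide with whitespace-delimited words, as in the output of the WORD model shown above. The helper names word_pieces and join_pieces are hypothetical and are not part of MindSpore, torchtext, or SentencePiece. A real tokenizer depends on its trained vocabulary and may segment text differently.

```python
# Toy illustration only: not the SentencePiece algorithm and not a library API.
WORD_BOUNDARY = "\u2581"  # "▁", the meta symbol used in place of a space


def word_pieces(sentence):
    """Split on whitespace and prefix each word with the boundary marker."""
    return [WORD_BOUNDARY + word for word in sentence.split()]


def join_pieces(pieces):
    """Invert word_pieces: turn markers back into spaces, drop the leading one."""
    return "".join(pieces).replace(WORD_BOUNDARY, " ").lstrip(" ")


pieces = word_pieces("I saw a girl with a telescope.")
for piece in pieces:
    print(piece)
```

Because the marker records where each space was, joining the pieces and replacing the marker with a space recovers the original sentence, which join_pieces(pieces) demonstrates.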