mindspore.dataset.text.transforms.SentencePieceTokenizer

class mindspore.dataset.text.transforms.SentencePieceTokenizer(mode, out_type)

Tokenize a scalar token or a 1-D tensor of tokens into subword tokens using SentencePiece.

Parameters

mode (Union[str, SentencePieceVocab]) – The source of the SentencePiece vocabulary. Pass a string if it is a file path, or a SentencePieceVocab object if the vocabulary is already loaded.
out_type (SPieceTokenizerOutType) –
The type of the output tokens. It can be either SPieceTokenizerOutType.STRING or SPieceTokenizerOutType.INT.
- SPieceTokenizerOutType.STRING: the SentencePiece tokenizer outputs tokens as strings.
- SPieceTokenizerOutType.INT: the SentencePiece tokenizer outputs tokens as ints.
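
To make the difference between the two out_type values concrete, here is a toy, pure-Python sketch. It does not use MindSpore or the real SentencePiece algorithm: OutType, TOY_VOCAB and toy_tokenize are hypothetical names invented for illustration, the vocabulary is hand-written rather than trained, and the segmentation is a simple greedy longest match. The only detail borrowed from SentencePiece is the "▁" (U+2581) marker that stands in for whitespace.

```python
from enum import Enum


class OutType(Enum):
    """Hypothetical stand-in for SPieceTokenizerOutType."""
    STRING = "string"
    INT = "int"


# Hypothetical hand-written vocabulary (piece -> id).
# A real SentencePiece model learns its pieces from a corpus.
TOY_VOCAB = {"<unk>": 0, "▁hello": 1, "▁wor": 2, "ld": 3, "▁": 4}


def toy_tokenize(sentence, out_type):
    """Greedy longest-match segmentation (NOT the unigram or BPE algorithm)."""
    # SentencePiece represents whitespace with the "▁" marker.
    text = "▁" + sentence.replace(" ", "▁")
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest candidate substring first.
        for j in range(len(text), i, -1):
            if text[i:j] in TOY_VOCAB:
                pieces.append(text[i:j])
                i = j
                break
        else:
            # No vocabulary piece starts here: emit the unknown piece.
            pieces.append("<unk>")
            i += 1
    if out_type is OutType.STRING:
        return pieces                      # subword strings
    return [TOY_VOCAB[p] for p in pieces]  # vocabulary ids


print(toy_tokenize("hello world", OutType.STRING))  # ['▁hello', '▁wor', 'ld']
print(toy_tokenize("hello world", OutType.INT))     # [1, 2, 3]
```

Both calls produce the same segmentation. Only the representation of each piece changes, which is the distinction the out_type parameter describes.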

Examples

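>>> import mindspore.dataset as ds
>>> import mindspore.dataset.text as text
>>> from mindspore.dataset.text import SentencePieceModel, SPieceTokenizerOutType
>>> sentence_piece_vocab_file = "/path/to/sentence/piece/vocab/file"
>>> # Build a unigram vocabulary of 5000 pieces with 0.9995 character coverage.
>>> vocab = text.SentencePieceVocab.from_file([sentence_piece_vocab_file], 5000, 0.9995,
...                                           SentencePieceModel.UNIGRAM, {})
>>> tokenizer = text.SentencePieceTokenizer(vocab, out_type=SPieceTokenizerOutType.STRING)
>>> # Placeholder path: point this at the text file(s) to tokenize.
>>> text_file_dataset = ds.TextFileDataset("/path/to/text_file_dataset_file")
>>> text_file_dataset = text_file_dataset.map(operations=tokenizer)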