mindspore.dataset.text.SentencePieceTokenizer
- class mindspore.dataset.text.SentencePieceTokenizer(mode, out_type)[source]
Tokenize a scalar token or 1-D tokens into subword tokens using SentencePiece.
- Parameters
mode (Union[str, SentencePieceVocab]) – SentencePiece model. If the input parameter is a string, it represents the path of the SentencePiece model file to be loaded. If the input parameter is a SentencePieceVocab object, it should be constructed in advance.
out_type (SPieceTokenizerOutType) –
The type of output, can be SPieceTokenizerOutType.STRING or SPieceTokenizerOutType.INT.
SPieceTokenizerOutType.STRING, means the output type of the SentencePiece tokenizer is string.
SPieceTokenizerOutType.INT, means the output type of the SentencePiece tokenizer is int.
- Raises
TypeError – If mode is not of type str or SentencePieceVocab.
TypeError – If out_type is not of type SPieceTokenizerOutType.
- Supported Platforms:
CPU
Examples
>>> import mindspore.dataset as ds
>>> import mindspore.dataset.text as text
>>> from mindspore.dataset.text import SentencePieceModel, SPieceTokenizerOutType
>>>
>>> sentence_piece_vocab_file = "/path/to/sentence/piece/vocab/file"
>>> vocab = text.SentencePieceVocab.from_file([sentence_piece_vocab_file], 5000, 0.9995,
...                                           SentencePieceModel.UNIGRAM, {})
>>> tokenizer = text.SentencePieceTokenizer(vocab, out_type=SPieceTokenizerOutType.STRING)
>>>
>>> text_file_list = ["/path/to/text_file_dataset_file"]
>>> text_file_dataset = ds.TextFileDataset(dataset_files=text_file_list)
>>> text_file_dataset = text_file_dataset.map(operations=tokenizer)
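To illustrate what the out_type choice controls, here is a minimal stand-in sketch that does not use MindSpore or SentencePiece: ToyTokenizer, its whitespace splitting, and its tiny vocabulary are illustrative assumptions (a real SentencePiece model learns subword pieces from a corpus), but the STRING-vs-INT output contract mirrors the parameter described above.

```python
from enum import Enum

class OutType(Enum):
    # Stand-in for SPieceTokenizerOutType: emit string pieces or integer ids
    STRING = 0
    INT = 1

class ToyTokenizer:
    """Hypothetical tokenizer mimicking the out_type behavior only."""
    def __init__(self, vocab, out_type):
        self.vocab = vocab        # piece -> id mapping
        self.unk_id = 0           # id used for out-of-vocabulary pieces
        self.out_type = out_type

    def __call__(self, sentence):
        pieces = sentence.split()  # real SentencePiece uses learned subwords
        if self.out_type is OutType.STRING:
            return pieces          # list of string tokens
        return [self.vocab.get(p, self.unk_id) for p in pieces]

vocab = {"hello": 1, "world": 2}
print(ToyTokenizer(vocab, OutType.STRING)("hello world"))  # ['hello', 'world']
print(ToyTokenizer(vocab, OutType.INT)("hello world"))     # [1, 2]
```

With STRING output the tokens can be inspected or post-processed as text; with INT output they are ready to feed into an embedding lookup, which is why both modes are offered.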