mindspore.dataset.text.SentencePieceTokenizer

class mindspore.dataset.text.SentencePieceTokenizer(mode, out_type)

Tokenize a scalar token or a 1-D array of tokens into subword tokens using SentencePiece.

Parameters
  • mode (Union[str, SentencePieceVocab]) – SentencePiece model. If a string is given, it is the path of the SentencePiece model file to load. If a SentencePieceVocab object is given, it must be constructed in advance.

  • out_type (SPieceTokenizerOutType) –

    The type of output. It can be SPieceTokenizerOutType.STRING or SPieceTokenizerOutType.INT.

    • SPieceTokenizerOutType.STRING means the output type of the SentencePiece tokenizer is string.

    • SPieceTokenizerOutType.INT means the output type of the SentencePiece tokenizer is int.

Raises
  • TypeError – If mode is not of type string or SentencePieceVocab.

  • TypeError – If out_type is not of type SPieceTokenizerOutType.

Supported Platforms:

CPU

Examples

>>> import mindspore.dataset as ds
>>> import mindspore.dataset.text as text
>>> from mindspore.dataset.text import SentencePieceModel, SPieceTokenizerOutType
>>>
>>> sentence_piece_vocab_file = "/path/to/sentence/piece/vocab/file"
>>> vocab = text.SentencePieceVocab.from_file([sentence_piece_vocab_file], 5000, 0.9995,
...                                           SentencePieceModel.UNIGRAM, {})
>>> tokenizer = text.SentencePieceTokenizer(vocab, out_type=SPieceTokenizerOutType.STRING)
>>>
>>> text_file_list = ["/path/to/text_file_dataset_file"]
>>> text_file_dataset = ds.TextFileDataset(dataset_files=text_file_list)
>>> text_file_dataset = text_file_dataset.map(operations=tokenizer)
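
The difference between the two out_type values can be illustrated without a trained model. The following is a toy sketch in plain Python, not MindSpore code: OutType, TOY_VOCAB and toy_tokenize are hypothetical names invented for this illustration, and the greedy longest-match segmentation is a simple stand-in, not the algorithm SentencePiece actually uses. It only shows that string output yields the subword pieces themselves, while int output yields their vocabulary ids.

```python
from enum import Enum


class OutType(Enum):
    """Hypothetical stand-in for SPieceTokenizerOutType."""
    STRING = "string"
    INT = "int"


# Hypothetical toy vocabulary mapping piece -> id.
# A real SentencePiece model learns its vocabulary from data.
TOY_VOCAB = {"<unk>": 0, "▁hello": 1, "▁wor": 2, "ld": 3, "▁": 4}


def toy_tokenize(sentence, out_type):
    """Greedy longest-match segmentation (illustration only)."""
    # Mark word boundaries with "▁", as SentencePiece does for whitespace.
    text = "▁" + sentence.replace(" ", "▁")
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest candidate substring first.
        for j in range(len(text), i, -1):
            if text[i:j] in TOY_VOCAB:
                pieces.append(text[i:j])
                i = j
                break
        else:
            # No vocabulary entry starts here: emit <unk> and skip one character.
            pieces.append("<unk>")
            i += 1
    if out_type is OutType.STRING:
        return pieces
    return [TOY_VOCAB[p] for p in pieces]


string_out = toy_tokenize("hello world", OutType.STRING)
int_out = toy_tokenize("hello world", OutType.INT)
print(string_out)  # ['▁hello', '▁wor', 'ld']
print(int_out)     # [1, 2, 3]
```

With the real operator, the same choice is made by passing SPieceTokenizerOutType.STRING or SPieceTokenizerOutType.INT as out_type; the actual pieces and ids depend on the trained model.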