mindspore.dataset.text.SentencePieceTokenizer
- class mindspore.dataset.text.SentencePieceTokenizer(mode, out_type)[source]
Tokenize scalar token or 1-D tokens to tokens by sentencepiece.
- Parameters
mode (Union[str, SentencePieceVocab]) – SentencePiece model. If the input parameter is a file, it represents the path of SentencePiece mode to be loaded. If the input parameter is a SentencePieceVocab object, it should be constructed in advanced.
out_type (SPieceTokenizerOutType) –
The type of output, it can be
SPieceTokenizerOutType.STRING
,SPieceTokenizerOutType.INT
.SPieceTokenizerOutType.STRING
, means output type of SentencePice Tokenizer is string.SPieceTokenizerOutType.INT
, means output type of SentencePice Tokenizer is int.
- Raises
- Supported Platforms:
CPU
Examples
>>> import mindspore.dataset as ds >>> import mindspore.dataset.text as text >>> from mindspore.dataset.text import SentencePieceModel, SPieceTokenizerOutType >>> >>> # Use the transform in dataset pipeline mode >>> numpy_slices_dataset = ds.NumpySlicesDataset(data=['Hello world'], column_names=["text"]) >>> # The paths to sentence_piece_vocab_file can be downloaded directly from the mindspore repository. Refer to >>> # https://gitee.com/mindspore/mindspore/blob/r2.3.q1/tests/ut/data/dataset/test_sentencepiece/vocab.txt >>> sentence_piece_vocab_file = "tests/ut/data/dataset/test_sentencepiece/vocab.txt" >>> vocab = text.SentencePieceVocab.from_file([sentence_piece_vocab_file], 512, 0.9995, ... SentencePieceModel.UNIGRAM, {}) >>> tokenizer = text.SentencePieceTokenizer(vocab, out_type=SPieceTokenizerOutType.STRING) >>> numpy_slices_dataset = numpy_slices_dataset.map(operations=tokenizer) >>> for item in numpy_slices_dataset.create_dict_iterator(num_epochs=1, output_numpy=True): ... print(item["text"]) ['▁H' 'e' 'l' 'lo' '▁w' 'o' 'r' 'l' 'd'] >>> >>> # Use the transform in eager mode >>> data = "Hello world" >>> vocab = text.SentencePieceVocab.from_file([sentence_piece_vocab_file], 100, 0.9995, ... SentencePieceModel.UNIGRAM, {}) >>> output = text.SentencePieceTokenizer(vocab, out_type=SPieceTokenizerOutType.STRING)(data) >>> print(output) ['▁' 'H' 'e' 'l' 'l' 'o' '▁' 'w' 'o' 'r' 'l' 'd']
- Tutorial Examples: