mindspore.dataset.text.SentencePieceTokenizer
- class mindspore.dataset.text.SentencePieceTokenizer(mode, out_type)
Tokenizes input strings with a SentencePiece tokenizer.
- Parameters:
mode (Union[str, SentencePieceVocab]) - The SentencePiece model. If a string is given, it is the path of the SentencePiece model file to load; if a SentencePieceVocab is given, it must be an already constructed SentencePieceVocab object.
out_type (SPieceTokenizerOutType) - The output type of the tokenizer. It can be SPieceTokenizerOutType.STRING or SPieceTokenizerOutType.INT.
SPieceTokenizerOutType.STRING means the SentencePiece tokenizer outputs str.
SPieceTokenizerOutType.INT means the SentencePiece tokenizer outputs int.
- Raises:
TypeError - If mode is neither a string nor a mindspore.dataset.text.SentencePieceVocab.
TypeError - If out_type is not a mindspore.dataset.text.SPieceTokenizerOutType.
- Supported Platforms:
CPU
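A minimal sketch of how the two TypeError conditions listed above might be observed in practice. The helper names `try_build_tokenizer` and `demo` and the placeholder path are hypothetical, not part of MindSpore. The sketch also assumes that the type checks run when the tokenizer is constructed, which this page does not state explicitly.

```python
def try_build_tokenizer(mode, out_type):
    """Hypothetical helper: build a SentencePieceTokenizer, or report the TypeError.

    Returns (tokenizer, None) on success and (None, error_message) if the
    constructor rejects the argument types.
    """
    # Imported lazily so that defining the helper does not require MindSpore.
    import mindspore.dataset.text as text

    try:
        return text.SentencePieceTokenizer(mode, out_type=out_type), None
    except TypeError as err:
        return None, str(err)


def demo():
    """Call the helper with arguments of the wrong type (requires MindSpore)."""
    from mindspore.dataset.text import SPieceTokenizerOutType

    # mode is an int: neither a str path nor a SentencePieceVocab.
    print(try_build_tokenizer(123, SPieceTokenizerOutType.STRING))

    # out_type is a plain string instead of an SPieceTokenizerOutType member.
    # "path/to/model" is a placeholder; it is assumed not to be opened here.
    print(try_build_tokenizer("path/to/model", "STRING"))
```

Calling `demo()` in an environment with MindSpore installed should print the error message for each rejected argument. The exact wording of the messages is not documented here, so none is shown.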
Examples:
>>> import mindspore.dataset as ds
>>> import mindspore.dataset.text as text
>>> from mindspore.dataset.text import SentencePieceModel, SPieceTokenizerOutType
>>>
>>> # Use the transform in dataset pipeline mode
>>> numpy_slices_dataset = ds.NumpySlicesDataset(data=['Hello world'], column_names=["text"])
>>> # The paths to sentence_piece_vocab_file can be downloaded directly from the mindspore repository. Refer to
>>> # https://gitee.com/mindspore/mindspore/blob/r2.3.q1/tests/ut/data/dataset/test_sentencepiece/vocab.txt
>>> sentence_piece_vocab_file = "tests/ut/data/dataset/test_sentencepiece/vocab.txt"
>>> vocab = text.SentencePieceVocab.from_file([sentence_piece_vocab_file], 512, 0.9995,
...                                           SentencePieceModel.UNIGRAM, {})
>>> tokenizer = text.SentencePieceTokenizer(vocab, out_type=SPieceTokenizerOutType.STRING)
>>> numpy_slices_dataset = numpy_slices_dataset.map(operations=tokenizer)
>>> for item in numpy_slices_dataset.create_dict_iterator(num_epochs=1, output_numpy=True):
...     print(item["text"])
['▁H' 'e' 'l' 'lo' '▁w' 'o' 'r' 'l' 'd']
>>>
>>> # Use the transform in eager mode
>>> data = "Hello world"
>>> vocab = text.SentencePieceVocab.from_file([sentence_piece_vocab_file], 100, 0.9995,
...                                           SentencePieceModel.UNIGRAM, {})
>>> output = text.SentencePieceTokenizer(vocab, out_type=SPieceTokenizerOutType.STRING)(data)
>>> print(output)
['▁' 'H' 'e' 'l' 'l' 'o' '▁' 'w' 'o' 'r' 'l' 'd']
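The examples above only use SPieceTokenizerOutType.STRING. As a sketch of the other documented option, the helper below tokenizes the same sentence with both output types in eager mode. The function name `tokenize_both_ways` and its default arguments are illustrative assumptions; the MindSpore calls mirror the ones in the examples above. No output is shown, because the ids depend on the trained vocabulary.

```python
def tokenize_both_ways(vocab_file, sentence="Hello world", vocab_size=100):
    """Hypothetical helper: tokenize `sentence` as string pieces and as int ids.

    `vocab_file` is the path to a text file used to build the vocabulary,
    for example the vocab.txt referenced in the examples above.
    """
    # Imported lazily so that defining the helper does not require MindSpore.
    import mindspore.dataset.text as text
    from mindspore.dataset.text import SentencePieceModel, SPieceTokenizerOutType

    # Build the vocabulary once and share it between both tokenizers.
    vocab = text.SentencePieceVocab.from_file(
        [vocab_file], vocab_size, 0.9995, SentencePieceModel.UNIGRAM, {}
    )

    # Eager mode: call the transform directly on the input string.
    to_pieces = text.SentencePieceTokenizer(vocab, out_type=SPieceTokenizerOutType.STRING)
    to_ids = text.SentencePieceTokenizer(vocab, out_type=SPieceTokenizerOutType.INT)

    return to_pieces(sentence), to_ids(sentence)
```

With MindSpore installed, `tokenize_both_ways("tests/ut/data/dataset/test_sentencepiece/vocab.txt")` should return the string pieces and the integer ids for the same sentence. Because both tokenizers share one vocabulary, each piece is expected to correspond to one id.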
- Tutorial Examples: