mindspore.dataset.text

This module supports text processing for NLP. It includes two parts: transforms and utils. transforms is a high-performance NLP text processing module developed with ICU4C and cppjieba; utils provides general-purpose methods for NLP text processing.

mindspore.dataset.text.transforms

API Name

Description

Note

mindspore.dataset.text.transforms.BasicTokenizer

Tokenize a scalar tensor of UTF-8 string by specific rules.

BasicTokenizer is not supported on Windows platform yet.

mindspore.dataset.text.transforms.BertTokenizer

Tokenizer used for BERT text processing.

BertTokenizer is not supported on Windows platform yet.

mindspore.dataset.text.transforms.CaseFold

Apply case fold operation on a UTF-8 string tensor. Case folding is more aggressive than simple lowercasing: it can convert more characters into lower case.

CaseFold is not supported on Windows platform yet.
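To illustrate why case folding is described as more aggressive than lowercasing, here is a sketch using Python's built-in str methods (not the MindSpore operation itself), where the German sharp s behaves differently under the two conversions:

```python
# Plain lowercasing keeps characters that have no simple lowercase mapping,
# while case folding maps them to a canonical caseless form.
word = "Straße"
lowered = word.lower()      # "straße" - sharp s is unchanged
folded = word.casefold()    # "strasse" - sharp s folds to "ss"
print(lowered, folded)
```

This is why case-folded text is preferred for caseless matching.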

mindspore.dataset.text.transforms.JiebaTokenizer

Tokenize a Chinese string into words based on a dictionary.

The integrity of the HMMSegment algorithm and MPSegment algorithm files must be confirmed.

mindspore.dataset.text.transforms.Lookup

Look up a word and return its id according to the input vocabulary table.

None
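As a rough illustration of the lookup idea (a plain-Python sketch, not the MindSpore API), each word is mapped to its id through a vocabulary table, with out-of-vocabulary words falling back to an unknown-token id; the `lookup` helper and the vocabulary contents here are hypothetical:

```python
# Hypothetical vocabulary table: word -> id, with a reserved unknown token.
def lookup(vocab, word, unknown_id):
    # Return the word's id, or the unknown-token id if it is out of vocabulary.
    return vocab.get(word, unknown_id)

vocab = {"<unk>": 0, "hello": 1, "world": 2}
ids = [lookup(vocab, w, vocab["<unk>"]) for w in ["hello", "there", "world"]]
print(ids)  # [1, 0, 2]
```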

mindspore.dataset.text.transforms.Ngram

TensorOp to generate n-gram from a 1-D string Tensor.

None
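The n-gram idea can be sketched in plain Python (the function name and separator choice here are illustrative, not the MindSpore signature):

```python
# Generate n-grams from a list of tokens by joining each run of n
# consecutive tokens with a separator.
def ngrams(tokens, n, separator=" "):
    return [separator.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams(["the", "quick", "brown", "fox"], 2))
# ['the quick', 'quick brown', 'brown fox']
```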

mindspore.dataset.text.transforms.NormalizeUTF8

Apply normalize operation on UTF-8 string tensor.

NormalizeUTF8 is not supported on Windows platform yet.

mindspore.dataset.text.transforms.PythonTokenizer

Class that applies a user-defined tokenizer to the input string.

None

mindspore.dataset.text.transforms.RegexReplace

Replace parts of a UTF-8 string tensor with given text according to regular expressions.

RegexReplace is not supported on Windows platform yet.

mindspore.dataset.text.transforms.RegexTokenizer

Tokenize a scalar tensor of UTF-8 string by a regular expression pattern.

RegexTokenizer is not supported on Windows platform yet.

mindspore.dataset.text.transforms.SentencePieceTokenizer

Tokenize a scalar token or 1-D tokens into tokens using SentencePiece.

None

mindspore.dataset.text.transforms.SlidingWindow

Construct a tensor from given data (only 1-D data is supported for now), where each element in the dimension axis is a slice of the data starting at the corresponding position, with a specified width.

None
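The sliding-window behavior described above can be sketched in plain Python (lists stand in for tensors; the function name is illustrative):

```python
# Each output element is a slice of the 1-D input starting at the
# corresponding position, with the specified width.
def sliding_window(data, width):
    return [data[i:i + width] for i in range(len(data) - width + 1)]

print(sliding_window([1, 2, 3, 4, 5], 3))
# [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
```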

mindspore.dataset.text.transforms.ToNumber

Tensor operation to convert every element of a string tensor to a number.

None
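A minimal sketch of the string-to-number conversion, using a plain list in place of a string tensor (the strict-cast behavior shown is an assumption of this sketch, not a statement about the operation's error handling):

```python
# Convert every element of a list of strings to a number; elements that
# are not valid numbers raise ValueError, mirroring a strict cast.
def to_number(strings, cast=float):
    return [cast(s) for s in strings]

print(to_number(["1.5", "-2", "3e2"]))  # [1.5, -2.0, 300.0]
```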

mindspore.dataset.text.transforms.TruncateSequencePair

Truncate a pair of rank-1 tensors such that the total length is less than max_length.

None
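One common pair-truncation strategy can be sketched as follows; note this is a hedged illustration in plain Python, and the exact truncation rule (which sequence is shortened first, tie-breaking) is an assumption here, not necessarily what the MindSpore operation does:

```python
# Repeatedly drop the last element of the longer sequence until the
# combined length fits within max_length (ties shorten the second sequence).
def truncate_sequence_pair(seq1, seq2, max_length):
    seq1, seq2 = list(seq1), list(seq2)
    while len(seq1) + len(seq2) > max_length:
        longer = seq1 if len(seq1) > len(seq2) else seq2
        longer.pop()
    return seq1, seq2

print(truncate_sequence_pair([1, 2, 3], [4, 5], 4))  # ([1, 2], [4, 5])
```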

mindspore.dataset.text.transforms.UnicodeCharTokenizer

Tokenize a scalar tensor of UTF-8 string to Unicode characters.

None

mindspore.dataset.text.transforms.UnicodeScriptTokenizer

Tokenize a scalar tensor of UTF-8 string based on Unicode script boundaries.

UnicodeScriptTokenizer is not supported on Windows platform yet.

mindspore.dataset.text.transforms.WhitespaceTokenizer

Tokenize a scalar tensor of UTF-8 string on ICU4C defined whitespaces, such as: ' ', '\t', '\r', '\n'.

WhitespaceTokenizer is not supported on Windows platform yet.
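The effect is close to what Python's built-in `str.split()` does, which splits on runs of Unicode whitespace (a rough approximation, not the ICU4C implementation):

```python
# Splitting on runs of whitespace, including tabs and newlines.
sentence = "Welcome  to\tBeijing\n"
print(sentence.split())  # ['Welcome', 'to', 'Beijing']
```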

mindspore.dataset.text.transforms.WordpieceTokenizer

Tokenize a scalar token or 1-D tokens into 1-D subword tokens.

None

mindspore.dataset.text.utils

API Name

Description

Note

mindspore.dataset.text.utils.SentencePieceVocab

SentencePiece object that is used to perform word segmentation.

None

mindspore.dataset.text.utils.to_str

Convert NumPy array of bytes to array of str by decoding each element based on charset encoding.

None

mindspore.dataset.text.utils.to_bytes

Convert NumPy array of str to array of bytes by encoding each element based on charset encoding.

None
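The encode/decode round trip behind to_bytes and to_str can be sketched element-wise in plain Python (the real helpers operate on NumPy arrays, but the per-element behavior is the same idea):

```python
# Encode each str element to bytes, then decode back, with a chosen
# charset encoding (UTF-8 here).
words = ["hello", "世界"]
as_bytes = [w.encode("utf-8") for w in words]   # to_bytes analogue
back = [b.decode("utf-8") for b in as_bytes]    # to_str analogue
print(back)  # ['hello', '世界']
```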

mindspore.dataset.text.utils.Vocab

Vocab object that is used to look up a word.

None