mindspore.dataset.text

This module supports text processing for NLP. It has two parts: transforms and utils. transforms is a high-performance NLP text processing module built on ICU4C and cppjieba. utils provides general-purpose methods for NLP text processing.

mindspore.dataset.text.transforms

| API Name | Description | Note |
| --- | --- | --- |
| mindspore.dataset.text.transforms.BasicTokenizer | Tokenize a scalar tensor of UTF-8 string by specific rules. | BasicTokenizer is not supported on the Windows platform yet. |
| mindspore.dataset.text.transforms.BertTokenizer | Tokenizer used for BERT text processing. | BertTokenizer is not supported on the Windows platform yet. |
| mindspore.dataset.text.transforms.CaseFold | Apply a case fold operation to a UTF-8 string tensor. | CaseFold is not supported on the Windows platform yet. |
| mindspore.dataset.text.transforms.JiebaTokenizer | Tokenize a Chinese string into words based on a dictionary. | The integrity of the HMMSegment algorithm and MPSegment algorithm files must be confirmed. |
| mindspore.dataset.text.transforms.Lookup | Lookup operator that maps a word to an id. | None |
| mindspore.dataset.text.transforms.Ngram | TensorOp to generate n-grams from a 1-D string tensor. | None |
| mindspore.dataset.text.transforms.NormalizeUTF8 | Apply a normalize operation to a UTF-8 string tensor. | NormalizeUTF8 is not supported on the Windows platform yet. |
| mindspore.dataset.text.transforms.PythonTokenizer | Callable class to be used for a user-defined string tokenizer. | None |
| mindspore.dataset.text.transforms.RegexReplace | Replace the parts of a UTF-8 string tensor that match the regular expression 'pattern' with 'replace'. | RegexReplace is not supported on the Windows platform yet. |
| mindspore.dataset.text.transforms.RegexTokenizer | Tokenize a scalar tensor of UTF-8 string by a regular expression pattern. | RegexTokenizer is not supported on the Windows platform yet. |
| mindspore.dataset.text.transforms.SentencePieceTokenizer | Tokenize a scalar token or 1-D tokens into tokens with SentencePiece. | None |
| mindspore.dataset.text.transforms.SlidingWindow | TensorOp to construct a tensor from data (only 1-D for now), where each element in the dimension axis is a slice of data starting at the corresponding position, with a specified width. | None |
| mindspore.dataset.text.transforms.ToNumber | Tensor operation to convert every element of a string tensor to a number. | None |
| mindspore.dataset.text.transforms.TruncateSequencePair | Truncate a pair of rank-1 tensors such that the total length is less than max_length. | None |
| mindspore.dataset.text.transforms.UnicodeCharTokenizer | Tokenize a scalar tensor of UTF-8 string to Unicode characters. | None |
| mindspore.dataset.text.transforms.UnicodeScriptTokenizer | Tokenize a scalar tensor of UTF-8 string on Unicode script boundaries. | UnicodeScriptTokenizer is not supported on the Windows platform yet. |
| mindspore.dataset.text.transforms.WhitespaceTokenizer | Tokenize a scalar tensor of UTF-8 string on ICU4C-defined whitespace, such as ' ', '\t', '\r', '\n'. | WhitespaceTokenizer is not supported on the Windows platform yet. |
| mindspore.dataset.text.transforms.WordpieceTokenizer | Tokenize a scalar token or 1-D tokens into 1-D subword tokens. | None |
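
The Ngram and SlidingWindow descriptions above can be made concrete with a minimal pure-Python sketch of the underlying ideas. It does not call MindSpore: the helper names `ngrams` and `sliding_window` are hypothetical, they work on plain Python lists rather than tensors, and the space separator is an assumption. The real operators may accept further options that are not modeled here.

```python
from typing import List, Sequence


def ngrams(tokens: Sequence[str], n: int, separator: str = " ") -> List[str]:
    """Join every run of n consecutive tokens into one string.

    Hypothetical helper for illustration; not the MindSpore Ngram operator.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    # A sequence shorter than n yields an empty list.
    return [separator.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def sliding_window(data: Sequence[int], width: int) -> List[List[int]]:
    """Return one slice of length `width` starting at each valid position.

    Hypothetical helper for illustration; handles 1-D data only.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    return [list(data[i:i + width]) for i in range(len(data) - width + 1)]


words = ["the", "quick", "brown", "fox"]
bigrams = ngrams(words, 2)
windows = sliding_window([1, 2, 3, 4, 5], 3)

print(bigrams)
print(windows)
```

Both helpers produce `len(input) - size + 1` results, which is why a 5-element input with width 3 gives three windows.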

mindspore.dataset.text.utils

| API Name | Description | Note |
| --- | --- | --- |
| mindspore.dataset.text.utils.SentencePieceVocab | SentencePiece object that is used to segment words. | None |
| mindspore.dataset.text.utils.to_str | Convert a NumPy array of bytes to an array of str by decoding each element based on the charset encoding. | None |
| mindspore.dataset.text.utils.to_bytes | Convert a NumPy array of str to an array of bytes by encoding each element based on the charset encoding. | None |
| mindspore.dataset.text.utils.Vocab | Vocab object that is used to look up a word. | None |
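
As a rough illustration of what the utilities above do conceptually, the following pure-Python sketch converts between str and bytes element by element and maps words to ids. It does not use MindSpore or NumPy. All function names are hypothetical, the default `utf8` encoding is an assumption, and the first-seen id assignment and the `-1` unknown id are assumptions that may differ from how the real Vocab behaves.

```python
from typing import Dict, Iterable, List


def to_bytes_list(strings: Iterable[str], encoding: str = "utf8") -> List[bytes]:
    """Encode each str element to bytes (hypothetical stand-in for to_bytes)."""
    return [s.encode(encoding) for s in strings]


def to_str_list(raw: Iterable[bytes], encoding: str = "utf8") -> List[str]:
    """Decode each bytes element to str (hypothetical stand-in for to_str)."""
    return [b.decode(encoding) for b in raw]


def build_vocab(words: Iterable[str]) -> Dict[str, int]:
    """Assign ids in first-seen order, skipping duplicates (an assumption)."""
    vocab: Dict[str, int] = {}
    for word in words:
        if word not in vocab:
            vocab[word] = len(vocab)
    return vocab


def lookup(vocab: Dict[str, int], tokens: Iterable[str], unknown_id: int = -1) -> List[int]:
    """Map each token to its id, falling back to unknown_id for unseen words."""
    return [vocab.get(token, unknown_id) for token in tokens]


encoded = to_bytes_list(["深度", "learning"])
decoded = to_str_list(encoded)

vocab = build_vocab(["deep", "learning", "deep", "text"])
ids = lookup(vocab, ["text", "deep", "nlp"])

print(decoded)
print(ids)
```

Encoding and then decoding with the same charset returns the original strings, and a word that is missing from the vocabulary maps to the fallback id instead of raising an error.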