mindspore.dataset.text
This module supports text processing for NLP. It consists of two parts: transforms and utils. transforms is a high-performance NLP text processing module built on ICU4C and cppjieba. utils provides general-purpose helper methods for NLP text processing.
The API examples for this module assume the following imports:
import mindspore.dataset as ds
from mindspore.dataset import text
mindspore.dataset.text.transforms
API Name | Description | Note
BasicTokenizer | Tokenize a scalar tensor of a UTF-8 string according to specific rules. | BasicTokenizer is not supported on the Windows platform yet.
BertTokenizer | Tokenizer used for BERT text processing. | BertTokenizer is not supported on the Windows platform yet.
CaseFold | Apply case folding to a UTF-8 string tensor. Case folding is more aggressive than plain lowercasing and converts more characters to lower case. | CaseFold is not supported on the Windows platform yet.
JiebaTokenizer | Tokenize a Chinese string into words based on a dictionary. | The integrity of the HMMSegment algorithm and MPSegment algorithm files must be confirmed.
Lookup | Look up a word and map it to an id according to the input vocabulary table. | None
Ngram | TensorOp to generate n-grams from a 1-D string tensor. | None
NormalizeUTF8 | Apply a normalization operation to a UTF-8 string tensor. | NormalizeUTF8 is not supported on the Windows platform yet.
PythonTokenizer | Class that applies a user-defined string tokenizer to the input string. | None
RegexReplace | Replace part of a UTF-8 string tensor with the given text according to regular expressions. | RegexReplace is not supported on the Windows platform yet.
RegexTokenizer | Tokenize a scalar tensor of a UTF-8 string by a regular expression pattern. | RegexTokenizer is not supported on the Windows platform yet.
SentencePieceTokenizer | Tokenize a scalar token or 1-D tokens into tokens with SentencePiece. | None
SlidingWindow | Construct a tensor from the given data (only 1-D is supported for now), where each element along the dimension axis is a slice of the data starting at the corresponding position, with a specified width. | None
ToNumber | Tensor operation to convert every element of a string tensor to a number. | None
TruncateSequencePair | Truncate a pair of rank-1 tensors such that the total length is less than max_length. | None
UnicodeCharTokenizer | Tokenize a scalar tensor of a UTF-8 string into Unicode characters. | None
UnicodeScriptTokenizer | Tokenize a scalar tensor of a UTF-8 string based on Unicode script boundaries. | UnicodeScriptTokenizer is not supported on the Windows platform yet.
WhitespaceTokenizer | Tokenize a scalar tensor of a UTF-8 string on ICU4C-defined whitespace characters, such as ' ', '\t', '\r', '\n'. | WhitespaceTokenizer is not supported on the Windows platform yet.
WordpieceTokenizer | Tokenize a scalar token or 1-D tokens into 1-D subword tokens. | None
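The n-gram, sliding-window, and pair-truncation descriptions in the table above are easier to follow with concrete inputs. The following is a minimal pure-Python sketch of those three behaviours. The helper names (`ngrams`, `sliding_window`, `truncate_pair`) are hypothetical and are not part of mindspore.dataset.text; the sketch works on plain lists rather than tensors, and it assumes the length limit in pair truncation is inclusive and that the longer sequence is trimmed from its end.

```python
# Illustrative helpers only; these are NOT the MindSpore implementations.

def ngrams(tokens, n, separator=" "):
    """Join every run of n consecutive tokens with `separator`."""
    return [separator.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def sliding_window(data, width):
    """Return one slice of length `width` starting at each valid position."""
    return [data[i:i + width] for i in range(len(data) - width + 1)]


def truncate_pair(first, second, max_length):
    """Trim the longer sequence from its end until the pair fits.

    Assumption: the combined length may equal max_length.
    """
    first, second = list(first), list(second)
    while len(first) + len(second) > max_length:
        if len(first) > len(second):
            first.pop()
        else:
            second.pop()
    return first, second


print(ngrams(["WildRose", "Country", "Canada"], 2))
print(sliding_window([1, 2, 3, 4, 5], 3))
print(truncate_pair([1, 2, 3], [4, 5], 4))
```

The real operators run inside a dataset pipeline and may offer extra options (for example padding for n-grams) that this sketch leaves out.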
mindspore.dataset.text.utils
API Name | Description | Note
JiebaMode | An enumeration for JiebaTokenizer. | None
NormalizeForm | An enumeration for NormalizeUTF8. | None
SentencePieceModel | An enumeration for SentencePieceModel. | None
SentencePieceVocab | SentencePiece object that is used to perform word segmentation. | None
SPieceTokenizerLoadType | An enumeration for SPieceTokenizerLoadType. | None
SPieceTokenizerOutType | An enumeration for SPieceTokenizerOutType. | None
to_str | Convert a NumPy array of bytes to an array of str by decoding each element with the given charset encoding. | None
to_bytes | Convert a NumPy array of str to an array of bytes by encoding each element with the given charset encoding. | None
Vocab | Vocab object that is used to store pairs of words and ids. | None
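The byte/str conversion and word-to-id descriptions in the table above can be illustrated with a small pure-Python sketch. Everything here is a hypothetical stand-in: `encode_all`, `decode_all`, and `TinyVocab` are names invented for illustration, they operate on plain lists rather than NumPy arrays, and the fallback to an unknown-word token is an assumption about how a lookup might treat out-of-vocabulary words.

```python
# Illustrative stand-ins only; these are NOT the mindspore.dataset.text.utils APIs.

def encode_all(strings, encoding="utf8"):
    """Encode each str element to bytes with the given charset."""
    return [s.encode(encoding) for s in strings]


def decode_all(byte_strings, encoding="utf8"):
    """Decode each bytes element to str with the given charset."""
    return [b.decode(encoding) for b in byte_strings]


class TinyVocab:
    """Hypothetical word-to-id table; ids follow first-seen order."""

    def __init__(self, words):
        self._ids = {}
        for word in words:
            # setdefault keeps the first id if a word is repeated
            self._ids.setdefault(word, len(self._ids))

    def lookup(self, word, unknown_token=None):
        """Return the id of `word`, falling back to `unknown_token` if given."""
        if word in self._ids:
            return self._ids[word]
        if unknown_token is not None and unknown_token in self._ids:
            return self._ids[unknown_token]
        raise KeyError(word)


vocab = TinyVocab(["<unk>", "hello", "world"])
print(vocab.lookup("world"))
print(vocab.lookup("mars", unknown_token="<unk>"))
print(decode_all(encode_all(["你好", "abc"])))
```

Raising an error when no fallback token is supplied is a design choice of this sketch: it makes out-of-vocabulary words visible instead of silently mapping them to an arbitrary id.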