mindspore.dataset.text
This module supports text processing for NLP. It has two parts: transforms and utils. transforms is a high-performance NLP text processing module built on ICU4C and cppjieba. utils provides general-purpose helper methods for NLP text processing.
mindspore.dataset.text.transforms
API Name |
Description |
Note |
Tokenize a scalar tensor of UTF-8 string by specific rules. |
BasicTokenizer is not yet supported on the Windows platform. |
|
Tokenizer used for BERT text processing. |
BertTokenizer is not yet supported on the Windows platform. |
|
Apply a case-fold operation to a UTF-8 string tensor. Case folding is more aggressive than plain lowercasing and converts more characters to lower case. |
CaseFold is not yet supported on the Windows platform. |
|
Tokenize a Chinese string into words based on a dictionary. |
The integrity of the HMMSegment algorithm and MPSegment algorithm files must be confirmed. |
|
Look up the id of a word in the input vocabulary table. |
None |
|
TensorOp to generate n-grams from a 1-D string tensor. |
None |
|
Apply a normalization operation to a UTF-8 string tensor. |
NormalizeUTF8 is not yet supported on the Windows platform. |
|
Callable class to be used for a user-defined string tokenizer. |
None |
|
Replace the parts of a UTF-8 string tensor that match the regular expression 'pattern' with 'replace'. |
RegexReplace is not yet supported on the Windows platform. |
|
Tokenize a scalar tensor of UTF-8 string by a regular expression pattern. |
RegexTokenizer is not yet supported on the Windows platform. |
|
Tokenize a scalar token or 1-D tokens into tokens with SentencePiece. |
None |
|
TensorOp to construct a tensor from data (only 1-D for now), where each element in the dimension axis is a slice of data starting at the corresponding position, with a specified width. |
None |
|
Tensor operation to convert every element of a string tensor to a number. |
None |
|
Truncate a pair of rank-1 tensors such that the total length is less than max_length. |
None |
|
Tokenize a scalar tensor of UTF-8 string to Unicode characters. |
None |
|
Tokenize a scalar tensor of UTF-8 string on Unicode script boundaries. |
UnicodeScriptTokenizer is not yet supported on the Windows platform. |
|
Tokenize a scalar tensor of UTF-8 string on ICU4C-defined whitespace characters, such as: ' ', '\t', '\r', '\n'. |
WhitespaceTokenizer is not yet supported on the Windows platform. |
|
Tokenize a scalar token or 1-D tokens into 1-D subword tokens. |
None |
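Several of the operations described in the table above are easy to picture on plain Python values. The following is a minimal sketch of the described behavior only, not MindSpore's implementation. The function names (`whitespace_tokenize`, `unicode_chars`, `ngrams`, `sliding_window`) are hypothetical, plain lists stand in for tensors, Python's `str.split()` stands in for ICU4C whitespace handling, and the n-gram `separator` is an assumed parameter.

```python
def whitespace_tokenize(s):
    # str.split() with no arguments splits on runs of whitespace
    # (spaces, tabs, carriage returns, newlines) and drops empty strings.
    return s.split()


def unicode_chars(s):
    # A Python str is a sequence of Unicode code points, so list()
    # yields one element per character.
    return list(s)


def ngrams(tokens, n, separator=" "):
    # Join every run of n consecutive tokens into one string.
    # Returns an empty list when there are fewer than n tokens.
    return [separator.join(tokens[i:i + n])
            for i in range(len(tokens) - n + 1)]


def sliding_window(data, width):
    # Each output element is a slice of `data` that starts at the
    # corresponding position and has the given width (1-D input only).
    return [data[i:i + width] for i in range(len(data) - width + 1)]


tokens = whitespace_tokenize("Welcome to\tBeijing\n")
print(tokens)
print(unicode_chars("北京"))
print(ngrams(tokens, 2))
print(sliding_window([1, 2, 3, 4, 5], 3))
```

The sketch ignores padding options and multi-dimensional input, which a production operation may well support.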
mindspore.dataset.text.utils
API Name |
Description |
Note |
SentencePiece object that is used to segment words. |
None |
|
Convert a NumPy array of bytes to an array of str by decoding each element with the given charset encoding. |
None |
|
Convert a NumPy array of str to an array of bytes by encoding each element with the given charset encoding. |
None |
|
Vocab object that is used to look up a word. |
None |
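The byte/string conversions and the vocabulary lookup described above can be sketched in plain Python. This illustrates the described behavior under stated assumptions and is not the module's implementation. Lists stand in for NumPy arrays, and the names `decode_all`, `encode_all`, `build_vocab`, and `lookup`, as well as the `<unk>` fallback token, are hypothetical.

```python
def decode_all(items, encoding="utf-8"):
    # Decode each bytes element into a str using the given charset.
    return [b.decode(encoding) for b in items]


def encode_all(items, encoding="utf-8"):
    # Encode each str element into bytes using the given charset.
    return [s.encode(encoding) for s in items]


def build_vocab(words, special_tokens=("<unk>",)):
    # Assign consecutive ids, special tokens first; duplicates keep
    # the id they received on first appearance.
    vocab = {}
    for word in list(special_tokens) + list(words):
        vocab.setdefault(word, len(vocab))
    return vocab


def lookup(vocab, word, unknown_token="<unk>"):
    # Return the id of `word`, falling back to the unknown token's id.
    return vocab.get(word, vocab[unknown_token])


raw = [b"hello", "世界".encode("utf-8")]
strings = decode_all(raw)
vocab = build_vocab(strings)
print(strings)
print([lookup(vocab, w) for w in strings + ["missing"]])
```

Mapping out-of-vocabulary words to a dedicated unknown token is one common design choice; raising an error instead is equally reasonable when the vocabulary is expected to be complete.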