mindspore.dataset.text
This module supports text processing for NLP. It consists of two parts: text transforms and utils. text transforms is a high-performance NLP text processing module developed with ICU4C and cppjieba; utils provides general methods for NLP text processing.
The modules commonly imported in the corresponding API examples are as follows:
import mindspore.dataset as ds
import mindspore.dataset.text as text
Descriptions of common data processing terms are as follows:
TensorOperation: the base class of all data processing operations implemented in C++.
TextTensorOperation: the base class of all text processing operations, derived from TensorOperation.
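Operation instances can also be applied eagerly to a single sample, outside of any dataset pipeline. The following is a minimal sketch, assuming a non-Windows platform (WhitespaceTokenizer is listed below as unavailable on Windows) and a made-up sample sentence:

import mindspore.dataset.text as text

# Apply a text operation eagerly to one sample (a sketch; WhitespaceTokenizer
# is not supported on Windows yet).
tokenizer = text.WhitespaceTokenizer()
tokens = tokenizer("Welcome to MindSpore")
print(tokens)  # expected output like: ['Welcome' 'to' 'MindSpore']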
Transforms
API Name | Description | Note
BasicTokenizer | Tokenize the input UTF-8 encoded string by specific rules. | Not supported on the Windows platform yet.
BertTokenizer | Tokenizer used for Bert text processing. | Not supported on the Windows platform yet.
CaseFold | Apply case fold operation on a UTF-8 string tensor; case folding is more aggressive than simple lowercasing and can convert more characters into lower case. | Not supported on the Windows platform yet.
FilterWikipediaXML | Filter Wikipedia XML dumps to "clean" text consisting only of lowercase letters (a-z, converted from A-Z) and spaces (never consecutive). | Not supported on the Windows platform yet.
JiebaTokenizer | Tokenize a Chinese string into words based on a dictionary. | The integrity of the HMMSegment and MPSegment algorithm files must be confirmed.
Lookup | Look up a word in the input vocabulary table and map it to an id. | None
Ngram | Generate n-grams from a 1-D string Tensor. | None
NormalizeUTF8 | Apply a normalization operation on a UTF-8 string tensor. | Not supported on the Windows platform yet.
PythonTokenizer | Apply a user-defined string tokenizer to the input string. | None
RegexReplace | Replace parts of a UTF-8 string tensor with given text according to regular expressions. | Not supported on the Windows platform yet.
RegexTokenizer | Tokenize a scalar tensor of UTF-8 string by a regular expression pattern. | Not supported on the Windows platform yet.
SentencePieceTokenizer | Tokenize a scalar token or 1-D tokens into tokens with SentencePiece. | None
SlidingWindow | Construct a tensor from the given data (only 1-D supported for now), where each element along the dimension axis is a slice of the data starting at the corresponding position, with a specified width. | None
ToNumber | Convert every element of a string tensor to a number. | None
ToVectors | Look up a token in the input vector table and map it to vectors. | None
TruncateSequencePair | Truncate a pair of rank-1 tensors such that their total length is less than max_length. | None
UnicodeCharTokenizer | Tokenize a scalar tensor of UTF-8 string into Unicode characters. | None
UnicodeScriptTokenizer | Tokenize a scalar tensor of UTF-8 string based on Unicode script boundaries. | Not supported on the Windows platform yet.
WhitespaceTokenizer | Tokenize a scalar tensor of UTF-8 string on ICU4C-defined whitespace characters, such as ' ', '\t', '\r', '\n'. | Not supported on the Windows platform yet.
WordpieceTokenizer | Tokenize the input text into subword tokens. | None
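A minimal sketch of how these transforms plug into a dataset pipeline, assuming a non-Windows platform and made-up sample sentences (the column name "text" is chosen for illustration):

import mindspore.dataset as ds
import mindspore.dataset.text as text

# Build a small in-memory dataset with a single "text" column.
data = ["Welcome to Beijing", "Deep learning with MindSpore"]
dataset = ds.NumpySlicesDataset(data, column_names=["text"], shuffle=False)

# Split each sample on whitespace (not supported on Windows yet).
dataset = dataset.map(operations=text.WhitespaceTokenizer(), input_columns=["text"])

for row in dataset.create_dict_iterator(num_epochs=1, output_numpy=True):
    print(row["text"])  # e.g. ['Welcome' 'to' 'Beijing']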
Utilities
API Name | Description | Note
CharNGram | CharNGram object used to map tokens to pre-trained vectors. | None
FastText | FastText object used to map tokens to vectors. | None
GloVe | GloVe object used to map tokens to vectors. | None
JiebaMode | An enumeration for the segmentation modes of JiebaTokenizer. | None
NormalizeForm | An enumeration for Unicode normalization forms. | None
SentencePieceModel | An enumeration for SentencePiece model types. | None
SentencePieceVocab | SentencePiece object used to perform word segmentation. | None
SPieceTokenizerLoadType | An enumeration for the loading type of SentencePieceTokenizer. | None
SPieceTokenizerOutType | An enumeration for the output type of SentencePieceTokenizer. | None
Vectors | Vectors object used to map tokens to vectors. | None
Vocab | Vocab object used to save pairs of words and ids. | None
to_bytes | Convert a NumPy array of str to an array of bytes by encoding each element with the given charset. | None
to_str | Convert a NumPy array of bytes to an array of str by decoding each element with the given charset. | None