Class RegexTokenizer

Inheritance Relationships

Base Type
  • public mindspore::dataset::TensorTransform

Class Documentation

class RegexTokenizer : public mindspore::dataset::TensorTransform

Tokenize a scalar tensor containing a UTF-8 string according to a regular expression pattern.

Public Functions

inline explicit RegexTokenizer(const std::string &delim_pattern, const std::string &keep_delim_pattern = "", bool with_offsets = false)

Constructor.

Parameters
  • delim_pattern[in] The pattern of regex delimiters.

  • keep_delim_pattern[in] A string matched by 'delim_pattern' is kept as a token if it is also matched by 'keep_delim_pattern'. The default is an empty string (""), which means that delimiters are not kept as output tokens.

  • with_offsets[in] Whether to output offsets of tokens (default=false).
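To make the interaction between the three parameters concrete, the following is a minimal standalone sketch of the splitting rule described above. It is not MindSpore code: SketchRegexTokenize and Token are hypothetical names introduced for illustration, std::regex stands in for whatever regex engine the library uses, and the offsets are assumed to be byte positions. The real operator may differ in regex dialect, offset units, and edge cases.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// One output token plus its offsets in the input string (assumed to be bytes).
struct Token {
  std::string text;
  std::size_t start;  // offset of the first byte of the token
  std::size_t limit;  // offset one past the last byte of the token
};

// Hypothetical helper, not part of MindSpore. Splits `input` at every
// non-empty match of `delim_pattern`. A matched delimiter is emitted as a
// token only if it also fully matches `keep_delim_pattern`; an empty
// `keep_delim_pattern` means no delimiter is kept.
std::vector<Token> SketchRegexTokenize(const std::string &input,
                                       const std::string &delim_pattern,
                                       const std::string &keep_delim_pattern = "") {
  std::vector<Token> tokens;
  const std::regex delim(delim_pattern);
  const bool keep_any = !keep_delim_pattern.empty();
  const std::regex keep(keep_delim_pattern);

  std::size_t pos = 0;  // start of the text that has not been emitted yet
  for (std::sregex_iterator it(input.begin(), input.end(), delim), end; it != end; ++it) {
    const std::smatch &m = *it;
    if (m.length() == 0) continue;  // skip empty matches in this sketch
    const std::size_t begin = static_cast<std::size_t>(m.position());
    const std::size_t len = static_cast<std::size_t>(m.length());
    // Text between the previous delimiter and this one is always a token.
    if (begin > pos) tokens.push_back({input.substr(pos, begin - pos), pos, begin});
    // The delimiter itself is a token only when the keep pattern matches it.
    if (keep_any && std::regex_match(m.str(), keep)) {
      tokens.push_back({m.str(), begin, begin + len});
    }
    pos = begin + len;
  }
  // Trailing text after the last delimiter.
  if (pos < input.size()) tokens.push_back({input.substr(pos), pos, input.size()});
  return tokens;
}
```

Under these assumptions, SketchRegexTokenize("Hello  World", "\\s+") yields the tokens "Hello" and "World", while passing "\\s+" as the third argument also yields the two-space delimiter as a token between them. The start and limit fields correspond to the kind of information that with_offsets = true is documented to add.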

Example
/* Define operations */
auto regex_op = text::RegexTokenizer("\\s+", "\\s+", false);

/* dataset is an instance of Dataset object */
dataset = dataset->Map({regex_op},   // operations
                       {"text"});    // input columns

~RegexTokenizer() = default

Destructor.