Class BasicTokenizer

Inheritance Relationships

Base Type

  • public mindspore::dataset::TensorTransform

Class Documentation

class BasicTokenizer : public mindspore::dataset::TensorTransform

Tokenizes a scalar tensor holding a UTF-8 string according to specific rules.

Note

BasicTokenizer is not supported on the Windows platform yet.

Public Functions

explicit BasicTokenizer(bool lower_case = false, bool keep_whitespace = false, NormalizeForm normalize_form = NormalizeForm::kNone, bool preserve_unused_token = true, bool with_offsets = false)

Constructor.

Parameters
  • lower_case[in] If true, apply the CaseFold, NormalizeUTF8 (NFD mode) and RegexReplace operations to the input text to fold it to lower case and strip accent characters. If false, apply only the NormalizeUTF8 operation (in the mode given by ‘normalize_form’) to the input text (default=false).

  • keep_whitespace[in] If true, whitespace is kept in the output tokens (default=false).

  • normalize_form[in] The Unicode normalization mode to apply. Only effective when ‘lower_case’ is false. See NormalizeUTF8 for details (default=NormalizeForm::kNone).

  • preserve_unused_token[in] If true, do not split special tokens like ‘[CLS]’, ‘[SEP]’, ‘[UNK]’, ‘[PAD]’ and ‘[MASK]’ (default=true).

  • with_offsets[in] Whether to output offsets of tokens (default=false).
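To make the effect of these flags concrete, the following is a minimal, self-contained sketch of a simplified tokenizer. It is not the MindSpore implementation: it handles ASCII input only, performs no Unicode normalization or accent stripping, and the names Token, SketchOptions and SketchTokenize are hypothetical. It also assumes that special tokens are left unchanged when lower-casing, and it always returns byte offsets, which roughly corresponds to with_offsets = true.

```cpp
#include <cctype>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical types for illustration only; not part of the MindSpore API.
struct Token {
  std::string text;
  std::size_t start;  // byte offset of the first character
  std::size_t limit;  // byte offset one past the last character
};

struct SketchOptions {
  bool lower_case = false;
  bool keep_whitespace = false;
  bool preserve_unused_token = true;
};

// Simplified, ASCII-only tokenizer: runs of alphanumeric characters form one
// token; every other character (whitespace or punctuation) is its own token.
std::vector<Token> SketchTokenize(const std::string &input, const SketchOptions &opt) {
  static const std::set<std::string> kSpecial = {"[CLS]", "[SEP]", "[UNK]", "[PAD]", "[MASK]"};
  std::vector<Token> tokens;
  std::size_t i = 0;
  while (i < input.size()) {
    const std::size_t start = i;
    const unsigned char c = static_cast<unsigned char>(input[i]);
    bool is_word = false;
    if (opt.preserve_unused_token && c == '[') {
      // Keep a known special token such as "[CLS]" in one piece.
      const std::size_t close = input.find(']', i);
      if (close != std::string::npos && kSpecial.count(input.substr(i, close - i + 1)) > 0) {
        i = close + 1;
      }
    }
    if (i == start) {
      if (std::isalnum(c)) {
        while (i < input.size() && std::isalnum(static_cast<unsigned char>(input[i]))) ++i;
        is_word = true;
      } else {
        ++i;  // whitespace or punctuation: one character per token
      }
    }
    // Drop whitespace tokens unless the caller asked to keep them.
    if (std::isspace(c) && !opt.keep_whitespace) continue;
    std::string text = input.substr(start, i - start);
    if (opt.lower_case && is_word) {
      // Assumption: special tokens and punctuation are not case-folded.
      for (char &ch : text) ch = static_cast<char>(std::tolower(static_cast<unsigned char>(ch)));
    }
    tokens.push_back({text, start, i});
  }
  return tokens;
}
```

In this sketch, tokenizing "[CLS] Hello, World!" with lower_case = true yields the tokens "[CLS]", "hello", ",", "world" and "!", with "hello" spanning byte offsets 6 to 11. Setting preserve_unused_token = false would instead split "[CLS]" into "[", "CLS" and "]".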

Example
/* Define operations */
auto tokenizer_op = text::BasicTokenizer();

/* dataset is an instance of Dataset object */
dataset = dataset->Map({tokenizer_op},   // operations
                       {"text"});        // input columns

~BasicTokenizer() = default

Destructor.