Class BertTokenizer

Inheritance Relationships

Base Type

  • public mindspore::dataset::TensorTransform

Class Documentation

class BertTokenizer : public mindspore::dataset::TensorTransform

A tokenizer for BERT text processing.

Note

BertTokenizer is not supported on the Windows platform yet.

Public Functions

inline explicit BertTokenizer(const std::shared_ptr<Vocab> &vocab, const std::string &suffix_indicator = "##", int32_t max_bytes_per_token = 100, const std::string &unknown_token = "[UNK]", bool lower_case = false, bool keep_whitespace = false, const NormalizeForm normalize_form = NormalizeForm::kNone, bool preserve_unused_token = true, bool with_offsets = false)

Constructor.

Parameters
  • vocab[in] A Vocab object.

  • suffix_indicator[in] Marker prepended to a sub-word to show that it continues a word rather than starts one (default="##").

  • max_bytes_per_token[in] Maximum token length in bytes; tokens longer than this are not split further (default=100).

  • unknown_token[in] String returned for a token that cannot be found in the vocabulary. If 'unknown_token' is an empty string, the token itself is returned unchanged (default="[UNK]").

  • lower_case[in] If true, apply the CaseFold, NormalizeUTF8 (NFD mode) and RegexReplace operations to fold the input text to lower case and strip accent characters. If false, apply only NormalizeUTF8 in the mode given by 'normalize_form' (default=false).

  • keep_whitespace[in] If true, the whitespace will be kept in output tokens (default=false).

  • normalize_form[in] The Unicode normalization form to apply. Only takes effect when 'lower_case' is false. See NormalizeUTF8 for details (default=NormalizeForm::kNone).

  • preserve_unused_token[in] If true, do not split special tokens such as '[CLS]', '[SEP]', '[UNK]', '[PAD]' and '[MASK]' (default=true).

  • with_offsets[in] Whether to output offsets of tokens (default=false).
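How suffix_indicator, max_bytes_per_token and unknown_token interact can be sketched with a greedy longest-match-first WordPiece split that uses only the C++ standard library. This is an illustration, not MindSpore's implementation: the function WordPieceSketch and its unordered_set vocabulary are hypothetical, the code works on raw bytes without regard to UTF-8 boundaries, and it assumes that a word over the byte limit falls back to the same unknown-token rule as a word with no match.

```cpp
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper (not a MindSpore API): splits one whitespace-free word
// into sub-words by repeatedly taking the longest vocabulary match.
std::vector<std::string> WordPieceSketch(const std::string &word,
                                         const std::unordered_set<std::string> &vocab,
                                         const std::string &suffix_indicator = "##",
                                         std::size_t max_bytes_per_token = 100,
                                         const std::string &unknown_token = "[UNK]") {
  // Fallback rule: an empty unknown_token returns the word itself.
  auto fallback = [&]() {
    return std::vector<std::string>{unknown_token.empty() ? word : unknown_token};
  };

  // Assumption of this sketch: over-long words are not split at all.
  if (word.size() > max_bytes_per_token) {
    return fallback();
  }

  std::vector<std::string> pieces;
  std::size_t start = 0;
  while (start < word.size()) {
    std::size_t end = word.size();
    bool found = false;
    // Try the longest remaining substring first, then shrink from the right.
    while (end > start) {
      std::string candidate = word.substr(start, end - start);
      if (start > 0) {
        // Every piece after the first carries the continuation marker.
        candidate = suffix_indicator + candidate;
      }
      if (vocab.count(candidate) > 0) {
        pieces.push_back(candidate);
        found = true;
        break;
      }
      --end;
    }
    if (!found) {
      // No piece matches at this position, so the whole word is unknown.
      return fallback();
    }
    start = end;
  }
  return pieces;
}
```

With the vocabulary {"un", "##aff", "##able"}, calling WordPieceSketch("unaffable", vocab) yields "un", "##aff", "##able", while a word with no match such as "xyz" yields "[UNK]", or "xyz" itself when unknown_token is passed as an empty string.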

Example
/* Define operations */
std::vector<std::string> list = {"a", "b", "c", "d"};
std::shared_ptr<Vocab> vocab = std::make_shared<Vocab>();
Status s = Vocab::BuildFromVector(list, {}, true, &vocab);
auto tokenizer_op = text::BertTokenizer(vocab);

/* dataset is an instance of Dataset object */
dataset = dataset->Map({tokenizer_op},   // operations
                       {"text"});        // input columns

BertTokenizer(const std::shared_ptr<Vocab> &vocab, const std::vector<char> &suffix_indicator, int32_t max_bytes_per_token, const std::vector<char> &unknown_token, bool lower_case, bool keep_whitespace, NormalizeForm normalize_form, bool preserve_unused_token, bool with_offsets)

Constructor taking std::vector<char> in place of std::string. This overload declares no default arguments.

Parameters
  • vocab[in] A Vocab object.

  • suffix_indicator[in] Marker prepended to a sub-word to show that it continues a word rather than starts one.

  • max_bytes_per_token[in] Maximum token length in bytes; tokens longer than this are not split further.

  • unknown_token[in] String returned for a token that cannot be found in the vocabulary. If 'unknown_token' is empty, the token itself is returned unchanged.

  • lower_case[in] If true, apply the CaseFold, NormalizeUTF8 (NFD mode) and RegexReplace operations to fold the input text to lower case and strip accent characters. If false, apply only NormalizeUTF8 in the mode given by 'normalize_form'.

  • keep_whitespace[in] If true, the whitespace will be kept in output tokens.

  • normalize_form[in] The Unicode normalization form to apply. Only takes effect when 'lower_case' is false. See NormalizeUTF8 for details.

  • preserve_unused_token[in] If true, do not split special tokens such as '[CLS]', '[SEP]', '[UNK]', '[PAD]' and '[MASK]'.

  • with_offsets[in] Whether to output offsets of tokens.
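The effect of keep_whitespace and with_offsets can be illustrated with a small standard-library splitter. This is a sketch under stated assumptions, not MindSpore code: TokenWithOffsets and SplitWithOffsets are hypothetical names, offsets are taken to be byte positions forming a half-open range [start, limit), and a run of consecutive whitespace is grouped into a single token.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record (not a MindSpore type): a token plus its byte range.
struct TokenWithOffsets {
  std::string text;
  std::size_t start;  // byte offset of the first byte of the token
  std::size_t limit;  // byte offset one past the last byte of the token
};

// Hypothetical helper: splits text into alternating whitespace and
// non-whitespace runs, dropping the whitespace runs unless asked to keep them.
std::vector<TokenWithOffsets> SplitWithOffsets(const std::string &text,
                                               bool keep_whitespace = false) {
  auto is_space_at = [&](std::size_t pos) {
    return std::isspace(static_cast<unsigned char>(text[pos])) != 0;
  };

  std::vector<TokenWithOffsets> out;
  std::size_t i = 0;
  while (i < text.size()) {
    const bool run_is_space = is_space_at(i);
    std::size_t j = i;
    // Extend the run while characters stay in the same class.
    while (j < text.size() && is_space_at(j) == run_is_space) {
      ++j;
    }
    if (!run_is_space || keep_whitespace) {
      out.push_back({text.substr(i, j - i), i, j});
    }
    i = j;
  }
  return out;
}
```

For the input "hello  world" (two spaces), the default call returns "hello" at [0, 5) and "world" at [7, 12); with keep_whitespace set to true, the two-space run is also returned as a token at [5, 7).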

~BertTokenizer() = default

Destructor.