Class WordpieceTokenizer

Inheritance Relationships

Base Type

public mindspore::dataset::TensorTransform

Class Documentation

class WordpieceTokenizer : public mindspore::dataset::TensorTransform

Tokenize a scalar token or a 1-D array of tokens into 1-D sub-word tokens.
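As a rough illustration of the greedy longest-match-first strategy that WordPiece tokenizers typically use (a conceptual sketch, not MindSpore's internal implementation; the "##" suffix indicator and "[UNK]" fallback mirror the constructor defaults below):

```cpp
#include <cassert>
#include <string>
#include <unordered_set>
#include <vector>

// Conceptual sketch of WordPiece splitting for a single word: repeatedly take
// the longest prefix of the remaining text that appears in the vocabulary.
// Non-initial pieces are looked up with the suffix indicator prepended.
std::vector<std::string> WordpieceSplit(const std::string &word,
                                        const std::unordered_set<std::string> &vocab,
                                        const std::string &suffix = "##",
                                        const std::string &unk = "[UNK]") {
  std::vector<std::string> pieces;
  size_t start = 0;
  while (start < word.size()) {
    size_t end = word.size();
    std::string match;
    // Greedily shrink the candidate until it is found in the vocabulary.
    while (end > start) {
      std::string piece = word.substr(start, end - start);
      if (start > 0) piece = suffix + piece;  // continuation pieces get "##"
      if (vocab.count(piece) > 0) {
        match = piece;
        break;
      }
      --end;
    }
    if (match.empty()) return {unk};  // no prefix matched: emit the unknown token
    pieces.push_back(match);
    start = end;
  }
  return pieces;
}
```

With a vocabulary containing "rabbit" and "##s", the word "rabbits" splits into {"rabbit", "##s"}, while a word with no matching prefix collapses to the unknown token.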

Public Functions

inline explicit WordpieceTokenizer(const std::shared_ptr<Vocab> &vocab, const std::string &suffix_indicator = "##", int32_t max_bytes_per_token = 100, const std::string &unknown_token = "[UNK]", bool with_offsets = false)

Constructor.

Example
/* Define operations */
std::vector<std::string> word_list = {"book", "apple", "rabbit"};
std::shared_ptr<Vocab> vocab = std::make_shared<Vocab>();
Status s = Vocab::BuildFromVector(word_list, {}, true, &vocab);
auto tokenizer_op = text::WordpieceTokenizer(vocab);

/* dataset is an instance of Dataset object */
dataset = dataset->Map({tokenizer_op},   // operations
                       {"text"});        // input columns

Parameters
  • vocab[in] A Vocab object used to look up sub-word tokens.

  • suffix_indicator[in] Prefix attached to sub-words that continue a word rather than begin it (default="##").

  • max_bytes_per_token[in] Tokens whose byte length exceeds this value will not be split further (default=100).

  • unknown_token[in] The string returned when a token cannot be found in the vocabulary; if 'unknown_token' is an empty string, the token itself is returned instead (default="[UNK]").

  • with_offsets[in] Whether to output the offsets of tokens (default=false).
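
When with_offsets is true, the tokenizer also emits the start and end offsets of each token, so the Map call maps one input column to several output columns. A sketch of that usage follows; the output column names "token", "offsets_start", and "offsets_limit" are assumptions based on the pattern used by MindSpore's other offset-producing tokenizers and should be checked against your version:

```cpp
/* Sketch: tokenizer that also emits token offsets. Output column names are
   assumptions, not taken from this page. */
auto tokenizer_op = text::WordpieceTokenizer(vocab, "##", 100, "[UNK]", true);

/* dataset is an instance of Dataset object */
dataset = dataset->Map({tokenizer_op},                                 // operations
                       {"text"},                                      // input columns
                       {"token", "offsets_start", "offsets_limit"});  // output columns
```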

~WordpieceTokenizer() = default

Destructor.