Class BertTokenizer
Defined in File text.h
Inheritance Relationships
Base Type
public mindspore::dataset::TensorTransform
(Class TensorTransform)
Class Documentation
class BertTokenizer : public mindspore::dataset::TensorTransform
A tokenizer used for BERT text processing.
Note
BertTokenizer is not supported on the Windows platform yet.
Public Functions
Constructor.
- Parameters
vocab – [in] A Vocab object.
suffix_indicator – [in] Prefix used to indicate that a sub-word is a continuation of a word rather than its first piece (default='##').
max_bytes_per_token – [in] Tokens longer than this length in bytes will not be split further (default=100).
unknown_token – [in] String to return when a token is not found in the vocab; if 'unknown_token' is an empty string, the token itself is returned instead (default='[UNK]').
lower_case – [in] If true, apply the CaseFold, NormalizeUTF8 (NFD mode) and RegexReplace operations to the input text to fold it to lower case and strip accent characters. If false, only apply the NormalizeUTF8 operation, in the mode given by 'normalize_form', to the input text (default=false).
keep_whitespace – [in] If true, whitespace is kept in the output tokens (default=false).
normalize_form – [in] The normalization form to apply; only effective when 'lower_case' is false. See NormalizeUTF8 for details (default=NormalizeForm::kNone).
preserve_unused_token – [in] If true, special tokens such as '[CLS]', '[SEP]', '[UNK]', '[PAD]' and '[MASK]' are not split (default=true).
with_offsets – [in] Whether to output the offsets of tokens (default=false).
Example

/* Define operations */
std::vector<std::string> list = {"a", "b", "c", "d"};
std::shared_ptr<Vocab> vocab = std::make_shared<Vocab>();
Status s = Vocab::BuildFromVector(list, {}, true, &vocab);
auto tokenizer_op = text::BertTokenizer(vocab);

/* dataset is an instance of Dataset object */
dataset = dataset->Map({tokenizer_op},   // operations
                       {"text"});        // input columns
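For illustration, a minimal sketch of constructing the tokenizer with every parameter spelled out. The argument order is assumed to follow the parameter list above and should be verified against the declaration in text.h:

/* Assumed argument order: vocab, suffix_indicator, max_bytes_per_token,
   unknown_token, lower_case, keep_whitespace, normalize_form,
   preserve_unused_token, with_offsets */
auto full_op = text::BertTokenizer(vocab, "##", 100, "[UNK]",
                                   /*lower_case=*/false,
                                   /*keep_whitespace=*/false,
                                   NormalizeForm::kNfkc,  // takes effect because lower_case is false
                                   /*preserve_unused_token=*/true,
                                   /*with_offsets=*/false);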
~BertTokenizer() = default
Destructor.
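When 'with_offsets' is true, the operation outputs token offsets alongside the tokens. The sketch below maps with explicit output columns; the column names 'token', 'offsets_start' and 'offsets_limit' are assumptions carried over from the Python API, not confirmed by this page:

/* Hypothetical sketch: with_offsets=true; output column names are assumed. */
auto offsets_op = text::BertTokenizer(vocab, "##", 100, "[UNK]", false, false,
                                      NormalizeForm::kNone, true,
                                      /*with_offsets=*/true);
dataset = dataset->Map({offsets_op},                                  // operations
                       {"text"},                                      // input columns
                       {"token", "offsets_start", "offsets_limit"});  // output columns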