mindspore.dataset.text.transforms.BertTokenizer

class mindspore.dataset.text.transforms.BertTokenizer(vocab, suffix_indicator='##', max_bytes_per_token=100, unknown_token='[UNK]', lower_case=False, keep_whitespace=False, normalization_form=NormalizeForm.NONE, preserve_unused_token=True, with_offsets=False)[source]

Tokenizer used for BERT text processing.

Note

BertTokenizer is not supported on the Windows platform yet.

Parameters
  • vocab (Vocab) – A vocabulary object.

  • suffix_indicator (str, optional) – Used to show that the subword is the last part of a word (default='##').

  • max_bytes_per_token (int, optional) – Tokens exceeding this length will not be further split (default=100).

  • unknown_token (str, optional) – When an unknown token is found, return the token directly if unknown_token is an empty string, else return unknown_token instead (default='[UNK]').

  • lower_case (bool, optional) – If True, apply CaseFold, NormalizeUTF8 with NFD mode, RegexReplace operation on input text to fold the text to lower case and strip accented characters. If False, only apply NormalizeUTF8 operation with the specified mode on input text (default=False).

  • keep_whitespace (bool, optional) – If True, whitespace will be kept in the output tokens (default=False).

  • normalization_form (NormalizeForm, optional) – Specifies the Unicode normalization mode, only effective when lower_case is False. See NormalizeUTF8 for details (default=NormalizeForm.NONE).

  • preserve_unused_token (bool, optional) – If True, do not split special tokens like '[CLS]', '[SEP]', '[UNK]', '[PAD]', '[MASK]' (default=True).

  • with_offsets (bool, optional) – Whether or not to output the offsets of tokens (default=False).
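
The interaction between suffix_indicator and unknown_token follows the greedy longest-match-first subword lookup used by WordPiece-style tokenizers. The sketch below is an illustrative plain-Python approximation of that matching, not MindSpore's implementation; the `wordpiece` helper and its vocabulary are hypothetical names introduced here for illustration.

```python
# Illustrative sketch (plain Python, NOT MindSpore's implementation) of the
# greedy longest-match-first subword lookup behind suffix_indicator and
# unknown_token: a word is split into the longest vocabulary pieces available,
# and every piece after the first is looked up with the suffix_indicator prefix.
def wordpiece(word, vocab, suffix_indicator="##", unknown_token="[UNK]"):
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = suffix_indicator + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No match at all: the whole word maps to unknown_token,
            # or to itself when unknown_token is an empty string.
            return [unknown_token] if unknown_token else [word]
        pieces.append(piece)
        start = end
    return pieces

vocab = {"make", "work", "hour", "##s", "##ing"}
print(wordpiece("makes", vocab))    # ['make', '##s']
print(wordpiece("working", vocab))  # ['work', '##ing']
print(wordpiece("oops", vocab))     # ['[UNK]']
```

With unknown_token="" the last call would return ['oops'] unchanged, matching the unknown_token parameter description above.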

Examples

>>> import mindspore.dataset.text as text
>>> from mindspore.dataset.text import NormalizeForm
>>>
>>> # If with_offsets=False, the output is one column {["text", dtype=str]}
>>> vocab_list = ["床", "前", "明", "月", "光", "疑", "是", "地", "上", "霜", "举", "头", "望", "低",
...               "思", "故", "乡","繁", "體", "字", "嘿", "哈", "大", "笑", "嘻", "i", "am", "mak",
...               "make", "small", "mistake", "##s", "during", "work", "##ing", "hour", "😀", "😃",
...               "😄", "😁", "+", "/", "-", "=", "12", "28", "40", "16", " ", "I", "[CLS]", "[SEP]",
...               "[UNK]", "[PAD]", "[MASK]", "[unused1]", "[unused10]"]
>>> vocab = text.Vocab.from_list(vocab_list)
>>> tokenizer_op = text.BertTokenizer(vocab=vocab, suffix_indicator='##', max_bytes_per_token=100,
...                                   unknown_token='[UNK]', lower_case=False, keep_whitespace=False,
...                                   normalization_form=NormalizeForm.NONE, preserve_unused_token=True,
...                                   with_offsets=False)
>>> text_file_dataset = text_file_dataset.map(operations=tokenizer_op)
>>> # If with_offsets=True, the output is three columns {["token", dtype=str],
>>> #                                                     ["offsets_start", dtype=uint32],
>>> #                                                     ["offsets_limit", dtype=uint32]}
>>> tokenizer_op = text.BertTokenizer(vocab=vocab, suffix_indicator='##', max_bytes_per_token=100,
...                                   unknown_token='[UNK]', lower_case=False, keep_whitespace=False,
...                                   normalization_form=NormalizeForm.NONE, preserve_unused_token=True,
...                                   with_offsets=True)
>>> text_file_dataset_1 = text_file_dataset_1.map(operations=tokenizer_op, input_columns=["text"],
...                                               output_columns=["token", "offsets_start",
...                                                               "offsets_limit"],
...                                               column_order=["token", "offsets_start",
...                                                             "offsets_limit"])
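
When with_offsets=True, each token is reported together with the start index and one-past-the-end index of the span it covers in the input. The following is a rough plain-Python sketch of that bookkeeping, assuming character indices and an ASCII input; the `with_offsets` helper is a hypothetical name introduced here for illustration, not part of the MindSpore API.

```python
# Hypothetical illustration of offsets_start / offsets_limit: for each token,
# locate the span it covers in the source string, stripping the "##" suffix
# indicator before matching. This is a sketch, not MindSpore's implementation.
def with_offsets(tokens, source, suffix_indicator="##"):
    out, pos = [], 0
    for tok in tokens:
        piece = tok[len(suffix_indicator):] if tok.startswith(suffix_indicator) else tok
        start = source.index(piece, pos)   # search forward from the last span
        out.append((tok, start, start + len(piece)))
        pos = start + len(piece)
    return out

print(with_offsets(["make", "##s"], "makes"))
# [('make', 0, 4), ('##s', 4, 5)]
```

Here offsets_start would be [0, 4] and offsets_limit [4, 5] for the two tokens.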