mindspore.dataset.text.BertTokenizer

class mindspore.dataset.text.BertTokenizer(vocab, suffix_indicator='##', max_bytes_per_token=100, unknown_token='[UNK]', lower_case=False, keep_whitespace=False, normalization_form=NormalizeForm.NONE, preserve_unused_token=True, with_offsets=False)[source]

Tokenizer used for BERT text processing.

Note

BertTokenizer is not supported on the Windows platform yet.

Parameters
  • vocab (Vocab) – Vocabulary used to look up words.

  • suffix_indicator (str, optional) – Prefix used to indicate subword suffixes. Default: ‘##’.

  • max_bytes_per_token (int, optional) – The maximum number of bytes per token; words exceeding this length will not be split. Default: 100.

  • unknown_token (str, optional) – The token used to represent unknown words. If set to an empty string, the unknown word itself is returned as the output; otherwise, the specified string is returned in its place. Default: ‘[UNK]’.

  • lower_case (bool, optional) – Whether to lowercase the text. If True, the text is folded to lower case and accented characters are stripped. If False, only the normalization specified by normalization_form is applied. Default: False.

  • keep_whitespace (bool, optional) – If True, the whitespace will be kept in the output. Default: False.

  • normalization_form (NormalizeForm, optional) –

    The Unicode normalization form to apply, only valid when lower_case is False. Can be NormalizeForm.NONE, NormalizeForm.NFC, NormalizeForm.NFKC, NormalizeForm.NFD or NormalizeForm.NFKD. Default: NormalizeForm.NONE.

    • NormalizeForm.NONE, no normalization.

    • NormalizeForm.NFC, Canonical Decomposition, followed by Canonical Composition.

    • NormalizeForm.NFKC, Compatibility Decomposition, followed by Canonical Composition.

    • NormalizeForm.NFD, Canonical Decomposition.

    • NormalizeForm.NFKD, Compatibility Decomposition.

  • preserve_unused_token (bool, optional) – Whether to preserve special tokens. If True, will not split special tokens like ‘[CLS]’, ‘[SEP]’, ‘[UNK]’, ‘[PAD]’, ‘[MASK]’. Default: True.

  • with_offsets (bool, optional) – Whether to return the offsets of tokens. Default: False.

Raises

  • TypeError – If vocab is not of type text.Vocab.

  • TypeError – If suffix_indicator is not of type str.

  • TypeError – If max_bytes_per_token is not of type int.

  • ValueError – If max_bytes_per_token is negative.

  • TypeError – If unknown_token is not of type str.

  • TypeError – If lower_case is not of type bool.

  • TypeError – If keep_whitespace is not of type bool.

  • TypeError – If normalization_form is not of type NormalizeForm.

  • TypeError – If preserve_unused_token is not of type bool.

  • TypeError – If with_offsets is not of type bool.

Supported Platforms:

CPU

Examples

>>> import mindspore.dataset as ds
>>> import mindspore.dataset.text as text
>>> from mindspore.dataset.text import NormalizeForm
>>>
>>> # Create the source datasets from text files (replace the paths with your own data files)
>>> text_file_dataset = ds.TextFileDataset(dataset_files="/path/to/text_file")
>>> text_file_dataset_1 = ds.TextFileDataset(dataset_files="/path/to/text_file_1")
>>>
>>> # If with_offsets=False, the output is a single column {["text", dtype=str]}
>>> vocab_list = ["床", "前", "明", "月", "光", "疑", "是", "地", "上", "霜", "举", "头", "望", "低",
...               "思", "故", "乡","繁", "體", "字", "嘿", "哈", "大", "笑", "嘻", "i", "am", "mak",
...               "make", "small", "mistake", "##s", "during", "work", "##ing", "hour", "😀", "😃",
...               "😄", "😁", "+", "/", "-", "=", "12", "28", "40", "16", " ", "I", "[CLS]", "[SEP]",
...               "[UNK]", "[PAD]", "[MASK]", "[unused1]", "[unused10]"]
>>> vocab = text.Vocab.from_list(vocab_list)
>>> tokenizer_op = text.BertTokenizer(vocab=vocab, suffix_indicator='##', max_bytes_per_token=100,
...                                   unknown_token='[UNK]', lower_case=False, keep_whitespace=False,
...                                   normalization_form=NormalizeForm.NONE, preserve_unused_token=True,
...                                   with_offsets=False)
>>> text_file_dataset = text_file_dataset.map(operations=tokenizer_op)
>>> # If with_offsets=True, the output is three columns {["token", dtype=str],
>>> #                                                  ["offsets_start", dtype=uint32],
>>> #                                                  ["offsets_limit", dtype=uint32]}
>>> tokenizer_op = text.BertTokenizer(vocab=vocab, suffix_indicator='##', max_bytes_per_token=100,
...                                   unknown_token='[UNK]', lower_case=False, keep_whitespace=False,
...                                   normalization_form=NormalizeForm.NONE, preserve_unused_token=True,
...                                   with_offsets=True)
>>> text_file_dataset_1 = text_file_dataset_1.map(operations=tokenizer_op, input_columns=["text"],
...                                               output_columns=["token", "offsets_start",
...                                                               "offsets_limit"])