mindspore.dataset.text.BertTokenizer
- class mindspore.dataset.text.BertTokenizer(vocab, suffix_indicator='##', max_bytes_per_token=100, unknown_token='[UNK]', lower_case=False, keep_whitespace=False, normalization_form=NormalizeForm.NONE, preserve_unused_token=True, with_offsets=False)[source]
Tokenizer used for BERT text processing.
Note
BertTokenizer is not supported on the Windows platform yet.
- Parameters
vocab (Vocab) – Vocabulary used to look up words.
suffix_indicator (str, optional) – Prefix flag used to indicate subword suffixes (illustrated in the sketch following this parameter list). Default: '##'.
max_bytes_per_token (int, optional) – The maximum length of tokenization; words exceeding this length will not be split. Default: 100.
unknown_token (str, optional) – The output for unknown words. When set to an empty string, the corresponding unknown word is returned directly as the output; otherwise, the set string is returned. Default: '[UNK]'.
lower_case (bool, optional) – Whether to perform lowercase processing on the text. If True, the text is folded to lower case and accented characters are stripped. If False, only normalization is performed on the text, with the mode specified by normalization_form. Default: False.
keep_whitespace (bool, optional) – If True, whitespace will be kept in the output. Default: False.
normalization_form (NormalizeForm, optional) – The desired normalization form. See NormalizeForm for details on optional values. Default: NormalizeForm.NONE.
preserve_unused_token (bool, optional) – Whether to preserve special tokens. If True, special tokens such as '[CLS]', '[SEP]', '[UNK]', '[PAD]' and '[MASK]' will not be split. Default: True.
with_offsets (bool, optional) – Whether to output the start and end offsets of each token in the original string. Default: False.
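For illustration, here is a minimal eager-mode sketch of how suffix_indicator and unknown_token shape the output. The tiny vocabulary below is assumed purely for demonstration, and the printed results show the expected WordPiece-style behavior rather than verbatim documentation output.

>>> import mindspore.dataset.text as text
>>> # Assumed toy vocabulary: the root 'make' and the suffix piece '##s' are present,
>>> # but the full word 'makes' is not.
>>> toy_vocab = text.Vocab.from_list(["make", "##s", "[UNK]"])
>>> toy_tokenizer = text.BertTokenizer(vocab=toy_vocab, suffix_indicator='##', unknown_token='[UNK]')
>>> print(toy_tokenizer("makes"))    # expected: the word splits into a root plus a '##'-prefixed suffix
['make' '##s']
>>> print(toy_tokenizer("oops"))     # expected: a word with no vocabulary match maps to unknown_token
['[UNK]']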
- Raises
TypeError – If vocab is not of type mindspore.dataset.text.Vocab.
TypeError – If suffix_indicator is not of type str.
TypeError – If max_bytes_per_token is not of type int.
ValueError – If max_bytes_per_token is negative.
TypeError – If unknown_token is not of type str.
TypeError – If lower_case is not of type bool.
TypeError – If keep_whitespace is not of type bool.
TypeError – If normalization_form is not of type NormalizeForm.
TypeError – If preserve_unused_token is not of type bool.
TypeError – If with_offsets is not of type bool.
- Supported Platforms:
CPU
Examples
>>> import numpy as np
>>> import mindspore.dataset as ds
>>> import mindspore.dataset.text as text
>>> from mindspore.dataset.text import NormalizeForm
>>>
>>> # Use the transform in dataset pipeline mode
>>> numpy_slices_dataset = ds.NumpySlicesDataset(data=["床前明月光"], column_names=["text"])
>>>
>>> # 1) If with_offsets=False, default output one column {["text", dtype=str]}
>>> vocab_list = ["床", "前", "明", "月", "光", "疑", "是", "地", "上", "霜", "举", "头", "望", "低",
...               "思", "故", "乡", "繁", "體", "字", "嘿", "哈", "大", "笑", "嘻", "i", "am", "mak",
...               "make", "small", "mistake", "##s", "during", "work", "##ing", "hour", "+", "/",
...               "-", "=", "12", "28", "40", "16", " ", "I", "[CLS]", "[SEP]", "[UNK]", "[PAD]", "[MASK]"]
>>> vocab = text.Vocab.from_list(vocab_list)
>>> tokenizer_op = text.BertTokenizer(vocab=vocab, suffix_indicator='##', max_bytes_per_token=100,
...                                   unknown_token='[UNK]', lower_case=False, keep_whitespace=False,
...                                   normalization_form=NormalizeForm.NONE, preserve_unused_token=True,
...                                   with_offsets=False)
>>> numpy_slices_dataset = numpy_slices_dataset.map(operations=tokenizer_op)
>>> for item in numpy_slices_dataset.create_dict_iterator(num_epochs=1, output_numpy=True):
...     print(item["text"])
['床' '前' '明' '月' '光']
>>>
>>> # 2) If with_offsets=True, then output three columns {["token", dtype=str],
>>> #                                                     ["offsets_start", dtype=uint32],
>>> #                                                     ["offsets_limit", dtype=uint32]}
>>> numpy_slices_dataset = ds.NumpySlicesDataset(data=["床前明月光"], column_names=["text"])
>>> tokenizer_op = text.BertTokenizer(vocab=vocab, suffix_indicator='##', max_bytes_per_token=100,
...                                   unknown_token='[UNK]', lower_case=False, keep_whitespace=False,
...                                   normalization_form=NormalizeForm.NONE, preserve_unused_token=True,
...                                   with_offsets=True)
>>> numpy_slices_dataset = numpy_slices_dataset.map(
...     operations=tokenizer_op,
...     input_columns=["text"],
...     output_columns=["token", "offsets_start", "offsets_limit"])
>>> for item in numpy_slices_dataset.create_dict_iterator(num_epochs=1, output_numpy=True):
...     print(item["token"], item["offsets_start"], item["offsets_limit"])
['床' '前' '明' '月' '光'] [ 0  3  6  9 12] [ 3  6  9 12 15]
>>>
>>> # Use the transform in eager mode
>>> data = "床前明月光"
>>> vocab = text.Vocab.from_list(vocab_list)
>>> tokenizer_op = text.BertTokenizer(vocab=vocab)
>>> output = tokenizer_op(data)
>>> print(output)
['床' '前' '明' '月' '光']
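In example 2 above, offsets_start and offsets_limit advance in steps of 3, which is consistent with the offsets indexing positions in the UTF-8 byte encoding of the input rather than character positions: each of the five CJK characters occupies 3 bytes. A quick plain-Python check of that arithmetic (no MindSpore required), reproducing the offsets shown above:

>>> sample = "床前明月光"
>>> [len(ch.encode("utf-8")) for ch in sample]
[3, 3, 3, 3, 3]
>>> # Cumulative byte positions match offsets_start; the final position matches the last offsets_limit.
>>> starts, pos = [], 0
>>> for ch in sample:
...     starts.append(pos)
...     pos += len(ch.encode("utf-8"))
>>> starts, pos
([0, 3, 6, 9, 12], 15)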