mindspore.dataset.text.BasicTokenizer

class mindspore.dataset.text.BasicTokenizer(lower_case=False, keep_whitespace=False, normalization_form=NormalizeForm.NONE, preserve_unused_token=True, with_offsets=False)

Tokenize the input UTF-8 encoded string by specific rules.

Note

BasicTokenizer is not supported on the Windows platform yet.

Parameters
  • lower_case (bool, optional) – Whether to perform lowercase processing on the text. If True, the text will be folded to lower case and accents will be stripped. If False, only normalization will be performed on the text, with the mode specified by normalization_form. Default: False.

  • keep_whitespace (bool, optional) – If True, the whitespace will be kept in the output. Default: False.

  • normalization_form (NormalizeForm, optional) – The desired normalization form. See NormalizeForm for details on optional values. Default: NormalizeForm.NONE.

  • preserve_unused_token (bool, optional) – Whether to preserve special tokens. If True, will not split special tokens like '[CLS]', '[SEP]', '[UNK]', '[PAD]', '[MASK]'. Default: True.

  • with_offsets (bool, optional) – Whether to output the start and end offsets of each token in the original string. Default: False.

Supported Platforms:

CPU

Examples

>>> import mindspore.dataset as ds
>>> import mindspore.dataset.text as text
>>> from mindspore.dataset.text import NormalizeForm
>>>
>>> # Use the transform in dataset pipeline mode
>>> numpy_slices_dataset = ds.NumpySlicesDataset(data=['Welcome     To   BeiJing!'], column_names=["text"])
>>>
>>> # 1) If with_offsets=False (the default), the output is one column {["text", dtype=str]}
>>> tokenizer_op = text.BasicTokenizer(lower_case=False,
...                                    keep_whitespace=False,
...                                    normalization_form=NormalizeForm.NONE,
...                                    preserve_unused_token=True,
...                                    with_offsets=False)
>>> numpy_slices_dataset = numpy_slices_dataset.map(operations=tokenizer_op)
>>> for item in numpy_slices_dataset.create_dict_iterator(num_epochs=1, output_numpy=True):
...     print(item["text"])
['Welcome' 'To' 'BeiJing' '!']
>>>
>>> # 2) If with_offsets=True, the output is three columns {["token", dtype=str],
>>> #                                                       ["offsets_start", dtype=uint32],
>>> #                                                       ["offsets_limit", dtype=uint32]}
>>> numpy_slices_dataset = ds.NumpySlicesDataset(data=['Welcome     To   BeiJing!'], column_names=["text"])
>>> tokenizer_op = text.BasicTokenizer(lower_case=False,
...                                    keep_whitespace=False,
...                                    normalization_form=NormalizeForm.NONE,
...                                    preserve_unused_token=True,
...                                    with_offsets=True)
>>> numpy_slices_dataset = numpy_slices_dataset.map(
...     operations=tokenizer_op, input_columns=["text"],
...     output_columns=["token", "offsets_start", "offsets_limit"])
>>> for item in numpy_slices_dataset.create_dict_iterator(num_epochs=1, output_numpy=True):
...     print(item["token"], item["offsets_start"], item["offsets_limit"])
['Welcome' 'To' 'BeiJing' '!'] [ 0 12 17 24] [ 7 14 24 25]
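>>>
>>> # 3) A hedged sketch (illustrative variable names, output not shown):
>>> #    with keep_whitespace=True, the whitespace between words is kept in the output as well.
>>> whitespace_dataset = ds.NumpySlicesDataset(data=['Welcome     To   BeiJing!'], column_names=["text"])
>>> whitespace_op = text.BasicTokenizer(keep_whitespace=True)
>>> whitespace_dataset = whitespace_dataset.map(operations=whitespace_op)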
>>>
>>> # Use the transform in eager mode
>>> data = 'Welcome     To   BeiJing!'
>>> output = text.BasicTokenizer()(data)
>>> print(output)
['Welcome' 'To' 'BeiJing' '!']
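>>>
>>> # A hedged sketch of the remaining options (illustrative, outputs not shown):
>>> # lower_case=True folds the text to lower case and strips accents, while
>>> # preserve_unused_token=True keeps special tokens such as '[CLS]' and '[SEP]' unsplit.
>>> lowered = text.BasicTokenizer(lower_case=True)('Héllo   World!')
>>> special = text.BasicTokenizer(preserve_unused_token=True)('[CLS] Hello [SEP]')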