mindspore.dataset.text.transforms.BasicTokenizer

class mindspore.dataset.text.transforms.BasicTokenizer(lower_case=False, keep_whitespace=False, normalization_form=NormalizeForm.NONE, preserve_unused_token=True, with_offsets=False)[source]

Tokenize a scalar tensor of UTF-8 string according to the specified rules.

Note

BasicTokenizer is not supported on the Windows platform yet.

Parameters
  • lower_case (bool, optional) – If True, apply CaseFold, NormalizeUTF8 (NFD mode) and RegexReplace operations on the input text to fold it to lower case and strip accent characters. If False, only apply the NormalizeUTF8 operation with the specified normalization_form on the input text (default=False). The effect is illustrated in the eager-mode sketch after this parameter list.

  • keep_whitespace (bool, optional) – If True, whitespace will be kept in the output tokens (default=False).

  • normalization_form (NormalizeForm, optional) –

    Used to specify a specific normalize mode (default=NormalizeForm.NONE). This is only effective when lower_case is False. It can be any of [NormalizeForm.NONE, NormalizeForm.NFC, NormalizeForm.NFKC, NormalizeForm.NFD, NormalizeForm.NFKD].

    • NormalizeForm.NONE, do nothing to the input string tensor.

    • NormalizeForm.NFC, normalize with Normalization Form C.

    • NormalizeForm.NFKC, normalize with Normalization Form KC.

    • NormalizeForm.NFD, normalize with Normalization Form D.

    • NormalizeForm.NFKD, normalize with Normalization Form KD.

  • preserve_unused_token (bool, optional) – If True, do not split special tokens like ‘[CLS]’, ‘[SEP]’, ‘[UNK]’, ‘[PAD]’, ‘[MASK]’ (default=True).

  • with_offsets (bool, optional) – Whether or not to output the offsets of tokens (default=False).
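
The interaction of lower_case, keep_whitespace and preserve_unused_token is easiest to see in eager mode, where the operation can be called directly on a Python string. This is a minimal sketch under stated assumptions: direct eager invocation is assumed to be available in your MindSpore build (and, per the note above, not on Windows), and the sample sentence and the commented result are purely illustrative.

>>> import mindspore.dataset.text as text
>>>
>>> # Fold the text to lower case, strip accents and keep special tokens whole.
>>> eager_tokenizer = text.BasicTokenizer(lower_case=True,
...                                       keep_whitespace=False,
...                                       preserve_unused_token=True)
>>> # Eager-mode call on a plain string (assumption: the op supports direct __call__).
>>> tokens = eager_tokenizer("[CLS] Héllo  World [SEP]")
>>> # Illustrative expectation: something like ['[CLS]', 'hello', 'world', '[SEP]']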

Examples

>>> import mindspore.dataset.text as text
>>> from mindspore.dataset.text import NormalizeForm
>>>
>>> # If with_offsets=False, output one column {["text", dtype=str]} by default
>>> tokenizer_op = text.BasicTokenizer(lower_case=False,
...                                    keep_whitespace=False,
...                                    normalization_form=NormalizeForm.NONE,
...                                    preserve_unused_token=True,
...                                    with_offsets=False)
>>> text_file_dataset = text_file_dataset.map(operations=tokenizer_op)
>>> # If with_offsets=True, then output three columns {["token", dtype=str],
>>> #                                                   ["offsets_start", dtype=uint32],
>>> #                                                   ["offsets_limit", dtype=uint32]}
>>> tokenizer_op = text.BasicTokenizer(lower_case=False,
...                                    keep_whitespace=False,
...                                    normalization_form=NormalizeForm.NONE,
...                                    preserve_unused_token=True,
...                                    with_offsets=True)
>>> text_file_dataset_1 = text_file_dataset_1.map(operations=tokenizer_op, input_columns=["text"],
...                                               output_columns=["token", "offsets_start",
...                                                               "offsets_limit"],
...                                               column_order=["token", "offsets_start",
...                                                             "offsets_limit"])
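
To inspect the three columns produced when with_offsets=True, a small in-memory dataset can stand in for text_file_dataset_1. This is a minimal sketch under stated assumptions: NumpySlicesDataset and create_dict_iterator are standard mindspore.dataset APIs, but the sample sentences are arbitrary and the printed values are not asserted here.

>>> import mindspore.dataset as ds
>>> import mindspore.dataset.text as text
>>>
>>> # Illustrative one-column dataset holding two short sentences.
>>> sample_dataset = ds.NumpySlicesDataset(data=["Welcome to Beijing!", "[CLS] Hello [SEP]"],
...                                        column_names=["text"])
>>> tokenizer_op = text.BasicTokenizer(preserve_unused_token=True, with_offsets=True)
>>> sample_dataset = sample_dataset.map(operations=tokenizer_op, input_columns=["text"],
...                                     output_columns=["token", "offsets_start", "offsets_limit"],
...                                     column_order=["token", "offsets_start", "offsets_limit"])
>>> for row in sample_dataset.create_dict_iterator(output_numpy=True):
...     # Each row carries the tokens plus, for every token, its start and end
...     # position within the original string.
...     print(row["token"], row["offsets_start"], row["offsets_limit"])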