mindspore.dataset.text.transforms.BasicTokenizer
- class mindspore.dataset.text.transforms.BasicTokenizer(lower_case=False, keep_whitespace=False, normalization_form=NormalizeForm.NONE, preserve_unused_token=True, with_offsets=False)[source]
Tokenize a scalar tensor of UTF-8 string by specific rules.
Note
BasicTokenizer is not supported on Windows platform yet.
- Parameters
lower_case (bool, optional) – If True, apply CaseFold, NormalizeUTF8 with NFD mode and RegexReplace operations to the input text to fold it to lower case and strip accent characters. If False, only apply the NormalizeUTF8 operation with the specified mode to the input text (default=False).
keep_whitespace (bool, optional) – If True, whitespace will be kept in the output tokens (default=False).
normalization_form (NormalizeForm, optional) – The desired normalization form. This is only effective when lower_case is False. See NormalizeUTF8 for details (default=NormalizeForm.NONE).
preserve_unused_token (bool, optional) – If True, do not split special tokens like '[CLS]', '[SEP]', '[UNK]', '[PAD]' and '[MASK]' (default=True).
with_offsets (bool, optional) – Whether or not to output the offsets of tokens (default=False).
Examples
>>> import mindspore.dataset.text as text
>>> from mindspore.dataset.text import NormalizeForm
>>>
>>> # If with_offsets=False, the default output is one column {["text", dtype=str]}
>>> tokenizer_op = text.BasicTokenizer(lower_case=False,
...                                    keep_whitespace=False,
...                                    normalization_form=NormalizeForm.NONE,
...                                    preserve_unused_token=True,
...                                    with_offsets=False)
>>> text_file_dataset = text_file_dataset.map(operations=tokenizer_op)
>>> # If with_offsets=True, the output is three columns {["token", dtype=str],
>>> #                                                    ["offsets_start", dtype=uint32],
>>> #                                                    ["offsets_limit", dtype=uint32]}
>>> tokenizer_op = text.BasicTokenizer(lower_case=False,
...                                    keep_whitespace=False,
...                                    normalization_form=NormalizeForm.NONE,
...                                    preserve_unused_token=True,
...                                    with_offsets=True)
>>> text_file_dataset_1 = text_file_dataset_1.map(operations=tokenizer_op, input_columns=["text"],
...                                               output_columns=["token", "offsets_start",
...                                                               "offsets_limit"],
...                                               column_order=["token", "offsets_start",
...                                                             "offsets_limit"])
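A further illustrative sketch, assuming the same text_file_dataset pipeline as above: with lower_case=True the tokenizer folds case and strips accents (so normalization_form is ignored), and with keep_whitespace=True whitespace is kept in the output tokens.
>>> # Sketch (not from the original example): fold to lower case, strip accents,
>>> # and keep whitespace in the output tokens; normalization_form is omitted
>>> # because it only takes effect when lower_case is False.
>>> tokenizer_op = text.BasicTokenizer(lower_case=True,
...                                    keep_whitespace=True,
...                                    preserve_unused_token=True,
...                                    with_offsets=False)
>>> text_file_dataset = text_file_dataset.map(operations=tokenizer_op)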